Develop Work Plan

lossyrob commented 4 years ago

To make the most of our limited time, and that of everyone who wants to collaborate on this project, we should have a clear work plan whose tasks can be parallelized and then combined to reach the project's goals. This issue outlines a broad approach to the work, and is an iteration on Dave's How to Help section. The work plan here will be translated into issues that an individual interested in helping out can take on or contribute to effectively.

Some high level goals of the project as I understand them so far:

A. Publish a public dataset that describes the US healthcare system capacity.
B. Perform an analysis of the healthcare system capacity as compared to disease spread forecasts and publish data about where there is a capacity gap.
C. Visualize the result of the analysis as well as the disease forecast and healthcare system capacity data in a format that supports healthcare system preparedness and resourcing decisions.

In order to make that happen, I can see work being split up into 4 groupings: (1) Data Gathering, Cleaning & Cataloging; (2) Data Analytics; (3) Visualization; and (4) Project Direction. Someone may choose to work on tasks that contribute to multiple work groups at once; this is only meant to be a logical division of the work that will help detail and potentially parallelize components of the project that will combine to accomplish the project goals.

Data Gathering, Cleaning & Cataloging

This group of work includes:

Contributing to a document that lists datasets that may be helpful to this effort. This will be a staging ground for exploration and description of datasets that could be useful. If there is already an effort that does this step, perhaps we can just document the best way to utilize it. Data should be cataloged by the potential need it fills.
Exploration of the datasets in this document that results in a more detailed description of how to access it, what cleanup needs to be done, what the quality of the data is, etc. This will give information to the data publishing and analytics efforts to quickly be able to utilize relevant datasources.
Data cleaning, combining and formatting: Jupyter notebooks, Python scripts, or other mechanisms to download and clean the data and combine it usefully so that it can be used in the data analytics of this project or published directly for other efforts. For example, combining the current county-level GeoJSON data with other datasets like COVID-19 case counts by day/county/state from https://covidtracking.com/
Data Organization, Documentation and Cataloging: This could be as simple as a markdown file in this repository with links to data files along with their description. Whatever the best way is to present cleaned datasources to other folks on this project as well as outside of this project. This includes data produced by Data Analytics of this effort (e.g. estimated hospital system capacity).
Data Access: What is the best way to access the data we collect or generate? Perhaps a pip-installable python package? Other efforts have exposed data through an API. Perhaps if someone is motivated to allow for this API access, this could be a good task; this feels like a more "nice to have" feature as CSVs hosted on GitHub get us pretty far.
Machine Learning Dataset: This is another "nice to have" or stretch goal activity. Collecting a supervised training and validation dataset based on relevant and high-value tasks would be a way to solicit involvement from ML communities to apply modeling. For instance, generating a feature set per region that has a series of population and demographic features matched with test results over time may present a modeling opportunity for prediction of disease spread per region over time. If such a dataset were generated and we had reason to believe modeling efforts might yield good results, we could try to run a "competition" style call to action for ML participants.
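
As a rough illustration of the "cleaning, combining and formatting" item above, here is a minimal sketch of joining county-level GeoJSON features with daily case counts. It uses only the Python standard library. The property name `fips`, the CSV columns `date`, `fips` and `cases`, the function name `attach_latest_cases`, and all of the sample values are placeholders I made up for illustration; they are not the actual schema of the data in this repository or of any external source.

```python
import csv
import io
import json

# Placeholder inputs: field names and values are invented for illustration only.
COUNTY_GEOJSON = json.dumps({
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature",
         "properties": {"fips": "00001", "name": "Example County A"},
         "geometry": None},
        {"type": "Feature",
         "properties": {"fips": "00002", "name": "Example County B"},
         "geometry": None},
    ],
})

CASES_CSV = """date,fips,cases
2020-03-20,00001,10
2020-03-21,00001,15
2020-03-21,00003,20
"""


def attach_latest_cases(geojson_text, cases_csv_text):
    """Return a FeatureCollection whose features carry the latest case count."""
    latest = {}  # fips -> (date, cases)
    for row in csv.DictReader(io.StringIO(cases_csv_text)):
        fips, date, cases = row["fips"], row["date"], int(row["cases"])
        # ISO 8601 dates compare correctly as plain strings.
        if fips not in latest or date > latest[fips][0]:
            latest[fips] = (date, cases)

    collection = json.loads(geojson_text)
    for feature in collection["features"]:
        props = feature["properties"]
        # Counties with no case rows get None rather than a guessed zero.
        date, cases = latest.get(props["fips"], (None, None))
        props["cases_date"] = date
        props["cases"] = cases
    return collection


merged = attach_latest_cases(COUNTY_GEOJSON, CASES_CSV)
for feature in merged["features"]:
    print(feature["properties"])
```

Keeping FIPS codes as strings avoids losing leading zeros, and leaving unmatched counties as `None` keeps "no data" distinguishable from "zero cases" for whoever consumes the output.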

Data Analytics

Here are three data analytics work streams I see so far:

Estimated Hospital System Capacity

This is the analysis that exists in this repository now. Can we make this estimation better? Is this the right level of detail (county level)?

Since this has been identified as a dataset gap in the community, the product of the analysis should be published as a dataset by this project once it is determined to be of sufficient quality. This would accomplish project goal A.

Epidemiological Modeling

In order to perform the comparison analysis that identifies the care gap, we need an estimate of the affected population over time. More specifically, we want to know the projected number of active cases putting demand on the healthcare system in different locations at different times.

There are several open source approaches to this type of modeling, and the ideal case is to reuse others' work. For instance, perhaps there is an implementation of an SIR model that we could run at a county level based on census demographics to generate a per-county per-timepoint health system stress dataset. Or perhaps there is already someone publishing modeling data at an appropriate aggregation for our analysis that we can just use directly.
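
To make the SIR idea concrete, here is a minimal sketch of the standard SIR compartment model (dS/dt = -beta*S*I/N, dI/dt = beta*S*I/N - gamma*I, dR/dt = gamma*I) stepped forward with simple Euler integration for a single county. The function name `simulate_sir` and the parameter values in the usage lines are placeholders chosen for illustration; they are not fitted estimates for COVID-19, and a real effort would more likely reuse an existing, validated implementation.

```python
def simulate_sir(population, initial_infected, beta, gamma, days, steps_per_day=10):
    """Simulate an SIR model and return one (day, S, I, R) tuple per day.

    beta is the transmission rate per day and gamma the recovery rate per day.
    Euler integration with a fixed sub-daily step is used for simplicity.
    """
    s = float(population - initial_infected)
    i = float(initial_infected)
    r = 0.0
    dt = 1.0 / steps_per_day
    series = [(0, s, i, r)]
    for day in range(1, days + 1):
        for _ in range(steps_per_day):
            new_infections = beta * s * i / population * dt
            new_recoveries = gamma * i * dt
            s -= new_infections
            i += new_infections - new_recoveries
            r += new_recoveries
        series.append((day, s, i, r))
    return series


# Placeholder parameters, for illustration only.
trajectory = simulate_sir(population=100_000, initial_infected=10,
                          beta=0.3, gamma=0.1, days=120)
peak_day, _, peak_active, _ = max(trajectory, key=lambda row: row[2])
print(f"peak active infections: {peak_active:.0f} on day {peak_day}")
```

Running this once per county, with the county population taken from census data, would give the per-county, per-timepoint series of active cases described above. Turning active cases into hospital demand would need an additional hospitalization-rate assumption that is not modeled here.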

A stretch goal would be to generate an ML challenge or competition that could take advantage of community participation to develop a more accurate model. This would rely on the ability to develop the supervised training dataset mentioned in the Data Gathering, Cleaning & Cataloging work group above.

Comparing Capacity vs Forecasts

Once we have a dataset that estimates health system capacity, and the ability to forecast stress over time on the healthcare system, we will be able to identify care gaps. This analysis would seek to help answer the questions Dave posted in the README:

What is the actual critical care capacity in each city or region? How much is that capacity ramping in preparation? How much past 100% capacity will the demand be? How close (or past) the breaking point will we go? How do we minimize this gap as much and as proactively as possible?

This would accomplish project goal B. It is dependent on the ability to produce Epidemiological Models sufficient for the analysis, as well as the Estimated Hospital System Capacity dataset being generated.
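
A minimal sketch of what this comparison could look like once both inputs exist: given an estimated capacity per county and a forecast of demand per county per day, compute utilization (demand as a fraction of capacity) and the shortfall beyond 100% capacity. The function name `capacity_gaps`, the dictionary layout, and the sample numbers are all hypothetical; the real inputs would come from the Estimated Hospital System Capacity dataset and the epidemiological forecasts.

```python
def capacity_gaps(capacity_by_county, demand_by_county_day):
    """Compare forecast demand against capacity for each county and day.

    capacity_by_county: {county_id: beds available}
    demand_by_county_day: {(county_id, day): beds needed}
    Returns rows sorted by county and day, each with utilization and gap.
    """
    rows = []
    for (county, day), demand in sorted(demand_by_county_day.items()):
        capacity = capacity_by_county[county]
        rows.append({
            "county": county,
            "day": day,
            "utilization": demand / capacity,        # 1.0 means 100% of capacity
            "gap": max(0.0, demand - capacity),      # beds short, 0 if within capacity
        })
    return rows


# Invented numbers, for illustration only.
capacity = {"00001": 100.0}
demand = {("00001", 0): 50.0, ("00001", 1): 150.0}
for row in capacity_gaps(capacity, demand):
    print(row)
```

Reporting both utilization and the absolute gap speaks to two of the questions above: how far past 100% capacity demand goes, and how many beds would need to be added to close the gap.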

Visualization

Answers to questions about the healthcare system's capacity and its ability to handle the stresses of the COVID-19 outbreak are only as good as our ability to communicate them effectively. The data visualization component aims to make compelling visualizations that communicate the information that is important and actionable. While information can be used to help in the crisis, it can also add to the noise, increase panic, and otherwise be unhelpful.

This would accomplish project goal C.

Project Direction

Besides building the visualizations and the data that powers them, we will need people to test, validate, document, and determine the usefulness of the tools generated by this project. We also need people who will be able to explain what these tools are and why they are important for personal & community protection and public health decision making at the local, county & state levels. Also, we need people to connect to other open source and open data efforts so that we are contributing to the larger community effort and not duplicating work unnecessarily.

daveluo commented 4 years ago

Thanks @lossyrob, this looks good and covers the bases enough for us to get going on parallelizing work. We can always refine and remix the groupings as we learn more from validation research and start trying to put things together.

I presume now we can start defining subtasks within each grouping? If so, can you demonstrate with a few example tasks how that should be written out and organized within the project? Then I and others can follow your lead.