CODAIT / covid-notebooks

Jupyter notebooks that analyze COVID-19 time series data
Apache License 2.0

Run curve-fitting code in parallel using Ray #29

Closed frreiss closed 3 years ago

frreiss commented 3 years ago

The notebook fit_us_data.ipynb is by far the most CPU-intensive portion of our pipeline.

This PR modifies fit_us_data.ipynb to use Ray to run the curve-fitting operations in parallel. With this change, the notebook runs about 2-3 times as fast on a laptop. The speedup should be greater on larger machines with more cores.
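For readers unfamiliar with the pattern, here is a minimal sketch of fanning independent curve fits out to Ray tasks. It is not the code in fit_us_data.ipynb: the names `fit_exponential`, `fit_all_parallel`, and `series_by_county` are hypothetical, and the log-linear exponential fit is only a stand-in for whatever curves the notebook actually fits.

```python
import math


def fit_exponential(counts):
    """Least-squares fit of counts[t] ~ a * exp(b * t) on a log scale.

    Stand-in for the notebook's real fitting routine (an assumption).
    Returns (a, b). Requires at least two points, all positive.
    """
    n = len(counts)
    ts = range(n)
    logs = [math.log(c) for c in counts]
    t_mean = sum(ts) / n
    log_mean = sum(logs) / n
    # Slope and intercept of the least-squares line through (t, log(count)).
    b = (sum((t - t_mean) * (y - log_mean) for t, y in zip(ts, logs))
         / sum((t - t_mean) ** 2 for t in ts))
    a = math.exp(log_mean - b * t_mean)
    return a, b


def fit_all_parallel(series_by_county):
    """Fit each series in its own Ray task; returns {key: (a, b)}."""
    import ray  # imported here so fit_exponential also works without Ray

    ray.init(ignore_reinit_error=True)
    remote_fit = ray.remote(fit_exponential)
    # Launch every task first, then block once on the whole batch.
    keys = list(series_by_county)
    futures = [remote_fit.remote(series_by_county[k]) for k in keys]
    return dict(zip(keys, ray.get(futures)))
```

Because each fit depends only on its own time series, the tasks need no coordination, which is why this kind of workload tends to scale with the number of available cores.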

I also updated requirements.txt to add a dependency on ray and to remove a dependency on a library that is no longer required and that conflicts with Ray. In addition, I added text-extensions-for-pandas to requirements.txt and removed the code in env.sh and Dockerfile that installed that package from GitHub. The Dockerfile now reads package coordinates from requirements.txt.

I'm also including a small change to the function collapse_time_series() in util.py that significantly speeds up clean_us_data.ipynb and analyze_fit_us_data.ipynb. The change switches to a different Pandas API for iterating over the elements of a DataFrame's index.
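The description does not say which Pandas API was swapped, so the following is only an illustration of the general kind of change, not the actual edit to collapse_time_series(). The DataFrame, its index names (`FIPS`, `Date`), and its column (`Confirmed`) are made up for the example.

```python
import pandas as pd

# Hypothetical data: one row per (county FIPS code, date) pair.
df = pd.DataFrame(
    {"Confirmed": [1, 3, 2, 5]},
    index=pd.MultiIndex.from_tuples(
        [(1001, "2020-03-01"), (1001, "2020-03-02"),
         (1003, "2020-03-01"), (1003, "2020-03-02")],
        names=["FIPS", "Date"],
    ),
)

# Slower pattern: one positional lookup into the index per row.
slow_keys = [df.index[i][0] for i in range(len(df.index))]

# Typically faster: extract the whole level once, then iterate over
# plain Python values.
fast_keys = df.index.get_level_values("FIPS").to_list()
```

Both lists hold the same keys; the second form avoids building a tuple for every row, which can matter when the index has many entries.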
