The notebook `fit_us_data.ipynb` is by far the most CPU-intensive portion of our pipeline.
This PR modifies `fit_us_data.ipynb` to use Ray to run the curve-fitting operations in parallel. After this change, the notebook runs about 2-3 times as fast on a laptop; the speedup should be greater on machines with more cores.
I also modified `requirements.txt` to add a dependency on `ray` and to remove a dependency on a (no longer required) library that conflicts with Ray. In addition, I added `text-extensions-for-pandas` to `requirements.txt` and removed the code in `env.sh` and `Dockerfile` that installed that package from GitHub. Finally, I modified the `Dockerfile` to read package coordinates from `requirements.txt`.
I'm also including a small change to the function `collapse_time_series()` in `util.py` that significantly speeds up `clean_us_data.ipynb` and `analyze_fit_us_data.ipynb`. The change uses a different Pandas API to iterate over the elements of a DataFrame's index.
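The diff itself shows which API call changed; for context, the sketch below illustrates one common pattern of this kind of speedup, where repeated positional lookups into an index are replaced by iterating the index object directly. The DataFrame, column names, and index levels are illustrative, not the actual ones used by `collapse_time_series()`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the (fips, date)-indexed frames in the real pipeline
df = pd.DataFrame(
    {"cases": np.arange(6)},
    index=pd.MultiIndex.from_product(
        [["06001", "06003"], pd.date_range("2020-03-01", periods=3)],
        names=["fips", "date"],
    ),
)

# Slower pattern: one positional lookup per element of the index
slow = [df.index[i] for i in range(len(df.index))]

# Faster pattern: iterate the Index object directly
# (for a MultiIndex this yields the same tuples, in the same order)
fast = list(df.index)
```

Both loops produce identical results; the second avoids a per-element `__getitem__` call into the index, which adds up when the loop body runs once per row of a large frame.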