UT-Covid / episimlab

Framework for development of epidemiological models
https://ut-covid.github.io/episimlab/
BSD 3-Clause "New" or "Revised" License
3 stars 1 forks source link

Faster partitioning with dask and xarray #24

Closed kellypierce closed 3 years ago

kellypierce commented 3 years ago

There were two key bottlenecks in the Partition workflow, resolved as follows:

  1. probabilistic_partition() iterated over rows in a pandas data frame -- much slower than column-wise operations. Refactoring to use a series of data frame merges in pandas_partition() and dask_partition() enables use of faster column-wise operations. The run_step() method is currently coded to use dask_partition(), which is fast enough to handle the larger Safegraph data. The pandas_partition() implementation would be preferred for smaller datasets.

  2. contact_matrix() also iterates over rows in a pandas data frame to build an xarray. New method build_contact_xr() uses the pandas to_xarray() method to convert a multi-indexed pandas df directly to an xarray (with a bit of tidying up for conformance with test cases).

A rough bench mark is 90 min to partition one day of Safegraph data with the old probabilistic_partition() + contact_matrix() workflow, compared to 27.5 seconds with the dask_partition() + build_contact_xr() workflow on a macbook.