enzoampil / phcovid

Easy access to updated PH COVID data
MIT License

Changes for faster runtime for get_cases - 55% speedup #20

Closed enzoampil closed 4 years ago

enzoampil commented 4 years ago

To speed up merging the raw JSON data with the gsheets data source, I increased the use of pandas for operations that can be vectorized.

The following changes were made to speed up get_cases:

  1. extract_dsph_gsheet_data now returns a pandas DataFrame directly.
  2. Standardized the targets input, which is now defined in constants.py. This means we can add new gsheets columns by editing only GSHEET_TARGET_COLUMNS.
  3. supplement_data now performs the gsheets data replacements in a vectorized way with the loc method. Intersections in case_id are also now detected with sets instead of lists (more efficient).
  4. get_cases now supplements the data before applying the aliasing. This makes sense because the supplemented data can then still be aliased.
  5. All the elements in NONE_ALIAS will now be converted to numpy.nan instead of just a "none" string. This allows us to utilize the NaN methods in pandas.
  6. Tests were updated to reflect change 1.
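
For reviewers, here is a minimal sketch of the idea behind changes 3 and 5. It is not the code in this PR: the function bodies, the sample rows, and the values given to NONE_ALIAS and GSHEET_TARGET_COLUMNS are assumptions made up for illustration (the real constants live in constants.py), and it assumes case_id values are unique in both frames.

```python
import pandas as pd

# Hypothetical stand-ins for the constants defined in constants.py.
NONE_ALIAS = ["none", "None", ""]
GSHEET_TARGET_COLUMNS = ["age", "status"]


def supplement_data(cases: pd.DataFrame, gsheet: pd.DataFrame) -> pd.DataFrame:
    """Overwrite the target columns of `cases` with gsheet values for shared case_ids."""
    cases = cases.set_index("case_id")
    gsheet = gsheet.set_index("case_id")
    # Set intersection instead of scanning a list for each case_id.
    shared = sorted(set(cases.index) & set(gsheet.index))
    # One vectorized assignment; pandas aligns on index and column labels.
    cases.loc[shared, GSHEET_TARGET_COLUMNS] = gsheet.loc[shared, GSHEET_TARGET_COLUMNS]
    return cases.reset_index()


def none_alias_to_nan(df: pd.DataFrame) -> pd.DataFrame:
    """Replace every NONE_ALIAS entry with NaN so isna/fillna/dropna work on it."""
    return df.mask(df.isin(NONE_ALIAS))


# Made-up sample rows, not real case data.
cases = pd.DataFrame(
    {
        "case_id": ["PH1", "PH2", "PH3"],
        "age": ["30", "none", "45"],
        "status": ["Recovered", "none", "none"],
    }
)
gsheet = pd.DataFrame(
    {
        "case_id": ["PH2", "PH4"],
        "age": ["52", "60"],
        "status": ["Admitted", "Died"],
    }
)

# Supplement first, then clean up the remaining "none" placeholders.
result = none_alias_to_nan(supplement_data(cases, gsheet))
print(result)
```

In this sketch only PH2 is present in both frames, so only that row is overwritten; PH4 is not appended because the intersection drives the assignment. The remaining "none" entries become NaN, so they can be handled with isna, fillna, or dropna.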

Note: one test related to phcovid_network.py fails. I am still working out how this happened and hope to work with @andrewnyu to figure it out.

enzoampil commented 4 years ago

tysm for reviewing!