dssg / triage

General Purpose Risk Modeling and Prediction Toolkit for Policy and Social Good Problems

Csvmatrix memory issue #928

Closed silil closed 1 year ago

silil commented 1 year ago

Problem

We've been having issues handling big matrices in Triage. We found that when creating the design matrix with all the features and labels, we were creating intermediate pandas data frames that consumed too much memory, to the point that we needed EC2 instances with at least 192 GB of RAM just to build the design matrix.

What has been done

This pull request builds the matrix without creating intermediate data frames. Instead, it writes one CSV file per generated feature table, then merges (stitches) those files into a single CSV containing the entity id, as-of date, features, and label, in that order, and deletes the temporary CSV files afterwards. The creation and deletion of these files could be improved further by using temporary file objects.
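
To illustrate the stitching idea, here is a minimal sketch using only the Python standard library. It is not the code in this pull request: the function name `stitch_csvs`, the key column names `entity_id` and `as_of_date`, and the assumption that every per-table CSV has a header and lists the same rows in the same order are all hypothetical choices made for the example.

```python
import csv
import os
from contextlib import ExitStack
from itertools import zip_longest


def stitch_csvs(feature_paths, label_path, out_path,
                key_cols=("entity_id", "as_of_date")):
    """Merge per-table CSVs column-wise, holding one row per file in memory.

    Assumption (hypothetical, for this sketch): every input file has a header
    row, starts with the key columns, and lists the same keys in the same order.
    """
    n_keys = len(key_cols)
    in_paths = [*feature_paths, label_path]  # label goes last

    with ExitStack() as stack:
        readers = [
            csv.reader(stack.enter_context(open(path, newline="")))
            for path in in_paths
        ]
        writer = csv.writer(stack.enter_context(open(out_path, "w", newline="")))

        # Header: keys once, then each file's non-key columns in input order.
        headers = [next(reader) for reader in readers]
        for path, header in zip(in_paths, headers):
            if tuple(header[:n_keys]) != tuple(key_cols):
                raise ValueError(f"{path} does not start with {key_cols}")
        writer.writerow(
            list(key_cols) + [col for h in headers for col in h[n_keys:]]
        )

        # Stream the data rows in lockstep; no full table is ever loaded.
        for rows in zip_longest(*readers):
            if any(row is None for row in rows):
                raise ValueError("input files have different row counts")
            key = rows[0][:n_keys]
            if any(row[:n_keys] != key for row in rows):
                raise ValueError(f"misaligned keys near {key}")
            writer.writerow(key + [v for row in rows for v in row[n_keys:]])

    # Delete the temporary per-table files once the merged file is complete.
    for path in in_paths:
        os.remove(path)
```

Calling `stitch_csvs(["f1.csv", "f2.csv"], "label.csv", "matrix.csv")` under these assumptions would leave a single `matrix.csv` with the keys, the feature columns of each file, and the label column last. Peak memory stays at roughly one row per input file, which is the point of avoiding intermediate data frames.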

We still return a data frame at the end, which still requires a lot of memory, but no longer the unnecessary extra memory we were using before.

I updated the unit tests associated with the building of the matrix and the new functions associated with it.

silil commented 1 year ago

We are adding polars to the solution.