Csvmatrix memory issue

Problem

To create the design matrix, we generate a pandas DF for each feature table that needs to be merged into the matrix. Pandas DFs take up a lot of memory, which forces us onto expensive EC2 instance types just to generate the design matrices. If the matrix is big (~30 GB), we need EC2 instances with 512 GB of RAM.

Solution

The solution builds on the code already developed on branch jsl/matrix_building_memory, with additional improvements.

Summary of changes:

Instead of creating a DF for each feature table to be merged, we now generate CSV files, which saves memory.
Instead of merging DFs to generate the design matrix, we stitch the CSV files together using subprocess.
We add sanity checks to verify that all the files (features and labels) have the same number of rows.
We add the entity_id and as_of_date columns to each generated feature and label file and sort every file by them, so that all files contain the same rows in the same order. We then remove entity_id and as_of_date from all files except one.
To read the CSV as a DF we use polars, which reduces the time needed to load a 37 GB matrix from ~1 hr to < 3 min and also lowers the memory used.
While reading the CSV we downcast numeric columns to Float32.
We still convert the polars DF to a pandas DF so that the rest of the process works unchanged.
Adjustments to how we load a CSV matrix, now using polars.
Added polars and pyarrow to the requirements.
Adjustments to the unit tests for stitching the CSV files together to generate the design matrix.
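
The row-count sanity check and the stitching step can be illustrated with a minimal sketch. This is not the code from the branch: the function names count_rows and stitch_csvs, the file names, and the choice of the Unix paste command as the subprocess are all assumptions made for illustration. It also assumes every file is already sorted by entity_id and as_of_date and that only the first file still carries those two columns.

```python
import subprocess
import tempfile
from pathlib import Path


def count_rows(path):
    """Return the number of data rows in a CSV file (header excluded).

    Assumes one record per line, i.e. no quoted fields with embedded newlines.
    """
    with open(path) as f:
        return sum(1 for _ in f) - 1


def stitch_csvs(paths, out_path):
    """Join pre-sorted CSV files column-wise into a single CSV (hypothetical helper)."""
    # Sanity check: every feature/label file must have the same number of rows.
    counts = {str(p): count_rows(p) for p in paths}
    if len(set(counts.values())) > 1:
        raise ValueError(f"Row counts differ across files: {counts}")
    # `paste -d ,` glues line i of every input file together with commas,
    # so the data never has to be loaded into Python memory.
    with open(out_path, "w") as out:
        subprocess.run(
            ["paste", "-d", ",", *[str(p) for p in paths]],
            stdout=out,
            check=True,
        )
    return out_path


# Tiny illustrative inputs; only the first file keeps the key columns.
workdir = Path(tempfile.mkdtemp())
features_a = workdir / "features_a.csv"
features_b = workdir / "features_b.csv"
labels = workdir / "labels.csv"
features_a.write_text("entity_id,as_of_date,f1\n1,2020-01-01,0.5\n2,2020-01-01,1.5\n")
features_b.write_text("f2\n10\n20\n")
labels.write_text("label\n0\n1\n")

matrix_path = stitch_csvs([features_a, features_b, labels], workdir / "matrix.csv")
print(matrix_path.read_text())
```

Because the join is purely positional, the result is only correct if the files really are in the same order, which is why the sorting step and the row-count check come before stitching.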

dssg / triage

Csvmatrix memory issue #930