dssg / triage

General Purpose Risk Modeling and Prediction Toolkit for Policy and Social Good Problems

Csvmatrix memory issue #930

Closed by silil 9 months ago

silil commented 1 year ago

Problem

To create the design matrix, we generate a pandas DataFrame for each feature table and then merge them all into the matrix. Pandas DataFrames take up a lot of memory, which forces us onto expensive EC2 instance types just to generate the design matrices. If the matrix is big (~30 GB), we need EC2 instances with 512 GB of RAM.
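
To make the memory pattern concrete, here is a minimal sketch of the approach described above: one DataFrame per feature table, all merged in memory. It is an illustration, not the actual triage code. The helper names (`build_matrix_in_memory`, `total_memory_bytes`), the index columns (`entity_id`, `as_of_date`), and the toy feature columns are assumptions made up for this example.

```python
from functools import reduce

import pandas as pd


def build_matrix_in_memory(feature_frames, index_cols):
    """Merge per-feature-table DataFrames into one design matrix, all in RAM.

    Hypothetical helper for illustration only.
    """
    return reduce(
        lambda left, right: left.merge(right, on=index_cols, how="left"),
        feature_frames,
    )


def total_memory_bytes(frames):
    """Sum the deep memory footprint (in bytes) of a list of DataFrames."""
    return sum(int(frame.memory_usage(deep=True).sum()) for frame in frames)


# Toy stand-ins for feature tables; names and values are assumptions.
index_cols = ["entity_id", "as_of_date"]
base = pd.DataFrame({"entity_id": [1, 2, 3], "as_of_date": ["2020-01-01"] * 3})
feature_frames = [
    base.assign(events_count=[0, 2, 5]),
    base.assign(days_since_last=[10.0, 3.5, 1.0]),
]

matrix = build_matrix_in_memory(feature_frames, index_cols)

# The input frames and the merged matrix are alive at the same time,
# so peak usage is at least the sum of both footprints.
inputs_bytes = total_memory_bytes(feature_frames)
matrix_bytes = total_memory_bytes([matrix])
print(matrix.shape)
print(f"inputs: {inputs_bytes} bytes, matrix: {matrix_bytes} bytes")
```

Since every feature DataFrame stays referenced while the merged result is built, and each merge creates another intermediate copy, RAM usage can end up being a multiple of the final matrix size.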

Solution

The solution builds on the code already developed on branch jsl/matrix_building_memory, plus some other improvements.

Summary of changes: