catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Reduce memory usage of FERC Plant ID Assignment #475

Closed zaneselvans closed 5 months ago

zaneselvans commented 4 years ago

With 25 years of FERC Form 1 data, the similarity matrix and/or feature matrix that's used to determine which records are associated with each other has become very large. The process now peaks at about 24 GB of memory, which makes it impossible to run on a typical laptop.

We should look at ways to reduce the memory footprint. One option is to use a sparse matrix representation of the similarity matrix, dropping all values below a certain threshold (0.7 or whatever), which will dramatically reduce the number of values that need to be stored.
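To make the thresholding idea concrete, here is a minimal sketch in plain Python. It is not the PUDL implementation: the function name `sparse_similarity`, the use of cosine similarity, and the toy feature vectors are all assumptions for illustration. It stores only the pairs at or above the threshold in a dictionary keyed by `(i, j)`, so the dense n-by-n matrix is never materialized. In practice something like `scipy.sparse` would probably be a better fit than a dictionary.

```python
import math


def sparse_similarity(features, threshold=0.7):
    """Return {(i, j): cosine similarity} for i < j, keeping values >= threshold.

    Hypothetical helper for illustration only, not part of PUDL.
    """
    # Pre-compute vector norms so each pairwise cosine is one dot product and one divide.
    norms = [math.sqrt(sum(x * x for x in row)) for row in features]
    kept = {}
    for i, row_i in enumerate(features):
        for j in range(i + 1, len(features)):
            if norms[i] == 0 or norms[j] == 0:
                continue  # similarity is undefined for all-zero vectors
            dot = sum(a * b for a, b in zip(row_i, features[j]))
            sim = dot / (norms[i] * norms[j])
            if sim >= threshold:
                kept[(i, j)] = sim  # everything below the threshold is dropped
    return kept


# Toy feature vectors (made up, not real FERC Form 1 records).
features = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.9, 0.1],
]
similar_pairs = sparse_similarity(features, threshold=0.7)
```

With these toy vectors only two of the six possible pairs survive the 0.7 cutoff. The savings on real data would depend on how many record pairs actually score above the threshold, which is something to measure rather than assume.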

However, we should do some memory use profiling to see which step is actually blowing up the memory usage before investing a bunch of time in any particular piece of it.
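One low-effort way to do that profiling is the standard library's `tracemalloc` module, wrapped around each step in turn. The sketch below is only illustrative: `peak_memory_mib`, `build_feature_matrix`, and `build_similarity_matrix` are hypothetical stand-ins, not the real plant ID assignment functions, and `tracemalloc` only sees allocations that go through Python's memory allocators, so its numbers may not match what the operating system reports.

```python
import tracemalloc


def peak_memory_mib(step, *args, **kwargs):
    """Run step(*args, **kwargs); return (result, peak traced memory in MiB)."""
    tracemalloc.start()
    try:
        result = step(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak / 2**20


# Hypothetical stand-ins for the real pipeline steps.
def build_feature_matrix(n_records, n_features):
    return [[0.0] * n_features for _ in range(n_records)]


def build_similarity_matrix(features):
    n = len(features)
    return [[0.0] * n for _ in range(n)]  # dense n-by-n, grows quadratically


features, feature_peak = peak_memory_mib(build_feature_matrix, 500, 10)
_, similarity_peak = peak_memory_mib(build_similarity_matrix, features)

print(f"feature matrix peak:    {feature_peak:.2f} MiB")
print(f"similarity matrix peak: {similarity_peak:.2f} MiB")
```

Comparing the per-step peaks shows which stage dominates before any of them gets rewritten.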

zaneselvans commented 8 months ago

@zschira when you get the new resource-efficient ferc-to-ferc matching in, I think we can close this classic issue!