The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
With 25 years of FERC Form 1 data, the similarity matrix and/or feature matrix used to determine which records are associated with each other has become very large. The process now peaks at about 24 GB of memory, which means it can't be run on a typical laptop.
We should look at ways to reduce the memory footprint. One option is to use a sparse matrix representation of the similarity matrix, dropping all values below some threshold (say, 0.7), which would dramatically reduce the number of values that need to be stored.
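To make the sparse option concrete, here is a rough sketch of thresholding a similarity matrix block by block so the full dense matrix never exists in memory. It assumes cosine similarity over a dense NumPy feature matrix, which may not match what the actual record-linkage code does. The function name, the `block_size` parameter, and the random test data are all made up for illustration.

```python
import numpy as np
from scipy import sparse


def sparse_cosine_similarity(features, threshold=0.7, block_size=1000):
    """Hypothetical helper: cosine similarity between rows, kept sparse.

    Values below `threshold` are dropped, so only strong matches are stored.
    """
    # Normalize rows so a plain dot product equals cosine similarity.
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid dividing all-zero rows by zero
    unit = features / norms

    blocks = []
    for start in range(0, unit.shape[0], block_size):
        # Only one (block_size x n_records) dense slab exists at a time.
        dense_block = unit[start:start + block_size] @ unit.T
        dense_block[dense_block < threshold] = 0.0
        # Converting to CSR stores only the nonzero entries.
        blocks.append(sparse.csr_matrix(dense_block))
    return sparse.vstack(blocks, format="csr")


# Made-up data standing in for the real feature matrix.
rng = np.random.default_rng(0)
features = rng.random((500, 20))
sim = sparse_cosine_similarity(features, threshold=0.9, block_size=100)

dense_bytes = features.shape[0] ** 2 * 8
sparse_bytes = sim.data.nbytes + sim.indices.nbytes + sim.indptr.nbytes
print(f"stored {sim.nnz} of {features.shape[0] ** 2} values")
print(f"dense: {dense_bytes} bytes, sparse: {sparse_bytes} bytes")
```

How much this saves depends entirely on how many pairs clear the threshold in the real data, and downstream steps would need to accept a sparse matrix as input.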
However, we should do some memory profiling first to see which step is actually blowing up the memory usage, before investing a bunch of time in optimizing any one piece of it.
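One lightweight way to do that profiling is with `tracemalloc` from the standard library, recording the peak traced memory for each step. The sketch below is only an illustration: `peak_memory_by_step` and the two toy steps are hypothetical stand-ins, not functions from the actual pipeline.

```python
import tracemalloc


def peak_memory_by_step(steps):
    """Run (name, func) steps in order; return peak traced bytes per step.

    Each func receives the previous step's result. The peak for a step
    includes memory still held from earlier steps.
    """
    peaks = {}
    result = None
    tracemalloc.start()
    try:
        for name, func in steps:
            tracemalloc.reset_peak()  # requires Python 3.9+
            result = func(result)
            _, peak = tracemalloc.get_traced_memory()
            peaks[name] = peak
    finally:
        tracemalloc.stop()
    return peaks


# Toy stand-ins for the real pipeline steps (hypothetical).
def build_features(_):
    return [[float(i + j) for j in range(10)] for i in range(200)]


def build_similarity(features):
    return [
        [sum(a * b for a, b in zip(row1, row2)) for row2 in features]
        for row1 in features
    ]


steps = [("features", build_features), ("similarity", build_similarity)]
peaks = peak_memory_by_step(steps)
for name, peak in peaks.items():
    print(f"{name}: {peak / 1e6:.2f} MB peak")
```

Note that `tracemalloc` only sees allocations that go through Python's memory allocator, so memory allocated internally by native libraries that bypass it would not show up. A process-level profiler may be a useful cross-check against the 24 GB figure.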