NREL / gdx-pandas

Python interface to read and write GAMS GDX files using pandas.DataFrames as the intermediate data format.
BSD 3-Clause "New" or "Revised" License

Lazy load should not cache results #56

Closed jebob closed 3 years ago

jebob commented 5 years ago

In my use case, I am trying to pick out some small and large symbols from a very large (3 GB) GDX. It is too large to read with to_dataframes(). Even opening the GDX with lazy load is super slow, so calling to_dataframe() once per symbol is also slow. In the end I settled for creating a Translator object and reusing it, but I discovered that the results are cached, so my large symbols have duplicated dataframes whenever I do anything with them.
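Roughly, the pattern I settled on looks like this (a sketch only; it assumes `Translator` takes a `lazy_load` flag and exposes a `dataframe(symbol_name)` accessor, and the file path and symbol names are placeholders):

```python
import gdxpds
from gdxpds.read_gdx import Translator

GDX_PATH = "very_large.gdx"          # placeholder: the 3GB file
WANTED = ["small_sym", "large_sym"]  # placeholder symbol names

# Too slow: each call re-opens and re-scans the whole GDX.
# dfs = {name: gdxpds.to_dataframe(GDX_PATH, name) for name in WANTED}

# Reuse one Translator instead, so the file is only opened once.
translator = Translator(GDX_PATH, lazy_load=True)
dfs = {name: translator.dataframe(name) for name in WANTED}
# Downside: the Translator also keeps the loaded symbols internally,
# so the large symbols end up in memory twice.
```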

Perhaps we should not store the symbol state by default?

elainethale commented 5 years ago

What particular data do you think should not be stored? I load some data so users can see what is in the file before choosing exactly which symbols to load.

It also sounds like you would like the ability to "unload" a symbol. I could imagine doing that with either an explicit `GdxSymbol.unload` method, or maybe some sort of `__enter__`/`__exit__` syntax ...
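In sketch form, the two options might look like this (everything interesting below is hypothetical at this point; the unload method, the with-block behavior, and the name-based lookup are illustrative only):

```python
import gdxpds.gdx

def process(df):
    print(len(df))  # stand-in for whatever the caller does with the data

with gdxpds.gdx.GdxFile(lazy_load=True) as gdx:
    gdx.read("very_large.gdx")   # placeholder path

    # Option 1: explicit unload that frees the cached dataframe.
    symbol = gdx["large_sym"]    # illustrative lookup of one symbol
    symbol.load()
    process(symbol.dataframe)
    symbol.unload()              # hypothetical GdxSymbol.unload

    # Option 2: __enter__/__exit__ so leaving the block unloads automatically.
    with gdx["large_sym"] as symbol:
        process(symbol.dataframe)
```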

jebob commented 5 years ago

Currently, the Translator object stores the values of the loaded symbol.

The ability to unload a symbol would work.

elainethale commented 3 years ago

`gdxpds.read_gdx.Translator.dataframe` does not appear to cache the dataframe separately in the Translator object, but it does return a copy. Is the copy problematic?

elainethale commented 3 years ago

I added an unload feature: https://github.com/NREL/gdx-pandas/commit/bee08e1c256ba32e3d37d63e09819dc2f9cc7559. Does that fix the issue?
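Usage would look roughly like this with the lazy-load GdxFile interface (a sketch; I am assuming the new feature is exposed as a `GdxSymbol.unload()` method, per the earlier discussion, and the path is a placeholder):

```python
import gdxpds.gdx

with gdxpds.gdx.GdxFile(lazy_load=True) as gdx:
    gdx.read("very_large.gdx")   # placeholder path
    for symbol in gdx:
        symbol.load()            # pull in this symbol's records
        df = symbol.dataframe    # ... use the data ...
        symbol.unload()          # ... then release it (the new feature)
```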

jebob commented 3 years ago

It does not. I think the problem here is more subtle and not what I originally described.

```python
import gdxpds
import pandas as pd
import time


def read_big():
    time.sleep(10)  # RAM here 876.4MB
    print("loading")
    x = gdxpds.to_dataframes("big.gdx")
    print("loaded")
    time.sleep(10)  # RAM here 1139.9MB
    del x
    print("unloaded")
    time.sleep(10)  # RAM here 1117.2MB


def make_big():
    # This doesn't increase RAM, so pandas is not to blame
    time.sleep(10)  # RAM here 939.1MB
    print("loading")
    x = pd.DataFrame({"i": list(range(1000000)),
                      "j": list(range(1000000)),
                      "value": list(5 for _ in range(1000000))})
    print("loaded")
    time.sleep(10)  # RAM here 962.1MB
    del x
    print("unloaded")
    time.sleep(10)  # RAM here 939.2MB


#gdxpds.to_gdx("big.gdx", {"big": x})
read_big()
#make_big()
```

I think there is a memory leak. I will close this issue and put together a minimal example.
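A minimal check along those lines could probe the process's resident memory directly instead of eyeballing a monitor; this sketch assumes psutil is available and reuses the `big.gdx` file from the example above:

```python
import gc
import os

import psutil   # third-party; used only to read this process's resident memory
import gdxpds


def rss_mb():
    """Resident set size of the current process, in MB."""
    return psutil.Process(os.getpid()).memory_info().rss / 1e6


print(f"before load: {rss_mb():7.1f} MB")
dfs = gdxpds.to_dataframes("big.gdx")       # the large GDX from above
print(f"after load:  {rss_mb():7.1f} MB")
del dfs
gc.collect()
print(f"after del:   {rss_mb():7.1f} MB")   # a leak shows up as RSS staying high
```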