ReinV / SCOPE

Search and Chemical Ontology Plotting Environment

Risk of memory error running "make_table.py" #38

Closed ReinV closed 2 years ago

ReinV commented 2 years ago

The "make_table.py" script has high memory usage when it loads and processes all the searching by year files (~7GB). We should look at ways to make this more memory efficient to prefent memory errors.

magnuspalmblad commented 2 years ago

Yes, I still get errors like:

MemoryError: Unable to allocate 11.1 GiB for an array with shape (58, 25599413) and data type object

when reading in the searches_by_year data. I have 64 GB of memory on the system, though.

ReinV commented 2 years ago

@magnuspalmblad The implementation now groups each file by publication before merging it into the "core" dataframe. I observed a memory usage of ~3 GB, which is of course much smaller than the 11 GB reported here.
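Roughly, the idea looks like the following minimal sketch (the file location, separator, and column names here are assumptions for illustration, not the actual SCOPE schema):

```python
import glob
import pandas as pd

core = None
for path in glob.glob("searches_by_year/*.tsv"):  # assumed location and format
    df = pd.read_csv(path, sep="\t")
    # Collapse this file to one row per publication before touching the core
    # dataframe, so the full ~7 GB of raw rows never sits in memory at once.
    grouped = df.groupby("publication", as_index=False).sum(numeric_only=True)
    if core is None:
        core = grouped
    else:
        core = (
            pd.concat([core, grouped])
            .groupby("publication", as_index=False)
            .sum(numeric_only=True)
        )
```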

However, I was thinking that we could just provide the result of this merging in a "statistics file" instead of recreating it for every make_table.py run. The script that creates this statistics file should of course be kept, but we could add it to the "SCOPE-maintain" project. This way, we take as much of the burden (mostly memory, in this case) off the user as possible and simply provide the statistics needed to calculate TF-IDF. If they want to know how these are obtained, we refer them to the SCOPE-maintain project. Do you see any drawbacks to this plan?
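With such a file, make_table.py itself would only need something like this sketch to compute TF-IDF (the statistics file name, its columns, and the publication total are placeholders, not the real format):

```python
import numpy as np
import pandas as pd

# Assumed precomputed statistics: one row per term with the number of
# publications that mention it; the total publication count would also be
# shipped with the file (a placeholder constant is used here).
stats = pd.read_csv("term_statistics.tsv", sep="\t")  # columns: term, n_publications_with_term
total_publications = 35_000_000                        # placeholder value

# IDF per term from the precomputed document frequencies.
stats["idf"] = np.log(total_publications / stats["n_publications_with_term"])

def tfidf(term: str, term_count_in_results: int) -> float:
    """TF-IDF for one term in the user's search results (sketch only)."""
    idf = stats.loc[stats["term"] == term, "idf"].iloc[0]
    return term_count_in_results * idf
```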

ReinV commented 2 years ago

The script now uses the ChEBI2idf file and no longer has memory usage issues.