ReinV closed this issue 2 years ago.
Yes, I still get errors like:
MemoryError: Unable to allocate 11.1 GiB for an array with shape (58, 25599413) and data type object
when reading in the searches_by_year data. I have 64 GB memory on the system though.
@magnuspalmblad The change now implemented is that, for every file, the script groups by publication before merging it into the "core" dataframe. I observed memory usage of ~3 GB, which is of course much smaller than the 11 GB reported here.
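A minimal sketch of that per-file aggregation, assuming pandas and hypothetical column names ("publication", "term", "count"); the actual columns and merge logic in make_table.py may differ:

```python
# Sketch: aggregate each searches_by_year file by publication before merging,
# so only the reduced per-file result is held next to the "core" dataframe.
# Column names ("publication", "term", "count") are illustrative assumptions.
import glob
import pandas as pd

core = None
for path in glob.glob("searches_by_year/*.tsv"):
    df = pd.read_csv(path, sep="\t")
    # Collapse to one row per (publication, term) before merging,
    # which shrinks this file's footprint considerably.
    grouped = df.groupby(["publication", "term"], as_index=False)["count"].sum()
    if core is None:
        core = grouped
    else:
        core = (
            pd.concat([core, grouped], ignore_index=True)
              .groupby(["publication", "term"], as_index=False)["count"].sum()
        )
```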
However, I was thinking that the result of this merging could simply be provided by us in a "statistics file" instead of being recreated on every make_table.py run. The script that creates this statistics file should of course be kept, but we could add it to the "SCOPE-maintain" project. This way, we take as much of the burden (mostly memory, in this case) off the user as possible and simply provide the statistics needed to calculate TF-IDF. If they want to know how these were obtained, we refer them to the SCOPE-maintain project. Do you see any drawbacks to this plan?
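A rough sketch of how make_table.py could consume such a precomputed statistics file, assuming a two-column TSV (term, idf) produced once by a SCOPE-maintain script; the file name "chebi_idf.tsv" and its columns are hypothetical:

```python
# Sketch: load precomputed IDF statistics instead of rebuilding them from the
# raw searches_by_year files on every make_table.py run.
# The file name "chebi_idf.tsv" and columns ("term", "idf") are assumptions.
import pandas as pd

idf = pd.read_csv("chebi_idf.tsv", sep="\t", index_col="term")["idf"]

def tfidf(term_counts: pd.Series) -> pd.Series:
    """Compute TF-IDF for a Series of raw term counts indexed by term."""
    tf = term_counts / term_counts.sum()
    # Terms missing from the statistics file get an IDF of 0 here.
    return tf * idf.reindex(term_counts.index).fillna(0.0)
```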
The script now uses the ChEBI2idf file and no longer has memory usage issues.
The "make_table.py" script has high memory usage when it loads and processes all the searching by year files (~7GB). We should look at ways to make this more memory efficient to prefent memory errors.