ReinV / SCOPE

Search and Chemical Ontology Plotting Environment

Possible memory allocation issue #41

Closed: magnuspalmblad closed this issue 2 years ago

magnuspalmblad commented 2 years ago

When running a search for COVID-19-related papers published in the last two years (search string "(FIRST_PDATE:[2020-01-01 TO 2022-12-31]) AND ("2019-nCoV" OR "2019nCoV" OR "COVID-19" OR "SARS-CoV-2" OR ("wuhan" AND "coronavirus") OR "Coronavirus" OR "Corona virus" OR "corona-virus" OR "corona viruses" OR "coronaviruses" OR "SARS-CoV" OR "Orthocoronavirinae" OR "MERS-CoV" OR "Severe Acute Respiratory Syndrome" OR "Middle East Respiratory Syndrome" OR ("SARS" AND "virus") OR "soluble ACE2" OR ("ACE2" AND "virus") OR ("ARDS" AND "virus") or ("angiotensin-converting enzyme 2" AND "virus"))"), which returns ~360,000 hits, make_table.py runs out of memory. It cannot allocate 16 GB, even though my 64 GB workstation has almost 60 GB free.

This is the output from SCOPE:

(base) G:\Projects\Reinier Vleugels\SCOPE-master_2022>python make_table.py -i results -t folder
getting searches by year ...
sys:1: DtypeWarning: Columns (1) have mixed types.Specify dtype option on import or set low_memory=False.
Traceback (most recent call last):
  File "make_table.py", line 160, in <module>
    main()
  File "make_table.py", line 131, in main
    df_sby = read_searches_by_year()
  File "make_table.py", line 21, in read_searches_by_year
    df = pd.concat([df, df_new])
  File "C:\Users\nmpalmblad\Anaconda3\lib\site-packages\pandas\core\reshape\concat.py", line 281, in concat
    sort=sort,
  File "C:\Users\nmpalmblad\Anaconda3\lib\site-packages\pandas\core\reshape\concat.py", line 360, in __init__
    obj._consolidate(inplace=True)
  File "C:\Users\nmpalmblad\Anaconda3\lib\site-packages\pandas\core\generic.py", line 5365, in _consolidate
    self._consolidate_inplace()
  File "C:\Users\nmpalmblad\Anaconda3\lib\site-packages\pandas\core\generic.py", line 5347, in _consolidate_inplace
    self._protect_consolidate(f)
  File "C:\Users\nmpalmblad\Anaconda3\lib\site-packages\pandas\core\generic.py", line 5336, in _protect_consolidate
    result = f()
  File "C:\Users\nmpalmblad\Anaconda3\lib\site-packages\pandas\core\generic.py", line 5345, in f
    self._data = self._data.consolidate()
  File "C:\Users\nmpalmblad\Anaconda3\lib\site-packages\pandas\core\internals\managers.py", line 940, in consolidate
    bm._consolidate_inplace()
  File "C:\Users\nmpalmblad\Anaconda3\lib\site-packages\pandas\core\internals\managers.py", line 945, in _consolidate_inplace
    self.blocks = tuple(_consolidate(self.blocks))
  File "C:\Users\nmpalmblad\Anaconda3\lib\site-packages\pandas\core\internals\managers.py", line 1887, in _consolidate
    list(group_blocks), dtype=dtype, _can_consolidate=_can_consolidate
  File "C:\Users\nmpalmblad\Anaconda3\lib\site-packages\pandas\core\internals\blocks.py", line 3096, in _merge_blocks
    new_values = new_values[argsort]
MemoryError: Unable to allocate 16.5 GiB for an array with shape (62, 35801312) and data type object

(base) G:\Projects\Reinier Vleugels\SCOPE-master_2022>

Any ideas why this happens, or how to fix it?
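
The traceback shows make_table.py growing one DataFrame by calling pd.concat inside a loop (line 21, in read_searches_by_year), which re-copies all accumulated data on every iteration. A sketch of a more memory-friendly pattern, using a hypothetical read_all helper and file list (the real column layout of the TSV files may differ):

```python
import pandas as pd

def read_all(paths):
    # Collect each file's frame in a list and concatenate once at the
    # end; repeated pairwise concat copies O(n^2) data in total.
    frames = [
        pd.read_csv(p, sep="\t", dtype=str, low_memory=False)
        for p in paths
    ]
    return pd.concat(frames, ignore_index=True)
```

Passing an explicit dtype on read also silences the "Columns (1) have mixed types" DtypeWarning seen above.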

magnuspalmblad commented 2 years ago

I should add that I have the individual years in the "searches_by_year" folder, not the decade files.

magnuspalmblad commented 2 years ago

I now see that I get this error with other, smaller searches as well, such as our examples "APCI" and "HILIC"...

ReinV commented 2 years ago

That makes sense; I expect this is caused by the "searches_by_year" files. Are you using all of them? You could try using only one decade (the most recent one, for example).

Obviously we should make sure "make_table.py" can run with all searches-by-year files, but for now you can check whether fewer files work.

ReinV commented 2 years ago

"Searches_by_decade" are more concise and therefore more memory efficient for "make_table.py". I'm not sure if I ever tested the newest "make_table.py" script using the "searches_by_year" files.

See https://osf.io/cfjde/ for a link to the 2010-2019 searches-by-decade file.
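
Independently of which files are used, memory can often be reduced substantially by converting the default object (string) columns to explicit numeric dtypes. A small sketch with made-up column names and values; the real searches-by-decade layout may differ:

```python
import pandas as pd

# Made-up two-column example resembling a ChEBI ID / count table.
df = pd.DataFrame({"ChEBI": ["15377", "16236"], "count": ["10", "3"]})

before = df.memory_usage(deep=True).sum()

# Downcast string columns to the smallest numeric dtype that fits;
# object columns of Python strings take far more memory than int16/int32.
df["ChEBI"] = pd.to_numeric(df["ChEBI"], downcast="integer")
df["count"] = pd.to_numeric(df["count"], downcast="integer")

after = df.memory_usage(deep=True).sum()
print(before, after)
```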

magnuspalmblad commented 2 years ago

I replaced the individual years with the decades, but then I get this error:

(base) G:\Projects\Reinier Vleugels\SCOPE-master_2022>python make_table.py -i results -t folder
getting searches by year ...
Traceback (most recent call last):
  File "make_table.py", line 160, in <module>
    main()
  File "make_table.py", line 135, in main
    data = import_properties()
  File "make_table.py", line 60, in import_properties
    df['ChEBI'] = df['ChEBI'].astype(int)
  File "C:\Users\nmpalmblad\Anaconda3\lib\site-packages\pandas\core\frame.py", line 2773, in __getitem__
    if self.columns.is_unique and key in self.columns:
  File "C:\Users\nmpalmblad\Anaconda3\lib\site-packages\pandas\core\generic.py", line 5270, in __getattr__
    return object.__getattribute__(self, name)
  File "pandas\_libs\properties.pyx", line 63, in pandas._libs.properties.AxisProperty.__get__
  File "C:\Users\nmpalmblad\Anaconda3\lib\site-packages\pandas\core\generic.py", line 5270, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute '_data'

(base) G:\Projects\Reinier Vleugels\SCOPE-master_2022>ls searches_by_year
1940-1949_ChEBI_IDS.tsv  1980-1989_ChEBI_IDS.tsv  2020-2029_ChEBI_IDs.tsv
1950-1959_ChEBI_IDS.tsv  1990-1999_ChEBI_IDS.tsv  pre1945_ChEBI_IDs.tsv
1960-1969_ChEBI_IDS.tsv  2000-2009_ChEBI_IDS.tsv
1970-1979_ChEBI_IDS.tsv  2010-2019_ChEBI_IDs.tsv

(base) G:\Projects\Reinier Vleugels\SCOPE-master_2022>

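
Separately from the pandas-internals AttributeError above, the statement it fails on, df['ChEBI'] = df['ChEBI'].astype(int), is fragile if the column ever contains missing or malformed values. A more defensive conversion using pandas' nullable integer dtype might look like this (illustrative data only):

```python
import pandas as pd

# Illustrative data only: a ChEBI column with one missing entry.
df = pd.DataFrame({"ChEBI": ["15377", None, "16236"]})

# astype(int) would raise here; to_numeric with errors="coerce" turns
# bad values into NaN, and the nullable Int64 dtype keeps them as <NA>.
df["ChEBI"] = pd.to_numeric(df["ChEBI"], errors="coerce").astype("Int64")
```
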
ReinV commented 2 years ago

This error occurs when processing the files from the "files" folder. Could you first check whether you have all the recent files from the OSF storage?

magnuspalmblad commented 2 years ago

Yes, I get the exact same error with the files from the "files" folder on the OSF.

ReinV commented 2 years ago

I cannot reproduce this error, so before asking you to add debug print statements, could you also check 1) whether you have the latest version of the "make_table.py" script, and 2) whether your pandas package is up to date (pip install --upgrade pandas)?
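
A quick way to check the installed version; the "'DataFrame' object has no attribute '_data'" AttributeError is characteristic of pandas-internal API changes between versions, so version information is useful in reports like this:

```python
import pandas as pd

# The installed pandas version; pd.show_versions() prints a fuller
# environment report suitable for bug reports.
print(pd.__version__)
```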

ReinV commented 2 years ago

Still the same issue?

ReinV commented 2 years ago

Updating pandas solved this issue.