biocypher / sc2cl

Knowledge graph for mapping and harmonising single cell experiments
MIT License

Test last year's project #7

Closed simonmoedinger closed 1 year ago

simonmoedinger commented 1 year ago

Steps:

  1. In the folder "sc2cl", run `pip install -r ontology-mapper/last_year_project/requirements.txt`, then manually install any packages that could not be installed automatically.

  2. Run `extract.py`, which creates an `outfile.csv` from the CL ontology from obolibrary.org.

  3. Run `generator.py -n --bert biobert "FULL-path-to-csv/sc2cl/ontology-mapper/last_year_project/outfile.csv"`, then manually install any packages that could not be installed automatically.

NOT able to run the BERT model:

  1. ...
simonmoedinger commented 1 year ago

Error step 3:

[photo of the stack trace from step 3]

@slobentanzer do you have a solution for that?

slobentanzer commented 1 year ago

@simonmoedinger Can you please, also for the future, copy and paste the stack trace? Skewed mobile phone pictures of a computer screen are not very readable. Seems like a memory issue, maybe? Did you try asking ChatGPT?

slobentanzer commented 1 year ago

I found some info in one of last year's emails and added it to the README of the subdirectory.

RM7 commented 1 year ago

I ran the generator.py command, but when it reaches around 93% it crashes with an error:

RuntimeError: [enforce fail at ..\c10\core\impl\alloc_cpu.cpp:72] data. DefaultCPUAllocator: not enough memory: you tried to allocate 319488 bytes.

I'm running it on my PC with enough disk space and 16GB of RAM.

slobentanzer commented 1 year ago

In some way you are running out of memory. What have you tried so far to narrow down the cause? IIRC the tensors were concatenated in some approaches; maybe that's too memory-intensive for your machine. Are all 16GB available to the process?
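One quick way to check (a minimal sketch, assuming the third-party psutil package is installed; it is not part of the project's requirements):

```python
import psutil  # third-party: pip install psutil

# How much RAM does the OS consider available right now, and how much
# is this Python process already holding resident?
vm = psutil.virtual_memory()
proc = psutil.Process()
print(f"available: {vm.available / 1e9:.1f} GB of {vm.total / 1e9:.1f} GB total")
print(f"this process: {proc.memory_info().rss / 1e9:.2f} GB resident")
```

If "available" is far below 16GB, other applications (or the OS itself) may be eating into the budget before generator.py even starts.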

RM7 commented 1 year ago

All 16GB should be available. Maybe someone with more RAM could run it and push the result.

simonmoedinger commented 1 year ago

Same situation with my notebook...

slobentanzer commented 1 year ago

Hi René @cieldeville, do you know what's happening here? Can you help?

cieldeville commented 1 year ago

Hi Sebastian,

happy to see the project is being used!

As for the error above: I just ran the exact command from the screenshot above, only changing the input .csv file to an older copy of the obo ontology I still had from last year. That version contained 2479 words and completed successfully, indicating to me that this is most likely simply an out-of-memory error.

While running generator.py I did, however, notice a strong spike in memory usage at some point. I do not remember that being the case when I used this software last year. Since it downloads the model and its configuration from the internet to embed words, it might be worth looking into whether some hidden configuration may have changed since.

Alternatively, the code should be easily extendable to use Python iterables instead of arrays, which could free up some memory on the go. Additionally, changing the generated embedding by specifying a list of embeddings to generate (using the --embeddings parameter) might help narrow things down. The default value is concat_four_hidden_mean, which is indeed the largest embedding to be generated, at around ~85MB for the final .npy file on my system.
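To make the iterables idea concrete, here is a minimal sketch (the function names are placeholders, not the project's actual code): instead of collecting every embedding in a list and concatenating at the end, the vectors can be generated lazily and written row by row into a pre-allocated output array.

```python
import numpy as np
from typing import Callable, Iterator

def embed_lazily(words: list[str],
                 embed_word: Callable[[str], np.ndarray]) -> Iterator[np.ndarray]:
    # Hypothetical generator: yields one embedding vector at a time
    # instead of holding all of them in memory simultaneously.
    for word in words:
        yield embed_word(word)  # e.g. one forward pass through the model

def save_embeddings(words: list[str],
                    embed_word: Callable[[str], np.ndarray],
                    out_path: str, dim: int) -> None:
    # Pre-allocate the final matrix once and fill it row by row, so peak
    # memory stays near the size of the output rather than the output
    # plus all intermediate per-word copies awaiting concatenation.
    out = np.empty((len(words), dim), dtype=np.float32)
    for i, vec in enumerate(embed_lazily(words, embed_word)):
        out[i] = vec
    np.save(out_path, out)
```

The point of this design is that only the current word's tensors and the fixed-size output matrix are alive at any time.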

I hope this somewhat helps!

Best regards, René

slobentanzer commented 1 year ago

Hi René, that was my suspicion as well, thanks for the quick answer! Hope all is going well with your studies. :)

RM7 commented 1 year ago

Hey,

thank you for your response.

We tried playing with the --embeddings parameter, but it does not solve the problem. Even when using the cluster in Heidelberg (bwvisu) with the maximum of 50GB of RAM, the process is killed at 82%, independent of the --embeddings parameter.

Regarding the Python iterables: you already used lists instead of arrays, so we did not change anything there.

We still do not know why it is not working, and since it does not work even with 50GB of RAM, there must be a serious error in the code.

cieldeville commented 1 year ago

Hello RM7,

I am sorry to hear you're still experiencing issues with the software.

While I did not have a lot of time, I did try to run the software with the latest csv file produced by extract.py. Additionally, I ran some basic memory profiling using tracemalloc and the cProfile Python module to see if I could find any evidence as to what might be causing the rise in memory usage. Here's what I found:

Running generator.py on the latest csv file produced by extract.py allows me to reproduce your error from above.

Outputting the top-10 lines that allocate the most memory using tracemalloc, printed every 100 words, consistently showed mostly the transformers library at the top of the chain (see this gist for an excerpt of a sample log). This is not surprising, since it hosts the actual model after all.

Using cProfile and plotting the resulting call-chain traces confirms that the line consistently causing the most memory allocations according to tracemalloc is a line executed inside the internal transformers module: [call-graph plot "16_OntologyMapping1"]
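For reference, the tracemalloc pattern I used looks roughly like this (a sketch with stand-in data, not the actual profiling script):

```python
import tracemalloc

def process(word: str) -> list[float]:
    # Stand-in for the per-word embedding step in generator.py.
    return [0.0] * 3072

words = ["neuron", "astrocyte", "hepatocyte"]  # stand-in for the csv terms

tracemalloc.start()
for i, word in enumerate(words):
    process(word)
    if i % 100 == 0:
        # Every 100 words, print the ten source lines that currently
        # hold the most allocated memory, grouped by file and line.
        snapshot = tracemalloc.take_snapshot()
        for stat in snapshot.statistics("lineno")[:10]:
            print(stat)
```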

I currently see two reasons this could be the case:

  1. Perhaps my code creates unnecessary state objects each iteration. While certainly not unlikely, I do have some reservations about this: creating any stateful objects from the transformers library requires some configuration to be passed into the respective constructors, so I would have to pass such configuration around and use it inside loops. As far as I could remember, and could see when looking through the code again, that is not the case.
  2. The transformers module is responsible for the large amounts of memory allocation.

As you can guess, there is little to nothing I can do about no. 2. As I outlined in my first post, it is possible that some hidden configuration changed at some point: e.g. default values may have changed, or the model that is downloaded on application startup may now use different parameters than it used to, resulting in new usage patterns that the code from a year ago has not adapted to.

Something that at first glance appears to corroborate this intuition: when I run the same commands specifying the scibert model instead of the biobert model, the code runs cleanly; I do not see the memory spikes I saw before, and the output is produced without any errors.

Lastly, I would like to point out that talking of a "serious error in the code" without providing additional information, or at the very least showing that you have done some investigation yourself, is somewhat lackluster. I am afraid I will not be able to dive back into this project at the moment, but I do hope my findings above will prove helpful to you. I would suggest checking the biobert and transformers project pages to see if there is any mention of configuration parameters that may have changed over the past year, or whether anyone else has been facing similar memory issues.

Good luck with the project! René

cieldeville commented 1 year ago

One addendum: if the reason for the memory usage were in fact to be found within the project's code and not within the transformers library, I would assume that at some point a line from the project's code would overtake it in terms of memory allocations. This did not happen at any point when I ran the sciBERT generator (see this gist). Of course, this by no means rules out that the library is being invoked incorrectly and a lot of state is being recreated on each invocation; that is essentially item no. 1 from above.

Additionally, it appears that numpy uses its own allocator, hence numpy arrays do not show up in tracemalloc. Since the only numpy arrays created before the software crashes with the error above stem from within the transformers library (or, more precisely, from converting the tensors returned by the library to numpy arrays), the points above should still stand.
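If one wanted to watch the numpy side regardless, a crude manual tally would work (a sketch, not something I ran; `track` is a made-up helper):

```python
import numpy as np

np_bytes = 0  # running total of numpy payloads created by our own code

def track(arr: np.ndarray) -> np.ndarray:
    # tracemalloc misses numpy's allocator, so count array payloads by hand.
    global np_bytes
    np_bytes += arr.nbytes
    return arr

# Wrap each spot where a returned tensor is converted to a numpy array:
vec = track(np.zeros((768,), dtype=np.float32))  # stand-in for tensor.numpy()
print(f"numpy payload so far: {np_bytes / 1e6:.3f} MB")
```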

Since all of this was done rather quickly, I could have easily overlooked something important. So do take all of the above with a grain of salt.

slobentanzer commented 1 year ago

Thanks for your time, @cieldeville!

This sounds like a reproducibility problem. However, there is a detailed requirements.txt that should serve to keep the versions stable. How could the transformers interface change despite keeping the same version? We do pin version 4.20.1...

cieldeville commented 1 year ago

I think it might not be transformers per se, but possibly whatever data transformers downloads on the fly to run the BioBERT model. The model in turn is then run through transformers.

slobentanzer commented 1 year ago

Makes sense. But the models are also versioned. Is there any way to pin the version that was current when you first implemented it?
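If the model is loaded through Hugging Face's from_pretrained, the revision keyword should allow exactly that (a minimal sketch; the model id and revision below are placeholders, not necessarily what the project uses):

```python
from transformers import AutoModel, AutoTokenizer

# Pinning the hub revision (a branch, tag, or commit hash) fixes the
# downloaded weights and config so they cannot silently change over time.
MODEL_ID = "dmis-lab/biobert-v1.1"  # placeholder: whichever id the project uses
REVISION = "main"                   # placeholder: ideally last year's commit hash

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, revision=REVISION)
model = AutoModel.from_pretrained(MODEL_ID, revision=REVISION)
```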

simonmoedinger commented 1 year ago

Thank you for your answers. Due to time constraints and the fact that ccb worked well, we did not pursue the problem any further.