dhimmel / learn

Machine learning and feature extraction for the Rephetio project
https://doi.org/10.15363/thinklab.d210

Kernel dies in all-features/3-extract.ipynb due to excessive RAM usage #3

Closed NuriaQueralt closed 7 years ago

NuriaQueralt commented 7 years ago

Hi Daniel,

I am trying to run the extract notebook on a reduced Hetionet without success. The total number of queries is 41,704,686, but the kernel dies at about 3% of the submitted queries. I am working with two workers. Any hint on how to solve this issue will be very welcome.

Many thanks in advance! Nuria

dhimmel commented 7 years ago

Okay so you're getting a dead kernel when you run the all-features/3-extract.ipynb notebook.

It would be great if we could get the error message. Maybe check your Jupyter notebook server shell to see if there are any error messages. Another option would be to export the notebook as a script (first remove %%time) and see if the code works as a .py file. Also, see if you can monitor your RAM usage... let's make sure it's not a memory leak or overflow, which I don't think should be happening, but who knows.
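For reference, a rough sketch of those two suggestions, assuming a Linux machine with nbconvert installed alongside Jupyter (the notebook name below is just the one from this issue):

```shell
# Export the notebook to a plain Python script (strip %%time cell magics first)
jupyter nbconvert --to script 3-extract.ipynb
python 3-extract.py

# In another terminal, watch memory usage while the script runs (refreshes every 5 seconds)
watch -n 5 free -h
```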

dhimmel commented 7 years ago

Are you using conda to manage your environment? Can you comment a bit on your environment?

NuriaQueralt commented 7 years ago

Exactly, I am getting a dead kernel when executing the all-features/3-extract.ipynb notebook.

I have looked for the error message, but I haven't been able to find a Jupyter or IPython log file in the console, so I have activated the debug option in the jupyter_notebook_config.py file by setting it to true. Let's see if a log file is created now with this option activated.
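An equivalent way to get the same debug output and also keep a copy of it in a file, in case the config option alone doesn't produce one (just a sketch; the log filename is arbitrary):

```shell
# Start the notebook server with debug-level logging and save the output to jupyter.log
jupyter notebook --debug 2>&1 | tee jupyter.log
```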

I am using conda with an environment set up using yours: integrate/environment.yml
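In case it helps, the setup was roughly this (the environment name here is just a label I picked; by default conda uses the name defined inside the yml):

```shell
# Create the conda environment from the integrate repository's spec and activate it
conda env create --file environment.yml --name rephetio-integrate
source activate rephetio-integrate
```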

I monitored the RAM usage and I suspect that the problem is not enough RAM... for that reason I am rerunning the calculation on a new computer with more CPU and RAM. I'll let you know how this works! Thanks!

dhimmel commented 7 years ago

> I monitored the RAM usage and I suspect that the problem is not enough RAM... for that reason I am rerunning the calculation on a new computer with more CPU and RAM. I'll let you know how this works! Thanks!

Cool. We could also look into the Neo4j config files you're using to make sure they specify an appropriate amount of RAM for your system. I think by default each Neo4j 2.3 instance will use half of the available RAM or something... so this could become problematic if you have several instances running at once.
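If I remember the Neo4j 2.3 config layout correctly, the two relevant knobs are roughly the ones below (the values are placeholders, not recommendations):

```
# conf/neo4j.properties: page cache used for the graph store files
dbms.pagecache.memory=4g

# conf/neo4j-wrapper.conf: JVM heap size, in MB
wrapper.java.initmemory=2048
wrapper.java.maxmemory=2048
```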

> I am using conda with an environment set up using yours: integrate/environment.yml

Good call. In retrospect, I should have been more explicit about dependencies across all Rephetio repos. Let me know if you ever run into version-related errors using integrate/environment.yml, so I can weed them out.

dhimmel commented 7 years ago

@NuriaQueralt is this issue resolved? If so, can you close it?

NuriaQueralt commented 7 years ago

@dhimmel good news! The calculation completed successfully on a new computer with 8 processors, 30 GB of RAM, and 16 GB of HDD. Also, following your suggestions, I optimized the Neo4j memory consumption for each instance. Specifically, I set the page cache memory and the Java heap size according to the total size of my Neo4j database store files. Not sure if this optimization was key, but it definitely helped. I am going to rerun it without this optimization step to see its significance. Thanks for your ideas!
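In case it is useful to anyone else, the store size that guided those settings can be estimated with something like this (a sketch; it assumes the default Neo4j 2.x data directory layout):

```shell
# Total size of the graph store files, as a guide for the page cache setting
du -ch data/graph.db/neostore* | tail -n 1
```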

I am testing the Rephetio workflow using a very minimal version of Hetionet. I am wondering what computer specs (CPU, RAM, etc.) you ran your experiment with on the full Hetionet (v2.0, if I remember correctly). This will be helpful for preparing appropriately for a bigger network.

dhimmel commented 7 years ago

> what computer specs (CPU, RAM, etc.) you ran your experiment with

16 3.2-GHz cores, 256 GB of memory, and 4.25 TB of storage (4 TB on hard disk and 256 GB on solid state) (source). Although you could probably get by with less.

> the full Hetionet (v2.0, if I remember correctly)

In dhimmel/integrate, the final database name/version was labeled rephetio-v2.0. However, this is actually the same as hetionet-v1.0. This happened because we originally used rephetio as the name for both the project and the network, but then we decided it would be a good idea to give the network a separate name, hence Hetionet was adopted. So rephetio-v2.0_perm-1 is actually the same thing as hetionet-v1.0-perm-1 here.

NuriaQueralt commented 7 years ago

Thanks for all the information and clarification! It is much appreciated!