ersilia-os / eos7a45

GNU General Public License v3.0

Clean UP & Dockerization of eos7a45 #9

Closed GemmaTuron closed 1 year ago

GemmaTuron commented 1 year ago

Hi @HellenNamulinda

Please check that the model is up to date with the new version of the template and all the workflows are in place!

HellenNamulinda commented 1 year ago

I forked and cloned this model's repo. Before starting to modify the code, I wanted to be sure that it works. Unfortunately, it doesn't: I tried both the local repo and fetching from the Ersilia Model Hub, but I get an Output error (log file). So I'm working on it to make sure it works, comparing against the original source code.

GemmaTuron commented 1 year ago

Perfect, let's talk about it in our meeting tomorrow!

GemmaTuron commented 1 year ago

Hi @HellenNamulinda Here is the log file

out.log

GemmaTuron commented 1 year ago

@HellenNamulinda Have you had time to look into this? Please share the updates here when you have them

HellenNamulinda commented 1 year ago

@HellenNamulinda Have you had time to look into this? Please share the updates here when you have them

From this particular log file, the exact file leading to the crash wasn't specified. So what I did was:

  1. I cloned the original repo to test the original model, but I ran into package conflicts when creating the conda env from the provided .yml file, so that wasn't very helpful.
  2. I then created an env using the requirements specified in Ersilia's version of the model repo. I got a Fatal Python error: Segmentation fault (core dumped). Using faulthandler (a generic example of enabling it is sketched after this list), I found that the package causing the crash was torch_geometric, one of the data loaders used.

    • Most of the suggestions point to segfaults being caused by packages built with an older version of gcc, which requires reinstalling newer versions.

    I tried to create an env with higher versions; the only remaining challenge is that the checkpoints are PyTorch Lightning models and were trained on GPU. The model is loaded with load_from_checkpoint. When I run it, I get: RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU. When I instead load it with torch.load(checkpoint_path, map_location=torch.device('cpu')), it raises: TypeError: 'model' must be a 'LightningModule' or 'torch._dynamo.OptimizedModule', got 'dict'. This is because torch.load() returns the raw checkpoint dict rather than a LightningModule, which is what the downstream code expects.
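For reference, a generic illustration of how faulthandler can be enabled to surface a traceback when the interpreter segfaults inside a C extension (the script layout below is a placeholder, not the actual eos7a45 entry point):

# Generic illustration, not the eos7a45 code: enable faulthandler before the
# code path that segfaults so Python dumps a traceback instead of only
# printing "Segmentation fault (core dumped)".
import faulthandler

faulthandler.enable()  # dump the Python traceback on SIGSEGV/SIGABRT

# ... the rest of the prediction script; the crashing torch_geometric call
# then shows up in the dumped traceback ...

# Alternatively, without modifying the code:
#   python -X faulthandler main.py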

I'm still figuring out the best way to load the GPU-trained PyTorch Lightning model on CPU (one possible approach is sketched below). I have also thought of retraining.
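For reference, a minimal sketch of remapping a GPU-trained Lightning checkpoint to CPU, assuming the original LightningModule class is importable (the class name, module and checkpoint path below are placeholders, not the actual eos7a45 code):

# Minimal sketch, not the eos7a45 code: GNNModel and the paths are placeholders.
import torch
from model_code import GNNModel  # hypothetical import of the original LightningModule

# load_from_checkpoint accepts map_location, so CUDA-trained weights can be
# remapped to CPU without dropping to a bare torch.load(), which only returns
# the raw checkpoint dict (hence the TypeError above).
model = GNNModel.load_from_checkpoint(
    "checkpoints/model.ckpt",
    map_location=torch.device("cpu"),
)
model.eval()  # inference on CPU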

GemmaTuron commented 1 year ago

Hi @HellenNamulinda

The model was running on CPU only, so it shouldn't be a problem that it was trained on GPU. What I suggest is:

  • Understand the architecture of the model: look at the original repository's .yml file and compare it with the installs that we are doing. We need to do the installs for CPU, not GPU, which is also specified in the Dockerfile.
  • Identify the right package installation, rather than changing the code, as that will likely give problems like the ones you report.

HellenNamulinda commented 1 year ago

Hi @HellenNamulinda

The model was running on CPU only, so it shouldn't be a problem that it was trained on GPU. What I suggest is:

  • Understand the architecture of the model: look at the original repository's .yml file and compare it with the installs that we are doing. We need to do the installs for CPU, not GPU, which is also specified in the Dockerfile.
  • Identify the right package installation, rather than changing the code, as that will likely give problems like the ones you report.

From the .yml file in the original repository, the PyTorch dependencies (from line 142) require CUDA (line 37: cudatoolkit=11.3.1=h9edb442_10); see for example lines 142 to 148:

  - pytorch=1.11.0=py3.7_cuda11.3_cudnn8.2.0_0
  - pytorch-cluster=1.6.0=py37_torch_1.11.0_cu113
  - pytorch-lightning=1.3.8=pyhd8ed1ab_0
  - pytorch-mutex=1.0=cuda
  - pytorch-scatter=2.0.9=py37_torch_1.11.0_cu113
  - pytorch-sparse=0.6.14=py37_torch_1.11.0_cu113
  - pytorch-spline-conv=1.2.1=py37_torch_1.11.0_cu113

For the repo in Ersilia, our dependencies in the Dockerfile are CPU-only:

RUN conda install pytorch==1.8.0 torchvision==0.9.0 torchaudio==0.8.0 cpuonly -c pytorch
RUN pip install pytorch-lightning==1.3
RUN pip install torch-scatter==2.0.8 torch-sparse==0.6.10 torch-cluster==1.5.9 torch-spline-conv==1.2.1 torch-geometric==2.1.0 -f https://data.pyg.org/whl/torch-1.8.0%2Bcpu.html

I'm yet to find out how the model was able to work initially with the requirements in the Dockerfile, because currently fetching this model with the Ersilia CLI fails.
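For reference, a quick way to sanity-check the CPU-only environment before fetching (a generic snippet, not part of the repo):

# Generic sanity check, not part of the eos7a45 repo: confirms the CPU-only
# stack imports cleanly and that no CUDA build slipped into the environment.
import torch
import torch_geometric
import pytorch_lightning as pl

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("torch_geometric:", torch_geometric.__version__)
print("pytorch_lightning:", pl.__version__)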

GemmaTuron commented 1 year ago

Hi @HellenNamulinda,

Try to reproduce the steps manually (not fetching from Ersilia), installing the exact requirements indicated in the Dockerfile. If this fails, we will know for sure which package needs updating and we can deal with it!

HellenNamulinda commented 1 year ago

Hi @HellenNamulinda,

Try to reproduce the steps manually (not fetching from Ersilia), installing the exact requirements indicated in the Dockerfile. If this fails, we will know for sure which package needs updating and we can deal with it!

That's what I did: I installed the packages in a new conda environment and ran the model. That's when I got the error: RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.

I'm going to run the original model on a GPU by manually installing the requirements specified in their .yml file and see if it works there.

GemmaTuron commented 1 year ago

Hi @HellenNamulinda !

I am trying to reproduce your steps so I can provide support, but I am unable to get faulthandler pointing to the torch_geometric issue. Can you please share where you called the faulthandler package and paste the output here? Thanks!

HellenNamulinda commented 1 year ago

Hello @GemmaTuron, I was able to successfully fetch the model locally after updating the libraries. The Dockerfile RUN commands were modified to run in the following order:

RUN cconda install pytorch==1.8.0 torchvision==0.9.0 torchaudio==0.8.0 cpuonly -c pytorch
RUN pip install numpy 
RUN pip install pandas 
RUN pip install rdkit-pypi
RUN pip install pytorch-lightning==1.4.5  
RUN pip install torchmetrics==0.6.2
RUN pip install torch-geometric==2.1.0
RUN pip install torch_scatter torch_sparse torch_cluster torch_spline_conv -f https://data.pyg.org/whl/torch-1.8.0+cpu.html
RUN pip install more-itertools==8.8.0

This is the log file for fetching the model: eos7a45_fetch.log

The predictions on the eml dataset: eml_eos7a45.csv

And the prediction on a single molecule: eos7a45-predict-one.log

I pushed the changes and created a pull request here

Note:

GemmaTuron commented 1 year ago

Hi @HellenNamulinda

I've caught a typo in the Dockerfile and just corrected it; the Actions are running again, check them out.

miquelduranfrigola commented 1 year ago

I think this worked, so let's close the issue. However, building the Docker image took 3 hours on GitHub Actions. I am re-running the workflows, just in case.