Closed GemmaTuron closed 1 year ago
I forked and cloned this model's repo. Before starting to modify the code, I wanted to be sure that it works. Unfortunately, it doesn't. I tried both the local repo and fetching from the Ersilia Model Hub, but I get an Output error (log file). So, I'm working on ensuring that it works by comparing with the original source code.
Perfect, let's talk about it in our meeting tomorrow!
Hi @HellenNamulinda Here is the log file
@HellenNamulinda Have you had time to look into this? Please share the updates here when you have them
From this particular log file, the exact file leading to the crash wasn't specified. So what I did is: I created an env using the requirements specified in the Ersilia model repo version, and got a Fatal Python error: Segmentation fault (core dumped). Using faulthandler, I realized the package leading to the crash is torch_geometric, one of the data loaders used.
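A minimal sketch of the faulthandler usage described above: enabling it before importing the suspect package makes the interpreter print a traceback when the segfault happens, which points at the crashing module. The torch_geometric import is left commented out here since it is the package that crashed in this environment:

```python
import faulthandler

# Enable before importing any suspect packages: on a segmentation
# fault the interpreter dumps the Python traceback that was active,
# which identifies the module whose import or use crashed.
faulthandler.enable()

print(faulthandler.is_enabled())  # True

# import torch_geometric  # in this case, the import that segfaulted
```

Equivalently, running the script with `PYTHONFAULTHANDLER=1` set in the environment enables the same handler without touching the code.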
I tried to create an env with higher versions; the only challenge now is that the checkpoints are PyTorch Lightning models and were trained using GPU. The model is loaded using load_from_checkpoint.
When I run, I get: RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
When I change to loading it with torch.load(checkpoint_path, map_location=torch.device('cpu')), it raises a TypeError: 'model' must be a 'LightningModule' or 'torch._dynamo.OptimizedModule', got 'dict'. This is because torch.load returns the raw checkpoint dictionary, not a LightningModule.
I'm still figuring out the best way to load the GPU-trained PyTorch Lightning model on CPU. I have also thought of retraining.
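For illustration, a small self-contained sketch of why torch.load alone is not enough: with map_location it deserializes fine on a CPU-only machine, but it yields a plain dict. The demo.ckpt file and its contents are made up here; with Lightning, passing map_location directly to load_from_checkpoint (e.g. MyModel.load_from_checkpoint(path, map_location='cpu')) is the usual route, since that constructs the module for you:

```python
import torch

# A .ckpt file is essentially a pickled dict of tensors and metadata.
ckpt = {"state_dict": {"layer.weight": torch.zeros(2, 2)}, "epoch": 7}
torch.save(ckpt, "demo.ckpt")

# map_location='cpu' fixes the CUDA deserialization error, but the
# result is still a plain dict; passing it where a LightningModule
# is expected raises the TypeError seen above.
loaded = torch.load("demo.ckpt", map_location=torch.device("cpu"))
print(type(loaded).__name__)  # dict
print(loaded["epoch"])        # 7
```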
Hi @HellenNamulinda
The model was running on CPU only, so it shouldn't be a problem that it was trained on GPU. What I suggest is:
- Understand the architecture of the model: look at the original repository's .yml file and compare it with the installs we are doing. We need to do the installs for CPU, not GPU, which is also specified in the Dockerfile.
- Identify the right package installation rather than changing the code, as changing the code will likely give problems like the ones you report.
From the .yml file in the original repository, the PyTorch dependencies (lines 142 to 148) require CUDA (see also line 37: cudatoolkit=11.3.1=h9edb442_10):
```yaml
- pytorch=1.11.0=py3.7_cuda11.3_cudnn8.2.0_0
- pytorch-cluster=1.6.0=py37_torch_1.11.0_cu113
- pytorch-lightning=1.3.8=pyhd8ed1ab_0
- pytorch-mutex=1.0=cuda
- pytorch-scatter=2.0.9=py37_torch_1.11.0_cu113
- pytorch-sparse=0.6.14=py37_torch_1.11.0_cu113
- pytorch-spline-conv=1.2.1=py37_torch_1.11.0_cu113
```
For the repo in Ersilia, our dependencies in the Dockerfile are CPU-only:
```dockerfile
RUN conda install pytorch==1.8.0 torchvision==0.9.0 torchaudio==0.8.0 cpuonly -c pytorch
RUN pip install pytorch-lightning==1.3
RUN pip install torch-scatter==2.0.8 torch-sparse==0.6.10 torch-cluster==1.5.9 torch-spline-conv==1.2.1 torch-geometric==2.1.0 -f https://data.pyg.org/whl/torch-1.8.0%2Bcpu.html
```
I'm yet to find out how the model was able to work initially with the requirements in the Dockerfile, because currently fetching this model using the Ersilia CLI fails.
Hi @HellenNamulinda,
Try to reproduce the steps manually (not fetching from Ersilia) by installing the exact requirements indicated in the Dockerfile. If this fails, we will know for sure which package needs updating and we can deal with it!
That's what I did: I installed the packages in a new conda environment and ran the model. That's when I got the error: RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
I'm going to run the original model on a GPU by manually installing the requirements specified in their .yml file, and see if it works there.
Hi @HellenNamulinda !
I am trying to reproduce your steps so I can provide support, but I am unable to get faulthandler pointing to the torch_geometric issue. Can you please share where you called the faulthandler package, and paste the output here?
Thanks!
Hello @GemmaTuron, I was able to successfully fetch the model locally after updating the libraries. The Dockerfile RUN commands were modified to run in the following order:
```dockerfile
RUN conda install pytorch==1.8.0 torchvision==0.9.0 torchaudio==0.8.0 cpuonly -c pytorch
RUN pip install numpy
RUN pip install pandas
RUN pip install rdkit-pypi
RUN pip install pytorch-lightning==1.4.5
RUN pip install torchmetrics==0.6.2
RUN pip install torch-geometric==2.1.0
RUN pip install torch_scatter torch_sparse torch_cluster torch_spline_conv -f https://data.pyg.org/whl/torch-1.8.0+cpu.html
RUN pip install more-itertools==8.8.0
```
This is the log file for fetching the model, eos7a45_fetch.log
The predictions on the eml dataset, eml_eos7a45.csv
And the prediction on a single molecule: eos7a45-predict-one.log
I pushed the changes and created a pull request here
Note:
Model size is ~429.13 MB
Hi @HellenNamulinda
I've caught a typo in the Dockerfile. I just corrected it; the Actions are running again, check them out!
I think this worked, so let's close the issue. However, building the Docker image took 3h on GitHub Actions. I am re-running the workflows, just in case.
Hi @HellenNamulinda
Please check that the model is up to date with the new version of the template and all the workflows are in place!