kernel dies when doing model search with 1B_Model_Searching.ipynb

PatWalters commented 2 years ago

I did a fresh install, and the kernel dies when I try to run mm.run(models).

This is with Ubuntu 18.04.6 LTS. I installed with:
conda create -n oce python=3.8
conda activate oce
bash <(curl -s https://raw.githubusercontent.com/Oloren-AI/olorenchemengine/master/install.sh)
conda install pytorch==1.11.0 torchvision==0.12.0 torchaudio==0.11.0 cudatoolkit=11.3 -c pytorch
pip install torch-scatter torch-sparse torch-cluster torch-spline-conv torch-geometric -f https://data.pyg.org/whl/torch-1.11.0+cu113.html

davidzqhuang commented 2 years ago

Looking into it

raunakdoesdev commented 2 years ago

Thanks for the report. Was able to replicate your 18.04 environment exactly in a docker image. It seems like during the training of one of the graph neural network models (the fourth one) the pytorch geometric data loader hangs after a few batches. We'll dig into the details more tomorrow.

My hunch is that it's some kind of problem with pytorch geometric's multiprocessing loader - will keep you updated.
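One way to see where a hang like this is stuck is Python's standard-library faulthandler module, which can dump every thread's traceback if a call is still running after a timeout. The sketch below is an illustration, not something the thread confirms was used; run_with_hang_watchdog is a hypothetical helper name.

```python
import faulthandler
import sys


def run_with_hang_watchdog(fn, timeout_s=60.0, out=None):
    """Run fn(). If it is still running after timeout_s seconds, dump the
    Python traceback of every thread to `out` (the process keeps running)."""
    target = sys.stderr if out is None else out
    # `target` must be a real file object with a file descriptor.
    faulthandler.dump_traceback_later(timeout_s, exit=False, file=target)
    try:
        return fn()
    finally:
        # Disarm the watchdog once fn() has returned or raised.
        faulthandler.cancel_dump_traceback_later()
```

In the notebook this could wrap the failing call, for example run_with_hang_watchdog(lambda: mm.run(models), timeout_s=600). The dump only shows Python frames, so a stall inside a worker process or native code shows up as the Python line that is waiting on it.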

PatWalters commented 2 years ago

This example segfaults with just the first model. A2a.smi.txt

PatWalters commented 2 years ago

oce_model_search.py.txt

PatWalters commented 2 years ago

Remove .txt extensions from the files above

raunakdoesdev commented 2 years ago

Hey Pat!

I tried to replicate your case in a Docker environment with the files you sent over, but in that containerized environment the script ran to completion without any errors. Could you provide some additional information on where the segmentation fault occurs in your case, along with any additional error logs you are seeing?
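A minimal sketch of one way to capture where a segmentation fault occurs: run the script in a child process with Python's built-in faulthandler enabled (the -X faulthandler interpreter option), then read the traceback it writes to stderr when the process dies. run_with_faulthandler is a hypothetical helper name, and this assumes the crash happens while Python code is on the stack.

```python
import subprocess
import sys


def run_with_faulthandler(args, timeout_s=None):
    """Run `python -X faulthandler <args>` in a child process.

    Returns (returncode, stderr). On a fatal signal such as SIGSEGV,
    faulthandler writes the Python traceback of the crashing thread
    to stderr before the process exits.
    """
    cmd = [sys.executable, "-X", "faulthandler", *args]
    proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    return proc.returncode, proc.stderr
```

For the script in this thread the call would be run_with_faulthandler(["oce_model_search.py"]). The traceback stops at the Python line that called into native code, which is usually enough to tell which library crashed.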

Below is the Docker image I used to try to recreate your environment (Ubuntu 18.04 NVIDIA runtime image).

FROM nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu18.04

# Basic prereqs
RUN apt-get update && apt-get install -y curl bash git build-essential libxrender1 libxtst6

# Install Miniconda3
ENV PATH="/root/miniconda3/bin:${PATH}"
ARG PATH="/root/miniconda3/bin:${PATH}"
RUN apt-get install -y wget && rm -rf /var/lib/apt/lists/*
RUN wget \
    https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh \
    && mkdir /root/.conda \
    && bash Miniconda3-latest-Linux-x86_64.sh -b \
    && rm -f Miniconda3-latest-Linux-x86_64.sh

# Print the conda version and pin Python 3.8
RUN conda --version
RUN conda install -y python=3.8

# Install PyTorch, PyTorch Geometric, and OCE
RUN conda install -y pytorch==1.11.0 cudatoolkit=11.3 -c pytorch
RUN conda install -y pyg -c pyg
RUN pip install olorenchemengine[full]

# Add optional openpyxl dependency (for reading Excel files)
RUN pip install openpyxl

# Copy files over
WORKDIR /reproduce
COPY . .
CMD ["python", "oce_model_search.py"]

PatWalters commented 2 years ago

I followed all of your steps and ended with the shared library issue I ran into earlier.

conda create -n oce python=3.8
conda activate oce
conda install pytorch==1.11.0 cudatoolkit=11.3 -c pytorch
conda install -y pyg -c pyg
pip install olorenchemengine[full]
pip install openpyxl
./oce_model_search.py

Traceback (most recent call last):
  File "./oce_model_search.py", line 3, in <module>
    import olorenchemengine as oce
  File "/home/pwalters/anaconda3/envs/oce/lib/python3.8/site-packages/olorenchemengine/__init__.py", line 70, in <module>
    __import__(imp)
  File "/home/pwalters/anaconda3/envs/oce/lib/python3.8/site-packages/torch_geometric/__init__.py", line 4, in <module>
    import torch_geometric.data
  File "/home/pwalters/anaconda3/envs/oce/lib/python3.8/site-packages/torch_geometric/data/__init__.py", line 1, in <module>
    from .data import Data
  File "/home/pwalters/anaconda3/envs/oce/lib/python3.8/site-packages/torch_geometric/data/data.py", line 20, in <module>
    from torch_sparse import SparseTensor
  File "/home/pwalters/anaconda3/envs/oce/lib/python3.8/site-packages/torch_sparse/__init__.py", line 19, in <module>
    torch.ops.load_library(spec.origin)
  File "/home/pwalters/anaconda3/envs/oce/lib/python3.8/site-packages/torch/_ops.py", line 220, in load_library
    ctypes.CDLL(path)
  File "/home/pwalters/anaconda3/envs/oce/lib/python3.8/ctypes/__init__.py", line 373, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: libtorch_cuda_cu.so: cannot open shared object file: No such file or directory

raunakdoesdev commented 2 years ago

Thanks for sticking with us. I checked the PyTorch Geometric docs, and it seems the fix for this error might be to run your program with LD_LIBRARY_PATH set, which Conda apparently doesn't do by default when activating a new environment (weird! - https://github.com/conda/conda/issues/308).

For you, this command should look like:
export LD_LIBRARY_PATH="/home/pwalters/anaconda3/envs/oce/lib"

Then you can run the program with:
./oce_model_search.py

Let me know if this resolves the issue. If it does, we'll make sure to make it clear in our installation FAQs.
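The same idea can be sketched in Python: build an environment with the library directory prepended to LD_LIBRARY_PATH and launch the script as a child process. The dynamic loader reads the variable when a process starts, so changing it inside an already running interpreter is generally too late. Both function names here are hypothetical.

```python
import os
import subprocess
import sys


def env_with_lib_dir(lib_dir, base_env=None):
    """Return a copy of the environment with lib_dir prepended to LD_LIBRARY_PATH."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_LIBRARY_PATH", "")
    env["LD_LIBRARY_PATH"] = lib_dir + os.pathsep + existing if existing else lib_dir
    return env


def run_script(script, lib_dir):
    """Run `script` with the current interpreter and the adjusted library path."""
    return subprocess.run([sys.executable, script], env=env_with_lib_dir(lib_dir))
```

A call such as run_script("oce_model_search.py", os.path.join(sys.prefix, "lib")) would point the loader at the active environment's lib directory. This only helps when the missing library actually exists somewhere on that path.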

PatWalters commented 2 years ago

Please ignore the previous email. The cache on my "locate" command was out of date.

I ran this:
find /home/pwalters/anaconda3/envs/oce/ -name libtorch_cuda_cu.so -print

The file isn't in that directory. It isn't being installed.
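For reference, a rough Python equivalent of that find check, which searches the active environment (sys.prefix) by default. find_shared_library is a hypothetical helper, not part of any library mentioned in this thread.

```python
import sys
from pathlib import Path


def find_shared_library(name, root=None):
    """Return the sorted paths under `root` whose filename is exactly `name`.

    `root` defaults to sys.prefix, i.e. the active conda or virtual environment.
    """
    base = Path(sys.prefix if root is None else root)
    return sorted(p for p in base.rglob(name) if p.is_file())
```

An empty list from find_shared_library("libtorch_cuda_cu.so") would match the result above: the file is absent from the environment, so setting LD_LIBRARY_PATH alone cannot fix the import.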

davidzqhuang commented 2 years ago

This seems like it may even be hardware/driver-specific (GPU + CUDA driver compatibility). Without more knowledge of the system, it's a bit difficult to assess how to proceed.

The commands below install successfully on a machine we have that runs Ubuntu 18.04 with CUDA version 11.3. What CUDA driver do you have?

In essence, we think that the CUDA versions used by the driver/torch/torch-geometric are incompatible somewhere. Our usual solution is to directly specify the CUDA version, as in the commands below (and that is what is replicated in the Docker image Raunak shared).

If that doesn't work and we know your CUDA driver, we may be able to replicate the issue and resolve it that way. If we can't, we can offer to come in (however is best) and directly examine the machine.

conda install pytorch==1.11.0 torchvision==0.12.0 torchaudio==0.11.0 cudatoolkit=11.3 -c pytorch
pip install torch-scatter torch-sparse torch-cluster torch-spline-conv torch-geometric -f https://data.pyg.org/whl/torch-1.11.0+cu113.html
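To compare the versions in play, a small report like the following may help. It is a sketch under assumptions: the distribution names in PACKAGES are taken from the pip command above and may differ for conda-installed builds. Versions are read from package metadata rather than by importing, so the report still works when importing torch_sparse fails.

```python
from importlib import metadata

# Distribution names assumed from the pip command in this thread.
PACKAGES = (
    "torch",
    "torch-geometric",
    "torch-scatter",
    "torch-sparse",
    "torch-cluster",
    "torch-spline-conv",
)


def installed_versions(packages=PACKAGES):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def torch_cuda_info():
    """Return (CUDA version torch was built with, whether a GPU is usable),
    or None if torch cannot be imported."""
    try:
        import torch
    except (ImportError, OSError):
        return None
    return torch.version.cuda, torch.cuda.is_available()
```

If torch.version.cuda reports something other than 11.3, or is None for a CPU-only build, the cu113 extension wheels would not match the installed torch, which is one plausible source of a missing libtorch_cuda_cu.so.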

davidzqhuang commented 2 years ago

P.S. Is there any other way in which the current machine differs from the machine you described in https://github.com/Oloren-AI/olorenchemengine/issues/48?

That difference may help diagnose the issue.

PatWalters commented 2 years ago

This is the same machine I described in #48, just a fresh install. Unfortunately, I deleted the previous install.

nvidia-smi
Fri Nov 11 12:44:38 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01    Driver Version: 465.19.01    CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA Graphics...  On   | 00000000:18:00.0 Off |                  Off |
| 41%   34C    P8    13W / 140W |     84MiB / 16108MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1477      G   /usr/lib/xorg/Xorg                 64MiB |
|    0   N/A  N/A      1560      G   /usr/bin/gnome-shell               17MiB |
+-----------------------------------------------------------------------------+
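For anyone scripting this check, the driver and CUDA versions can be pulled out of the nvidia-smi banner with a regular expression. This is a minimal sketch that assumes the header layout shown above; parse_nvidia_smi_header is a hypothetical helper.

```python
import re

# Matches the banner line, e.g. "Driver Version: 465.19.01    CUDA Version: 11.3"
HEADER_RE = re.compile(
    r"Driver Version:\s*(?P<driver>[\d.]+)\s+CUDA Version:\s*(?P<cuda>[\d.]+)"
)


def parse_nvidia_smi_header(text):
    """Return (driver_version, cuda_version) as strings, or None if not found."""
    match = HEADER_RE.search(text)
    if match is None:
        return None
    return match.group("driver"), match.group("cuda")
```

Note that the "CUDA Version" shown by nvidia-smi is the highest CUDA version the driver supports, not the toolkit installed in the conda environment, so 11.3 here is consistent with cudatoolkit=11.3.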

PatWalters commented 2 years ago

I'm going to try this with docker, it may be simpler for everyone involved.

davidzqhuang commented 2 years ago

Yeah, that sounds great. In the meantime, we will try to procure a similar machine and get this error replicated although that will take an undetermined amount of time. Let's keep each other posted!

PatWalters commented 2 years ago

Just to confirm the recommended installation recipe:

conda create -n oce python=3.8
conda activate oce
conda install pytorch==1.11.0 torchvision==0.12.0 torchaudio==0.11.0 cudatoolkit=11.3 -c pytorch
pip install torch-scatter torch-sparse torch-cluster torch-spline-conv torch-geometric -f https://data.pyg.org/whl/torch-1.11.0+cu113.html
pip install olorenchemengine[full]

Correct?

davidzqhuang commented 2 years ago

Correct, with the qualification that this is for CUDA 11.3.

To break it down:

conda create -n oce python=3.8 gets a stable version of Python for pytorch/torch-geometric.

conda install pytorch==1.11.0 torchvision==0.12.0 torchaudio==0.11.0 cudatoolkit=11.3 -c pytorch is the officially recommended installation for pytorch.

pip install torch-scatter torch-sparse torch-cluster torch-spline-conv torch-geometric -f https://data.pyg.org/whl/torch-1.11.0+cu113.html is the (historical) officially recommended build for torch-geometric.

pip install olorenchemengine[full] installs OCE.
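Because the recipe is specific to CUDA 11.3, the part that has to change for another setup is the wheel index URL passed to -f. The sketch below builds that URL from a torch version and a CUDA version. The pattern is inferred only from the single URL in this thread, so other combinations may not exist on the server; pyg_wheel_index is a hypothetical helper.

```python
def pyg_wheel_index(torch_version, cuda_version):
    """Build the wheel index URL for a torch/CUDA pair.

    Assumes the naming seen in this thread: torch-<version>+cu<digits>.html,
    where <digits> is the CUDA version with the dot removed.
    """
    cuda_tag = "cu" + cuda_version.replace(".", "")
    return f"https://data.pyg.org/whl/torch-{torch_version}+{cuda_tag}.html"
```

With the versions from this recipe, pyg_wheel_index("1.11.0", "11.3") reproduces the URL used above.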

Oloren-AI / olorenchemengine

kernel dies when doing model search with 1B_Model_Searching.ipynb #63