Open PatWalters opened 2 years ago
Looking into it
Thanks for the report. Was able to replicate your 18.04 environment exactly in a docker image. It seems like during the training of one of the graph neural network models (the fourth one) the pytorch geometric data loader hangs after a few batches. We'll dig into the details more tomorrow.
My hunch is that it's some kind of problem with pytorch geometric's multiprocessing loader - will keep you updated.
This example segfaults with just the first model. A2a.smi.txt
Remove .txt extensions from the files above
Hey Pat!
I tried to replicate your case in a Docker environment with the relevant files, and in that containerized environment it ran through to completion without any errors using the files you sent over. Could you provide some additional information on where the segmentation fault occurs in your case, along with any additional error logs you are seeing?
Below is the Docker image I used to try to recreate your environment (Ubuntu 18.04 NVIDIA Runtime Image).
FROM nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu18.04
# Basic prereqs
RUN apt-get update && apt-get install -y curl bash git build-essential libxrender1 libxtst6
# Install Miniconda3
ENV PATH="/root/miniconda3/bin:${PATH}"
ARG PATH="/root/miniconda3/bin:${PATH}"
RUN apt-get install -y wget && rm -rf /var/lib/apt/lists/*
RUN wget \
https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh \
&& mkdir /root/.conda \
&& bash Miniconda3-latest-Linux-x86_64.sh -b \
&& rm -f Miniconda3-latest-Linux-x86_64.sh
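# Print the conda version, then pin Python 3.8
RUN conda --version
RUN conda install -y python=3.8
# Install PyTorch, PyTorch Geometric, and OCE (-y skips the confirmation prompt during the build)
RUN conda install -y pytorch==1.11.0 cudatoolkit=11.3 -c pytorch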
RUN conda install -y pyg -c pyg
RUN pip install olorenchemengine[full]
# Add optional openpyxl dependency (for reading Excel files)
RUN pip install openpyxl
# Copy files over
WORKDIR /reproduce
COPY . .
CMD ["python", "oce_model_search.py"]
I followed all of your steps and ended up with the shared library issue I ran into earlier.
conda create -n oce python=3.8
conda activate oce
conda install pytorch==1.11.0 cudatoolkit=11.3 -c pytorch
conda install -y pyg -c pyg
pip install olorenchemengine[full]
pip install openpyxl
./oce_model_search.py
Traceback (most recent call last):
File "./oce_model_search.py", line 3, in
Thanks for sticking with us. I checked the PyTorch Geometric docs, and it seems the fix for this error may be to run your program with LD_LIBRARY_PATH set, which Conda apparently doesn't do by default when activating a new environment (weird! See https://github.com/conda/conda/issues/308).
For you this command should look like:
export LD_LIBRARY_PATH="/home/pwalters/anaconda3/envs/oce/lib"
Then you can run the program with: ./oce_model_search.py
Let me know if this resolves the issue. If it does, we'll make sure to make it clear in our installation FAQs.
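A minimal sketch of the same idea that avoids hard-coding the path, assuming the environment has been activated so that conda has set CONDA_PREFIX. The function name prepend_env_lib is hypothetical and is not part of conda or OCE.

```shell
# Hypothetical helper: prepend a conda environment's lib directory to
# LD_LIBRARY_PATH. Defaults to the active environment (CONDA_PREFIX).
prepend_env_lib() {
    env_prefix="${1:-${CONDA_PREFIX:-}}"
    if [ -z "$env_prefix" ]; then
        echo "no conda environment prefix given and CONDA_PREFIX is unset" >&2
        return 1
    fi
    lib_dir="$env_prefix/lib"
    case ":${LD_LIBRARY_PATH:-}:" in
        *":$lib_dir:"*) ;;  # already on the path, nothing to do
        *) LD_LIBRARY_PATH="$lib_dir${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" ;;
    esac
    export LD_LIBRARY_PATH
}

# Usage (after `conda activate oce`):
#   prepend_env_lib && ./oce_model_search.py
```

Prepending rather than overwriting keeps any existing LD_LIBRARY_PATH entries, which may matter if the system CUDA libraries are found through that variable.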
Please ignore the previous email. The cache for my "locate" command was out of date.
When I run this:
find /home/pwalters/anaconda3/envs/oce/ -name libtorch_cuda_cu.so -print
the file isn't in that directory. It isn't being installed.
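The check above can be wrapped so that it returns a usable exit status, which makes it easier to repeat across environments. This is only a sketch: find_shared_lib is a hypothetical helper name, and it does nothing beyond the find call already shown.

```shell
# Hypothetical helper: print every path under a prefix whose file name matches
# the given library name; return non-zero if there is no match.
find_shared_lib() {
    prefix="$1"
    lib_name="$2"
    matches=$(find "$prefix" -name "$lib_name" -print 2>/dev/null)
    if [ -n "$matches" ]; then
        printf '%s\n' "$matches"
        return 0
    fi
    echo "$lib_name not found under $prefix" >&2
    return 1
}

# Usage:
#   find_shared_lib /home/pwalters/anaconda3/envs/oce libtorch_cuda_cu.so
```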
This seems like it may even be hardware- or driver-specific (GPU + CUDA driver compatibility). Without more knowledge of the system, it's a bit difficult to assess how to proceed.
The commands below install successfully on one of our machines, which runs Ubuntu 18.04 with CUDA 11.3. What CUDA driver do you have?
In essence, we think that the CUDA versions used by the driver, torch, and torch-geometric are incompatible somewhere. Our usual solution is then to specify the CUDA version directly, as in the commands below (this is also what the Docker image Raunak shared replicates).
If that doesn't work, knowing your CUDA driver may let us reproduce the issue and resolve it that way. Failing that, we can offer to come in (however is best) and examine the machine directly.
conda install pytorch==1.11.0 torchvision==0.12.0 torchaudio==0.11.0 cudatoolkit=11.3 -c pytorch
pip install torch-scatter torch-sparse torch-cluster torch-spline-conv torch-geometric -f https://data.pyg.org/whl/torch-1.11.0+cu113.html
P.S. Does the current machine differ in any other way from the machine you described in https://github.com/Oloren-AI/olorenchemengine/issues/48?
Any difference may help diagnose the issue.
This is the same machine I described in #48, just a fresh install. Unfortunately, I deleted the previous install.
nvidia-smi
Fri Nov 11 12:44:38 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01    Driver Version: 465.19.01    CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA Graphics...  On   | 00000000:18:00.0 Off |                  Off |
| 41%   34C    P8    13W / 140W |     84MiB / 16108MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1477      G   /usr/lib/xorg/Xorg                 64MiB |
|    0   N/A  N/A      1560      G   /usr/bin/gnome-shell               17MiB |
+-----------------------------------------------------------------------------+
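For comparing machines, the CUDA version in the nvidia-smi header can be pulled out with a short filter. This is a sketch only; smi_cuda_version is a hypothetical helper name. Note that the header value is the highest CUDA version the driver supports, not necessarily the toolkit version that PyTorch was built against.

```shell
# Hypothetical helper: read nvidia-smi output on stdin and print the value of
# the "CUDA Version" header field (prints nothing if the field is absent).
smi_cuda_version() {
    sed -n 's/.*CUDA Version: *\([0-9][0-9.]*\).*/\1/p' | head -n 1
}

# Usage:
#   nvidia-smi | smi_cuda_version
```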
I'm going to try this with docker, it may be simpler for everyone involved.
Yeah, that sounds great. In the meantime, we will try to procure a similar machine and get this error replicated although that will take an undetermined amount of time. Let's keep each other posted!
Just to confirm the recommended installation recipe.
conda create -n oce python=3.8
conda activate oce
conda install pytorch==1.11.0 torchvision==0.12.0 torchaudio==0.11.0 cudatoolkit=11.3 -c pytorch
pip install torch-scatter torch-sparse torch-cluster torch-spline-conv torch-geometric -f https://data.pyg.org/whl/torch-1.11.0+cu113.html
pip install olorenchemengine[full]
Correct?
Correct, with the qualification that this is for CUDA 11.3.
To break it down:
conda create -n oce python=3.8 gets a stable version of Python for PyTorch and PyTorch Geometric.
conda install pytorch==1.11.0 torchvision==0.12.0 torchaudio==0.11.0 cudatoolkit=11.3 -c pytorch is the official recommended installation for PyTorch.
pip install torch-scatter torch-sparse torch-cluster torch-spline-conv torch-geometric -f https://data.pyg.org/whl/torch-1.11.0+cu113.html is the (historical) official recommended build for PyTorch Geometric.
pip install olorenchemengine[full] is the installation for OCE.
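Since the recipe is tied to CUDA 11.3, the wheel index URL has to change with the CUDA version. The sketch below derives it from a PyTorch version and a CUDA version, following the URL pattern quoted in this thread. pyg_wheel_url is a hypothetical helper name, and it is an assumption that an index page exists for any given combination, so check the URL before relying on it.

```shell
# Hypothetical helper: build the PyG wheel index URL from a PyTorch version
# (e.g. 1.11.0) and a CUDA version (e.g. 11.3), following the pattern
# https://data.pyg.org/whl/torch-<torch>+cu<cuda without dots>.html
pyg_wheel_url() {
    torch_version="$1"
    cuda_tag="cu$(printf '%s' "$2" | tr -d '.')"
    printf 'https://data.pyg.org/whl/torch-%s+%s.html\n' "$torch_version" "$cuda_tag"
}

# Usage:
#   pip install torch-scatter torch-sparse torch-cluster torch-spline-conv \
#       torch-geometric -f "$(pyg_wheel_url 1.11.0 11.3)"
```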
I did a fresh install, and the kernel dies when I try to run mm.run(models)
This is with Ubuntu 18.04.6 LTS. I installed with:
conda create -n oce python=3.8
conda activate oce
bash <(curl -s https://raw.githubusercontent.com/Oloren-AI/olorenchemengine/master/install.sh)
conda install pytorch==1.11.0 torchvision==0.12.0 torchaudio==0.11.0 cudatoolkit=11.3 -c pytorch
pip install torch-scatter torch-sparse torch-cluster torch-spline-conv torch-geometric -f https://data.pyg.org/whl/torch-1.11.0+cu113.html