Dockerfile install - GitHub Issues

brian-dellabetta commented 2 years ago

Hi,

I am trying to build an image with cuquantum and the code samples installed. Here is what I have so far, compiled from the README here and the documentation:

FROM nvcr.io/nvidia/pytorch:22.01-py3

# Get cuquantum
ENV CUQUANTUM_ROOT=/opt/cuquantum0.1.0.30
ARG TARFILE=cuquantum-linux-x86_64-0.1.0.30-archive.tar.xz
RUN wget -O /tmp/${TARFILE} \
    https://developer.download.nvidia.com/compute/cuquantum/redist/linux-x86_64/${TARFILE} && \
    mkdir -p ${CUQUANTUM_ROOT} && \
    tar xvf /tmp/${TARFILE} -C ${CUQUANTUM_ROOT} --strip-components=1 && \
    #lib64/ is missing, symlink it to lib/
    ln -s ${CUQUANTUM_ROOT}/lib ${CUQUANTUM_ROOT}/lib64 && \
    rm /tmp/${TARFILE}
ENV LD_LIBRARY_PATH=${CUQUANTUM_ROOT}/lib:${LD_LIBRARY_PATH}

# Install cuquantum python bindings, remove previous cupy version
# TODO verify
RUN pip uninstall -y cupy-cuda115 && \
    conda install -c conda-forge cuquantum-python

ENV CUSTATEVEC_ROOT=${CUQUANTUM_ROOT}
ENV CUTENSORNET_ROOT=${CUQUANTUM_ROOT}
ENV PATH=/usr/local/cuda/bin/:${PATH}

# Get samples repo
ARG TARFILE=v0.1.0.0.tar.gz
RUN wget -O /tmp/${TARFILE} https://github.com/NVIDIA/cuQuantum/archive/refs/tags/${TARFILE} && \
    mkdir -p ${CUSTATEVEC_ROOT}/code_samples && \
    tar xvf /tmp/${TARFILE} -C ${CUSTATEVEC_ROOT}/code_samples --strip-components=1 && \
    rm /tmp/${TARFILE}

The image has cupy-cuda115, the conda install of cuquantum-python installs another version of cupy as a dependency so I uninstall the old one (it will complain during import if both are available). make all builds successfully (though the lib64->lib symlink is needed for it to work), but I am unable to run the python samples without hitting import errors.
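
One way to confirm that only a single CuPy build is left after the uninstall/install step is to list the installed distributions whose names start with "cupy". This is a minimal sketch using only the Python standard library; the helper name installed_cupy_distributions is hypothetical, and it assumes both the pip wheel and the conda package register standard package metadata.

```python
from importlib import metadata


def installed_cupy_distributions():
    """Return sorted (name, version) pairs for installed distributions named cupy*.

    Hypothetical helper: it only inspects package metadata and never imports
    cupy, so it also works on a machine without a GPU or CUDA driver.
    """
    found = []
    for dist in metadata.distributions():
        name = dist.metadata.get("Name") or ""
        if name.lower().startswith("cupy"):
            found.append((name, dist.version or ""))
    return sorted(found)


cupy_dists = installed_cupy_distributions()
for name, version in cupy_dists:
    print(f"{name} {version}")
if len(cupy_dists) > 1:
    print("More than one CuPy distribution is installed; remove all but one.")
```

Running this inside the built image should print at most one line per CuPy distribution, plus the warning if two builds coexist.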

I am running on an intel-chip mac, just trying to clear up the import errors before we run this on a cloud instance with an nvidia GPU mounted in.

Before posting any stacktraces, am I on the right track here? Maybe I should use a different base image that has an equivalent version of cupy. I'm also not sure if the cuda version is incompatible.

I am happy to submit a PR with the working Dockerfile once we figure this all out :)

mtjrider commented 2 years ago

Hi @brian-dellabetta. Thanks for your interest in cuQuantum!

> The image has cupy-cuda115, the conda install of cuquantum-python installs another version of cupy as a dependency so I uninstall the old one (it will complain during import if both are available). make all builds successfully (though the lib64->lib symlink is needed for it to work), but I am unable to run the python samples without hitting import errors.

All samples require an Nvidia GPU to run. Specifically, a GPU with compute capability 7.0+. Here's a useful table.
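
The 7.0+ requirement can be expressed as a small comparison. The sketch below is illustrative only: the function names are hypothetical, and accepting a compact form such as "70" alongside "7.0" is an assumption about how some tools report compute capability.

```python
# Minimum compute capability stated in this thread.
MIN_COMPUTE_CAPABILITY = (7, 0)


def parse_compute_capability(text):
    """Parse '7.0' or the compact form '70' into a (major, minor) tuple."""
    if "." in text:
        major, minor = text.split(".", 1)
    else:
        # Assumption: in the compact form the last digit is the minor version.
        major, minor = text[:-1], text[-1]
    return int(major), int(minor)


def is_supported(text):
    """True if the given compute capability is at least 7.0."""
    return parse_compute_capability(text) >= MIN_COMPUTE_CAPABILITY


print(is_supported("7.0"))  # True
print(is_supported("6.1"))  # False
```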

> I am running on an intel-chip mac, just trying to clear up the import errors before we run this on a cloud instance with an nvidia GPU mounted in.

I'm guessing this is the issue. The import statements will fail without a valid driver installation. Without seeing the full error output, I cannot confirm.

> Before posting any stacktraces, am I on the right track here? Maybe I should use a different base image that has an equivalent version of cupy. I'm also not sure if the cuda version is incompatible.

For cuQuantum, as long as your CUDA toolkit version is 11.2+, and CuPy's version is 9.5+, you should be fine. If you have a more specific concern, please include it in your response.
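
As a rough illustration of the rule above, the following sketch compares dotted version strings against the two minimums. The helper names are hypothetical, and reading "CUDA 11.5" from the cupy-cuda115 package name is an assumption based on CuPy's wheel naming; the CuPy version 10.1.0 is taken from the conda output in this thread.

```python
def parse_version(text):
    """Turn '11.5' or '10.1.0' into a tuple of ints, stopping at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def meets_minimum(version, minimum):
    """True if version >= minimum, comparing numeric components left to right."""
    v, m = parse_version(version), parse_version(minimum)
    # Pad the shorter tuple with zeros so that '11.2' and '11.2.0' compare equal.
    width = max(len(v), len(m))
    v += (0,) * (width - len(v))
    m += (0,) * (width - len(m))
    return v >= m


MIN_CUDA = "11.2"  # minimum CUDA toolkit version stated above
MIN_CUPY = "9.5"   # minimum CuPy version stated above

print(meets_minimum("11.5", MIN_CUDA))    # True
print(meets_minimum("10.1.0", MIN_CUPY))  # True
```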

> I am happy to submit a PR with the working Dockerfile once we figure this all out :)

Unfortunately, we aren't accepting code contributions at this time.

I'm wondering why you're using wget to acquire the binaries when they are automatically installed by conda in this line:

conda install -c conda-forge cuquantum-python

(e.g.)

conda install -c conda-forge cuquantum-python
...
The following NEW packages will be INSTALLED:

...
  cupy               conda-forge/linux-64::cupy-10.1.0-py310h64c8dd9_1
  cuquantum          conda-forge/linux-64::cuquantum-0.1.0.30-h5c60f85_2
  cuquantum-python   conda-forge/linux-64::cuquantum-python-0.1.0.0-py310h013f86e_3
  cutensor           conda-forge/linux-64::cutensor-1.4.0.6-h7537e88_2
...

It is also true that all of the samples are hosted in this repository.

Let us know if you're still having trouble or if you have other questions!

leofang commented 2 years ago

One more thing:

> though the lib64->lib symlink is needed for it to work

Yes, we have become aware of this issue for building cuQuantum Python from source. We'll push a fix shortly. Thanks for bringing it up, Brian.

brian-dellabetta commented 2 years ago

@mtjrider I'm just trying to make sure the image is valid and has all dependencies before attempting to run on an nvidia GPU. This requires an nvidia V100 or higher for compute capability 7.0+, corresponding to a p3.2xlarge or higher on AWS, and these get pricey, so I'm trying to tackle as much beforehand as possible.

Here's the error I'm seeing:

>>> import cuquantum
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/cupy/__init__.py", line 18, in <module>
    from cupy import _core  # NOQA
  File "/opt/conda/lib/python3.8/site-packages/cupy/_core/__init__.py", line 1, in <module>
    from cupy._core import core  # NOQA
  File "cupy/_core/core.pyx", line 1, in init cupy._core.core
  File "/opt/conda/lib/python3.8/site-packages/cupy/cuda/__init__.py", line 8, in <module>
    from cupy.cuda import compiler  # NOQA
  File "/opt/conda/lib/python3.8/site-packages/cupy/cuda/compiler.py", line 14, in <module>
    from cupy.cuda import function
  File "cupy/cuda/function.pyx", line 1, in init cupy.cuda.function
  File "cupy/_core/_carray.pyx", line 1, in init cupy._core._carray
  File "cupy/_core/internal.pyx", line 1, in init cupy._core.internal
  File "cupy/cuda/memory.pyx", line 1, in init cupy.cuda.memory
ImportError: libcuda.so.1: cannot open shared object file: No such file or directory

This seems to me more related to the versions of cupy and libcuda than an actual runtime error from lack of a GPU. I might be mistaken, though: perhaps the driver does not live in the Docker image at all, and instead needs to be installed on the host and mounted into the container? I hope to try on a VM with a GPU later this week and will post updates here.

If not a Dockerfile, will an image be made available at some point on the NGC catalog or elsewhere? I'm sure it would be useful to others.

brian-dellabetta commented 2 years ago

Also, @mtjrider, the wget on the repo is just to pull in the code samples. I didn't see them in the installed directories:

/opt/conda/lib/python3.8/site-packages/cuquantum_python-0.1.0.0.dist-info
/opt/conda/lib/python3.8/site-packages/cuquantum

Also, thanks for all the help!

mtjrider commented 2 years ago

> @mtjrider I'm just trying to make sure the image is valid and has all dependencies before attempting to run on an nvidia GPU. This requires an nvidia V100 or higher for compute capability 7.0+, corresponding to a p3.2xlarge or higher on AWS, and these get pricey, so I'm trying to tackle as much beforehand as possible.

Makes perfect sense. Thanks for this clarification. To be clear, I've tested your Dockerfile on a system with GPUs to compile and run the tests, and it works without issue. When you deploy, please take care to confirm that the driver and compilation toolchain are compatible. The CUDA driver and kernel mode driver compatibility is documented here.

The following error indicates that the CUDA driver library (libcuda.so.1) is missing. The driver is not installed in the container. Here is an architecture overview.

>>> import cuquantum
...
ImportError: libcuda.so.1: cannot open shared object file: No such file or directory
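
To tell this missing-driver case apart from a packaging problem before importing cupy or cuquantum, one option is to ask the dynamic loader for the driver library directly. This is a minimal sketch using only the standard library; the function name cuda_driver_available is hypothetical.

```python
import ctypes


def cuda_driver_available(library_name="libcuda.so.1"):
    """Return True if the dynamic loader can open the CUDA driver library."""
    try:
        ctypes.CDLL(library_name)
    except OSError:
        # Same condition that surfaces as the ImportError in the traceback.
        return False
    return True


if cuda_driver_available():
    print("libcuda.so.1 found; imports should get past the driver check.")
else:
    print("libcuda.so.1 not found; expect the ImportError when importing cupy or cuquantum.")
```

On a machine without an NVIDIA driver, such as the Mac described earlier in the thread, this should take the second branch, which matches the traceback rather than pointing to a CuPy version conflict.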

> Also @mtjrider the wget on the repo is just to pull in the code samples. i didn't see them in the installed directories /opt/conda/lib/python3.8/site-packages/cuquantum_python-0.1.0.0.dist-info /opt/conda/lib/python3.8/site-packages/cuquantum

I meant that you may also clone the samples because they are hosted in this repository:

git clone https://github.com/NVIDIA/cuQuantum.git cuquantum && \
  ls -la cuquantum/samples
##  custatevec
##  cutensornet

Note: per this comment, I had to modify the Makefile to rename lib64 to lib. This line. Separately, I had to set LD_LIBRARY_PATH=/opt/conda/lib:$LD_LIBRARY_PATH. The command I used to compile the custatevec samples is:

CUSTATEVEC_ROOT=/opt/conda make

Here, I should note that I removed any wget commands because they are redundant with the conda install command.
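
The two environment settings described above can be captured in a small helper that prepares the environment for make. This is a sketch under stated assumptions: build_env and build_samples are hypothetical names, and the default prefix /opt/conda is the conda location used in this thread. Unlike the shell form, it avoids a trailing colon when LD_LIBRARY_PATH is initially unset.

```python
import os
import subprocess


def build_env(prefix="/opt/conda", base_env=None):
    """Return a copy of the environment with CUSTATEVEC_ROOT and LD_LIBRARY_PATH set."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUSTATEVEC_ROOT"] = prefix
    lib_dir = f"{prefix}/lib"
    existing = env.get("LD_LIBRARY_PATH", "")
    # Prepend the conda lib directory so the conda-installed libraries are found first.
    env["LD_LIBRARY_PATH"] = f"{lib_dir}:{existing}" if existing else lib_dir
    return env


def build_samples(samples_dir, prefix="/opt/conda"):
    """Run make in samples_dir with the prepared environment (requires make and the samples)."""
    return subprocess.run(["make"], cwd=samples_dir, env=build_env(prefix), check=True)


# Show the resulting variables without running make.
example = build_env(base_env={"LD_LIBRARY_PATH": "/usr/local/cuda/lib64"})
print(example["CUSTATEVEC_ROOT"])
print(example["LD_LIBRARY_PATH"])
```

Calling build_samples on the custatevec samples directory of the cloned repository would then correspond to the make command above; the exact directory path depends on where the repository was cloned.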

brian-dellabetta commented 2 years ago

@mtjrider thank you! The architecture diagram is what I was missing, this is super helpful. I appreciate your help in sanity checking the image in a working environment, we'll try to reproduce on our end.

I will close and re-open the issue if we have further questions. Thanks again for the help

NVIDIA / cuQuantum

Dockerfile install #1