NVIDIA / semantic-segmentation

Nvidia Semantic Segmentation monorepo
BSD 3-Clause "New" or "Revised" License
1.76k stars · 388 forks

Docker build fails on Amazon SageMaker: fatal error: ATen/cuda/DeviceUtils.cuh: No such file or directory #include "ATen/cuda/DeviceUtils.cuh" #168

Open la-cruche opened 2 years ago

la-cruche commented 2 years ago

Hi,

I'm trying to run `docker build .` on a SageMaker-managed EC2 instance in AWS (ml.g4dn.12xlarge, with T4 cards). The build runs for a few minutes, produces output, and then errors with the following:

csrc/layer_norm_cuda_kernel.cu:4:10: fatal error: ATen/cuda/DeviceUtils.cuh: No such file or directory
 #include "ATen/cuda/DeviceUtils.cuh"
          ^~~~~~~~~~~~~~~~~~~~~~~~~~~
compilation terminated.

Interestingly, early in the build it says

Step 17/18 : RUN cd /home/ && git clone https://github.com/NVIDIA/apex.git apex && cd apex && python setup.py install --cuda_ext --cpp_ext
 ---> Running in e8df4e2bf69e
Cloning into 'apex'...
No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'

Warning: Torch did not find available GPUs on this system.
 If your intention is to cross-compile, this is not an error.
By default, Apex will cross-compile for Pascal (compute capabilities 6.0, 6.1, 6.2),
Volta (compute capability 7.0), Turing (compute capability 7.5),
and, if the CUDA version is >= 11.0, Ampere (compute capability 8.0).
If you wish to cross-compile for a single specific architecture,
export TORCH_CUDA_ARCH_LIST="compute capability" before running setup.py.

which surprises me, since I have 4 GPUs on this machine.

How can I build this Docker image on a SageMaker-managed AWS EC2 instance?
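As a side note, apex's warning above hints at pinning the target architecture. Since GPUs are never visible during `docker build`, one possible workaround (not confirmed in this thread; the only assumption is that the T4's compute capability is 7.5) is to set the arch list in the Dockerfile before the apex build step:

```dockerfile
# GPUs are invisible at build time, so apex falls back to
# cross-compiling for every architecture it knows about. Pinning
# the list to the T4's compute capability (7.5) keeps the build
# targeted to the hardware that will actually run the image.
ENV TORCH_CUDA_ARCH_LIST="7.5"
```

This only narrows the compiled kernels; it does not by itself fix the `DeviceUtils.cuh` error.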

doulemint commented 2 years ago

Hi, did you solve this problem? I also fail to compile apex when building the Docker image.

lahiiru commented 2 years ago

ATen/cuda/DeviceUtils.cuh: No such file or directory

This issue is already discussed in https://github.com/NVIDIA/apex/issues/1043

  1. Remove the apex build command from the Dockerfile
    # RUN cd /home/ && git clone https://github.com/NVIDIA/apex.git apex && cd apex && python setup.py install --cuda_ext --cpp_ext
  2. Add the following to the Dockerfile in place of the removed line.
    RUN cd /home/ && git clone https://github.com/NVIDIA/apex.git apex && cd apex && git reset --hard 3fe10b5597ba14a748ebb271a6ab97c09c5701ac && pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

    Note: See the APEX readme to find latest build instructions.
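To sanity-check that the pinned commit actually compiled the CUDA extensions, you can try importing the module built from the very file that failed (`csrc/layer_norm_cuda_kernel.cu`); the image tag below is a placeholder, not from the thread:

```shell
# Hypothetical image tag; substitute your own built image.
docker run --rm --gpus all semantic-seg:latest \
  python -c "import apex; import fused_layer_norm_cuda; print('apex CUDA extensions OK')"
```

If `fused_layer_norm_cuda` imports cleanly, the `--cuda_ext` build succeeded.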

"Torch did not find available GPUs on this system"

  1. You need to install the NVIDIA Docker plugin (you might already have it), then use the nvidia-docker command instead of docker.
  2. Make sure you expose the GPUs when running the container (e.g. NV_GPU='0,1' nvidia-docker ... or --gpus). Note that GPUs are never visible during docker build itself, which is why apex warned about cross-compiling.

    You might also be interested in the AWS guide on deep learning containers.
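The two points above in command form (the image name is illustrative, not from the thread):

```shell
# Option A: modern Docker with the NVIDIA Container Toolkit installed,
# exposing all GPUs to the container.
docker run --rm --gpus all semantic-seg nvidia-smi

# Option B: the legacy nvidia-docker wrapper, selecting specific GPUs.
NV_GPU='0,1' nvidia-docker run --rm semantic-seg nvidia-smi
```

If `nvidia-smi` inside the container lists your T4s, Torch will find them at run time.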

Matt-Dinh commented 1 year ago

It worked. It's been a while, but still, thank you so much.