Open la-cruche opened 2 years ago
Hi, did you solve this problem? I also fail to compile apex when building the Docker image.
`ATen/cuda/DeviceUtils.cuh: No such file or directory`

This issue is already discussed in NVIDIA/apex#1043

- Remove the apex build command from the Dockerfile:

  `# RUN cd /home/ && git clone https://github.com/NVIDIA/apex.git apex && cd apex && python setup.py install --cuda_ext --cpp_ext`

- Add the line below to the Dockerfile in place of the removed one:

  `RUN cd /home/ && git clone https://github.com/NVIDIA/apex.git apex && cd apex && git reset --hard 3fe10b5597ba14a748ebb271a6ab97c09c5701ac && pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./`

Note: See the APEX readme for the latest build instructions.

"Torch did not find available GPUs on this system"

- You need to install the NVIDIA Docker plugin (you might already have it), then use the `nvidia-docker` command instead of `docker`.
- Make sure you expose the GPUs when running the container (i.e. `NV_GPU='0,1' nvidia-docker ...` or `--gpus`).

You might be interested in the AWS guide on deep learning containers.
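Put together, the relevant part of the Dockerfile might look like the sketch below. The base image is an assumption (pick one that matches your CUDA/PyTorch versions); the commit pin is the one suggested in this thread:

```dockerfile
# Assumed base image -- adjust to your CUDA/PyTorch setup.
FROM nvcr.io/nvidia/pytorch:21.05-py3

# Build apex at a pinned commit instead of the moving master branch,
# so the source matches the PyTorch headers available in the image.
RUN cd /home/ && \
    git clone https://github.com/NVIDIA/apex.git apex && \
    cd apex && \
    git reset --hard 3fe10b5597ba14a748ebb271a6ab97c09c5701ac && \
    pip install -v --no-cache-dir \
        --global-option="--cpp_ext" --global-option="--cuda_ext" ./
```

Note that compiling the `--cuda_ext` extensions needs the CUDA toolkit (nvcc) inside the image, but not GPU access at build time.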
It worked. It's been a while but still thank you so much.
Hi,
I'm trying to do a `docker build .` on a SageMaker-managed EC2 instance in AWS (ml.g4dn.12xlarge, with T4 cards). `docker build .` runs for a few minutes, outputs several things, and errors with the following:

`ATen/cuda/DeviceUtils.cuh: No such file or directory`

Interestingly, early in the build it says "Torch did not find available GPUs on this system", which surprises me since I have 4 GPUs on my machine.

How do I build that Docker image on a SageMaker-managed AWS EC2 instance?
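As a quick sanity check for the GPU-visibility part of the question, you can run `nvidia-smi` inside a container with the GPUs exposed. This assumes the NVIDIA Container Toolkit (or the older `nvidia-docker` wrapper) is installed on the instance; the CUDA image tag is just an example:

```shell
# Docker 19.03+ with the NVIDIA Container Toolkit:
docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi

# Older nvidia-docker wrapper, selecting specific GPUs:
NV_GPU='0,1' nvidia-docker run --rm nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi
```

Seeing "Torch did not find available GPUs" during the build itself is expected: a plain `docker build` does not expose GPUs to the build containers, regardless of how many GPUs the host has.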