Open SamuelLarkin opened 4 years ago
This looks like a high priority issue @ddamoursNRC @NRCGavin
Could you share your job submission script so I can test? You should have Internet access from the nodes.
Hi @joeydumont ,
I'm trying to install Sockeye-2, which has support for Horovod, a distributed deep learning training framework. I'm following the guide Build a Conda Environment with GPU Support for Horovod, but with some added dependencies for Sockeye-2. The original guide's intent is to make a conda environment with all the major deep learning frameworks plus Jupyter.
I'm not sure that my scripts are fully functional yet because I can't get them to access the internet or CUDA ;) but here's what I've got so far. Under trixie:/home/larkins/git/install/script/Sockeye-2.Horovod, I'm trying to do:
source /project/WMT20/setup_tools
export OMPI_MCA_opal_cuda_support=true
export ENV_PREFIX=$CONDA_PREFIX/Sockeye-2.1.21
export CUDA_HOME=/usr/local/cuda-10.1
export HOROVOD_CUDA_HOME=$CUDA_HOME
export HOROVOD_NCCL_HOME=$ENV_PREFIX
export HOROVOD_GPU_OPERATIONS=NCCL
conda env create --prefix $ENV_PREFIX --file environment.yml --force
Once the environment is properly created, I run horovodrun --check-build and get a bunch of errors, followed by:
Horovod v0.19.5:
Available Frameworks:
[X] TensorFlow
[X] PyTorch
[X] MXNet
Available Controllers:
[X] MPI
[X] Gloo
Available Tensor Operations:
[ ] NCCL
[ ] DDL
[ ] CCL
[X] MPI
[X] Gloo
We can see that NCCL wasn't detected even though it is part of the build file.
Hi Sam,
I tried this today and got the same errors as you during the download, but I saw the issue on both the head node and the compute nodes. The error is related to pip (or rather urllib not retrying when it gets a RST packet from upstream). I was able to install the environment just now. Unfortunately, I still get some errors, as you'll see below.
The fact that it worked at night (I was just able to run the install job on the compute node) makes me think that some networking appliance was having trouble keeping up. The issue was happening consistently when trying to download mxnet-cuda101, which is about 750MB in size. Even using wget to download it would consistently have issues, but wget retries properly, so the download completed successfully.
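Since pip does not retry after a connection reset, one workaround is to wrap the download step in a retry loop and install from the local file afterwards. The following is only a minimal sketch: the `retry` function, its arguments, and the `WHEEL_URL` placeholder are hypothetical names for illustration, not something from the environment file.

```shell
#!/bin/bash
# Hypothetical helper: retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached.
retry() {
    local max_attempts="$1"
    local delay="$2"
    shift 2
    local attempt=1
    while true; do
        # Stop as soon as the command succeeds
        "$@" && return 0
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after $attempt attempts: $*" >&2
            return 1
        fi
        echo "attempt $attempt failed, retrying in ${delay}s: $*" >&2
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}

# Usage sketch (WHEEL_URL is a placeholder, not a real package location):
#   retry 5 10 wget --continue "$WHEEL_URL"
#   pip install ./name-of-downloaded-file.whl
```

With `wget --continue`, each new attempt resumes the partial file instead of starting over, which matters for a 750MB download that keeps getting reset.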
Here's the script I used to install:
#!/bin/bash
#SBATCH -p JobTesting
#SBATCH -A itops
#SBATCH --time=2:00:00
#SBATCH --gres=gpu:4
#SBATCH --mail-user=joey.dumont@nrc-cnrc.gc.ca
#SBATCH --mail-type=ALL
source /project/WMT20/setup_tools
export OMPI_MCA_opal_cuda_support=true
export ENV_PREFIX=$CONDA_PREFIX/Sockeye-2.1.21
export CUDA_HOME=/usr/local/cuda-10.1
export HOROVOD_CUDA_HOME=$CUDA_HOME
export HOROVOD_NCCL_HOME=$ENV_PREFIX
export HOROVOD_GPU_OPERATIONS=NCCL
export PIP_VERBOSE=1
conda env create -vv --prefix $ENV_PREFIX --file environment.yml --force
conda activate $ENV_PREFIX
horovodrun --check-build
I got the same errors as you:
horovodrun --check-build
2020-09-10 22:30:09.681037: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory
2020-09-10 22:30:09.681211: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory
2020-09-10 22:30:09.681238: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
2020-09-10 22:30:25.038451: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory
2020-09-10 22:30:25.038626: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory
2020-09-10 22:30:25.038648: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
2020-09-10 22:30:30.844146: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory
2020-09-10 22:30:30.844307: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory
2020-09-10 22:30:30.844338: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
2020-09-10 22:30:36.443558: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory
2020-09-10 22:30:36.443722: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory
2020-09-10 22:30:36.443744: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
2020-09-10 22:30:42.477443: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory
2020-09-10 22:30:42.477607: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory
2020-09-10 22:30:42.477626: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
2020-09-10 22:30:48.617062: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory
2020-09-10 22:30:48.617224: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory
2020-09-10 22:30:48.617243: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
Horovod v0.19.5:
Available Frameworks:
[X] TensorFlow
[X] PyTorch
[X] MXNet
Available Controllers:
[X] MPI
[X] Gloo
Available Tensor Operations:
[ ] NCCL
[ ] DDL
[ ] CCL
[X] MPI
[X] Gloo
It turns out that the loader issues are a known problem in TensorFlow 2.1, so I downgraded your tensorflow-gpu version to 2.0.* in your environment file. This made TensorFlow stop complaining, but there is still no NCCL in the --check-build output.
(/project/WMT20/opt/miniconda3/Sockeye-2.1.21) [admin.joey.dumont@cn135 ~]$ horovodrun --check-build
Horovod v0.19.5:
Available Frameworks:
[X] TensorFlow
[X] PyTorch
[X] MXNet
Available Controllers:
[X] MPI
[X] Gloo
Available Tensor Operations:
[ ] NCCL
[ ] DDL
[ ] CCL
[X] MPI
[X] Gloo
I'll look into the exact requirements for NCCL to be properly detected a bit later. In the logs, I see that nccl is installed, and I don't see any relevant errors.
Hope this helps.
Thanks @joeydumont. I will give your solution a try and see if I can get some clues as to why NCCL seems installed but is not detected by Horovod.
I've had similar issues in the past (installed software not being found), and one cause that turned up often was that the search path did not contain the directory where the software was installed. If you haven't already, you may want to check that Horovod is searching the path where NCCL is installed.
Just a guess in the dark.
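A quick way to test that guess is to look for the NCCL header and shared library under the prefix that HOROVOD_NCCL_HOME points at. This is only a sketch: the `check_nccl_prefix` helper is hypothetical, and the `include/nccl.h` and `lib/libnccl.so*` layout is an assumption about where the conda package places its files.

```shell
#!/bin/bash
# Hypothetical helper: report whether an NCCL header and shared library
# exist under PREFIX. The include/ and lib/ layout is an assumption.
check_nccl_prefix() {
    local prefix="$1"
    local status=0
    if [ -f "$prefix/include/nccl.h" ]; then
        echo "found header: $prefix/include/nccl.h"
    else
        echo "missing header: $prefix/include/nccl.h"
        status=1
    fi
    # ls fails when the glob matches no file
    if ls "$prefix"/lib/libnccl.so* > /dev/null 2>&1; then
        echo "found library under: $prefix/lib"
    else
        echo "missing library: $prefix/lib/libnccl.so*"
        status=1
    fi
    return "$status"
}

# Usage sketch, after sourcing the same exports as the install script:
#   check_nccl_prefix "$HOROVOD_NCCL_HOME"
```

If the files turn out to live somewhere else, Horovod also reads HOROVOD_NCCL_INCLUDE and HOROVOD_NCCL_LIB at build time, so the header and library directories can be given separately.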
As far as download issues are concerned: this afternoon and yesterday, TLS/SSL transfers have been stalling, with downloads interrupted midway through.
Resolving files.wolframcdn.com (files.wolframcdn.com)... 152.195.19.5
Connecting to files.wolframcdn.com (files.wolframcdn.com)|152.195.19.5|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1634087483 (1.5G) [application/octet-stream]
Saving to: ‘files.wolframcdn.com/CUDA/12.1.0.0/CUDAResources-Lin64-12.1.0.paclet’
0% [ ] 12,632,055 241KB/s in 52s
2020-10-01 12:19:19 (239 KB/s) - Read error at byte 12632055/1634087483 (Connection reset by peer).
--2020-10-01 16:26:04-- https://files.wolframcdn.com/CUDA/12.1.0.0/CUDAResources-Lin64-12.1.0.paclet
Resolving files.wolframcdn.com (files.wolframcdn.com)... 152.195.19.5
Connecting to files.wolframcdn.com (files.wolframcdn.com)|152.195.19.5|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1634087483 (1.5G) [application/octet-stream]
Saving to: ‘CUDAResources-Lin64-12.1.0.paclet’
CUDAResources-Lin64-12.1 100%[=================================>] 1.52G 110MB/s in 14s
2020-10-01 16:26:19 (109 MB/s) - ‘CUDAResources-Lin64-12.1.0.paclet’ saved [1634087483/1634087483]
Regular HTTP transfers do not appear to be impacted, as large file downloads from a CentOS mirror site are successful.
[fieldsa@cn101 ~]$ wget http://distro.ibiblio.org/centos/7.8.2003/isos/x86_64/CentOS-7-x86_64-DVD-2003.iso
--2020-10-01 14:49:36-- http://distro.ibiblio.org/centos/7.8.2003/isos/x86_64/CentOS-7-x86_64-DVD-2003.iso
Resolving distro.ibiblio.org (distro.ibiblio.org)... 152.19.134.43
Connecting to distro.ibiblio.org (distro.ibiblio.org)|152.19.134.43|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4781506560 (4.5G) [application/octet-stream]
Saving to: ‘CentOS-7-x86_64-DVD-2003.iso’
100%[=======================================>] 4,781,506,560 7.11MB/s in 12m 33s
2020-10-01 15:02:09 (6.05 MB/s) - ‘CentOS-7-x86_64-DVD-2003.iso’ saved [4781506560/4781506560]
An HTTPS transfer test against a CentOS mirror presently fails as well, so this is not specific to certain external servers. A follow-up ticket will be sent to the firewall team.
[fieldsa@cn101 download-test]$ wget https://mirror.its.dal.ca/centos/7.8.2003/isos/x86_64/CentOS-7-x86_64-DVD-2003.iso
--2020-10-01 15:12:43-- https://mirror.its.dal.ca/centos/7.8.2003/isos/x86_64/CentOS-7-x86_64-DVD-2003.iso
Resolving mirror.its.dal.ca (mirror.its.dal.ca)... 192.75.96.254, 2001:410:a000:50::20
Connecting to mirror.its.dal.ca (mirror.its.dal.ca)|192.75.96.254|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4781506560 (4.5G) [application/octet-stream]
Saving to: ‘CentOS-7-x86_64-DVD-2003.iso’
25% [=========> ] 1,210,253,046 5.19MB/s in 3m 50s
2020-10-01 15:16:34 (5.01 MB/s) - Read error at byte 1210253046/4781506560 (Connection reset by peer).
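To repeat the HTTP versus HTTPS comparison across several runs, the wget logs can be classified automatically. The sketch below assumes the log contains the same messages as the transcripts above ("Connection reset by peer" on failure, "saved [" on success); the `classify_wget_log` helper and the log file names are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: classify_wget_log LOGFILE
# Prints "reset", "ok", or "unknown" depending on what the wget log contains.
classify_wget_log() {
    local logfile="$1"
    if grep -q "Connection reset by peer" "$logfile"; then
        echo "reset"
    elif grep -q "saved \[" "$logfile"; then
        echo "ok"
    else
        echo "unknown"
    fi
}

# Usage sketch: download the same file over both schemes, one try each,
# writing the wget messages to a log and discarding the payload.
#   wget --tries=1 -o http.log  -O /dev/null http://distro.ibiblio.org/centos/7.8.2003/isos/x86_64/CentOS-7-x86_64-DVD-2003.iso
#   wget --tries=1 -o https.log -O /dev/null https://mirror.its.dal.ca/centos/7.8.2003/isos/x86_64/CentOS-7-x86_64-DVD-2003.iso
#   echo "http:  $(classify_wget_log http.log)"
#   echo "https: $(classify_wget_log https.log)"
```

Running this at different times of day would show whether the resets track load on the network path, as the overnight success suggests.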
Is this issue still valid, or can this be closed?
I'm trying to install Sockeye with Horovod, but in order to do so, I need access to the internet and access to CUDA/nvcc. These requirements seem to be mutually exclusive on Trixie. On the head node you have internet access but not nvcc, and on a worker node you don't have internet access but CUDA is installed. Here is the error message I'm seeing.
How do I get a valid SSL on a node or access to CUDA on the head node?