ai4d-iasc / trixie

Scripts and documentation about trixie hpc

SSL issue on nodes #31

Open SamuelLarkin opened 4 years ago

SamuelLarkin commented 4 years ago

I'm trying to install Sockeye with Horovod, but to do so I need both internet access and access to CUDA/nvcc. These two requirements seem to be mutually exclusive on Trixie: the head node has internet access but no nvcc, while a worker node has CUDA installed but no internet access.

Here is the error message I'm seeing.

Traceback (most recent call last):
  File "/project/WMT20/opt/miniconda3/Sockeye-2.1.21/lib/python3.7/site-packages/pip/_vendor/urllib3/contrib/pyopenssl.py", line 313, in recv_into
    return self.connection.recv_into(*args, **kwargs)
  File "/project/WMT20/opt/miniconda3/Sockeye-2.1.21/lib/python3.7/site-packages/OpenSSL/SSL.py", line 1840, in recv_into
    self._raise_ssl_error(self._ssl, result)
  File "/project/WMT20/opt/miniconda3/Sockeye-2.1.21/lib/python3.7/site-packages/OpenSSL/SSL.py", line 1646, in _raise_ssl_error
    raise WantReadError()
OpenSSL.SSL.WantReadError

How do I get a valid SSL on a node or access to CUDA on the head node?

itamblyn commented 4 years ago

This looks like a high priority issue @ddamoursNRC @NRCGavin

joeydumont commented 4 years ago

Could you share your job submission script so I can test? You should have Internet access from the nodes.

SamuelLarkin commented 4 years ago

Hi @joeydumont ,

I'm trying to install Sockeye-2, which supports Horovod, a distributed deep learning training framework. I'm following the guide "Build a Conda Environment with GPU Support for Horovod", with some added dependencies for Sockeye-2. The original guide's intent is to build a conda environment with all the major deep learning frameworks plus Jupyter.

I'm not sure that my scripts are fully functional yet because I can't get them to access the internet or CUDA ;) but here's what I've got so far. Under trixie:/home/larkins/git/install/script/Sockeye-2.Horovod, I'm trying to do:

source /project/WMT20/setup_tools
export OMPI_MCA_opal_cuda_support=true
export ENV_PREFIX=$CONDA_PREFIX/Sockeye-2.1.21
export CUDA_HOME=/usr/local/cuda-10.1
export HOROVOD_CUDA_HOME=$CUDA_HOME
export HOROVOD_NCCL_HOME=$ENV_PREFIX
export HOROVOD_GPU_OPERATIONS=NCCL
conda env create --prefix $ENV_PREFIX --file environment.yml --force

Once the environment is properly created, I run horovodrun --check-build and get a bunch of errors, followed by:

Horovod v0.19.5:

Available Frameworks:
    [X] TensorFlow
    [X] PyTorch
    [X] MXNet

Available Controllers:
    [X] MPI
    [X] Gloo

Available Tensor Operations:
    [ ] NCCL
    [ ] DDL
    [ ] CCL
    [X] MPI
    [X] Gloo

We can see that NCCL wasn't detected even though it is part of the build file.
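To make the check above repeatable, here is a minimal sketch that parses the text printed by horovodrun --check-build and lists the entries that are not enabled. The function name `parse_check_build` is hypothetical, and the parser only assumes the layout shown in the output pasted in this thread; it is not part of Horovod.

```python
import re


def parse_check_build(text):
    """Return {section: {name: enabled}} parsed from check-build style text.

    Assumes the layout pasted in this thread: "Available <Section>:" headers
    followed by "[X] Name" or "[ ] Name" entries.
    """
    sections = {}
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        header = re.match(r"^Available (.+):$", stripped)
        if header:
            current = header.group(1)
            sections[current] = {}
            continue
        entry = re.match(r"^\[( |X)\]\s+(\S+)$", stripped)
        if entry and current is not None:
            sections[current][entry.group(2)] = entry.group(1) == "X"
    return sections


# Output copied from the report above.
SAMPLE = """\
Horovod v0.19.5:

Available Frameworks:
    [X] TensorFlow
    [X] PyTorch
    [X] MXNet

Available Controllers:
    [X] MPI
    [X] Gloo

Available Tensor Operations:
    [ ] NCCL
    [ ] DDL
    [ ] CCL
    [X] MPI
    [X] Gloo
"""

report = parse_check_build(SAMPLE)
missing = [name for name, ok in report["Tensor Operations"].items() if not ok]
print(missing)  # ['NCCL', 'DDL', 'CCL']
```

In a job script, the same function could be fed the captured output of horovodrun --check-build so the job fails early when NCCL is not listed.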

joeydumont commented 4 years ago

Hi Sam,

I tried this today and got the same errors as you during the download, but I hit them on both the head node and the compute nodes. The error is related to pip (or rather, to urllib3 not retrying when it gets a RST packet from upstream). I was able to install the environment just now. Unfortunately, I still get some errors, as you'll see below.

The fact that it worked at night (I was just able to run the install job on the compute node) makes me think that some networking appliance was having trouble keeping up. The issue was happening consistently when trying to download mxnet-cuda101, which is about 750MB in size. Even using wget to download it would consistently have issues, but wget retries properly, so the download completed successfully.
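The difference between the two tools is essentially "retry and resume from the last byte received". As a rough illustration only, here is a minimal sketch of that behaviour in Python. The names `http_range_reader` and `download_with_resume` are hypothetical, and the sketch assumes the server honours HTTP Range requests; it is not how pip or wget are implemented.

```python
import http.client
import time
import urllib.request


def http_range_reader(url, timeout=60):
    """Build reader(offset), which yields chunks of `url` from byte `offset` on."""
    def reader(offset):
        request = urllib.request.Request(url, headers={"Range": f"bytes={offset}-"})
        with urllib.request.urlopen(request, timeout=timeout) as response:
            # A resumed request must be answered with 206 Partial Content;
            # otherwise the server restarted from byte 0 and resuming is unsafe.
            if offset and response.status != 206:
                raise RuntimeError("server ignored the Range header")
            while True:
                chunk = response.read(1 << 20)
                if not chunk:
                    return
                yield chunk
    return reader


def download_with_resume(reader, sink, max_attempts=5, backoff_seconds=1.0,
                         sleep=time.sleep):
    """Copy everything from `reader` into `sink`, resuming after dropped connections.

    Returns the number of bytes written. Network errors (for example a
    connection reset) trigger a retry from the current offset; the last
    error is re-raised once `max_attempts` is exhausted.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    offset = 0
    for attempt in range(1, max_attempts + 1):
        try:
            for chunk in reader(offset):
                sink.write(chunk)
                offset += len(chunk)
            return offset
        except (OSError, http.client.HTTPException):
            if attempt == max_attempts:
                raise
            sleep(backoff_seconds * attempt)  # simple linear backoff
```

Usage would look like `download_with_resume(http_range_reader(url), open(path, "wb"))`. Passing the reader in as a function keeps the retry logic testable without network access.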

Here's the script I used for the install:

#!/bin/bash
#SBATCH -p JobTesting
#SBATCH -A itops
#SBATCH --time=2:00:00
#SBATCH --gres=gpu:4
#SBATCH --mail-user=joey.dumont@nrc-cnrc.gc.ca
#SBATCH --mail-type=ALL

source /project/WMT20/setup_tools
export OMPI_MCA_opal_cuda_support=true
export ENV_PREFIX=$CONDA_PREFIX/Sockeye-2.1.21
export CUDA_HOME=/usr/local/cuda-10.1
export HOROVOD_CUDA_HOME=$CUDA_HOME
export HOROVOD_NCCL_HOME=$ENV_PREFIX
export HOROVOD_GPU_OPERATIONS=NCCL
export PIP_VERBOSE=1
conda env create -vv --prefix $ENV_PREFIX --file environment.yml --force
conda activate $ENV_PREFIX
horovodrun --check-build

I got the same errors as you:

horovodrun --check-build
2020-09-10 22:30:09.681037: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory
2020-09-10 22:30:09.681211: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory
2020-09-10 22:30:09.681238: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
2020-09-10 22:30:25.038451: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory
2020-09-10 22:30:25.038626: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory
2020-09-10 22:30:25.038648: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
2020-09-10 22:30:30.844146: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory
2020-09-10 22:30:30.844307: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory
2020-09-10 22:30:30.844338: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
2020-09-10 22:30:36.443558: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory
2020-09-10 22:30:36.443722: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory
2020-09-10 22:30:36.443744: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
2020-09-10 22:30:42.477443: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory
2020-09-10 22:30:42.477607: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory
2020-09-10 22:30:42.477626: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
2020-09-10 22:30:48.617062: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory
2020-09-10 22:30:48.617224: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory
2020-09-10 22:30:48.617243: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
Horovod v0.19.5:

Available Frameworks:
    [X] TensorFlow
    [X] PyTorch
    [X] MXNet

Available Controllers:
    [X] MPI
    [X] Gloo

Available Tensor Operations:
    [ ] NCCL
    [ ] DDL
    [ ] CCL
    [X] MPI
    [X] Gloo    

It turns out that the loader issues are a known problem in tf2.1, so I downgraded your tensorflow-gpu version to 2.0.* in your environment file. This made TensorFlow stop complaining, but there is still no NCCL in the --check-build output.

(/project/WMT20/opt/miniconda3/Sockeye-2.1.21) [admin.joey.dumont@cn135 ~]$ horovodrun --check-build
Horovod v0.19.5:

Available Frameworks:
    [X] TensorFlow
    [X] PyTorch
    [X] MXNet

Available Controllers:
    [X] MPI
    [X] Gloo

Available Tensor Operations:
    [ ] NCCL
    [ ] DDL
    [ ] CCL
    [X] MPI
    [X] Gloo    

I'll try to find out a bit later what the exact requirements are for NCCL to be properly detected. In the logs, I see that nccl is installed, and I don't see any relevant errors.

Hope this helps.

SamuelLarkin commented 4 years ago

Thanks @joeydumont. I will give your solution a try and see if I can get some clues as to why NCCL seems to be installed but is not detected by Horovod.

jhickeyNRC commented 4 years ago

I've had similar issues in the past (installed software not being found), and one cause that turned up often was that the search path did not contain the location of the installed software. If you haven't already, you may want to check that Horovod is searching the path where NCCL is installed.
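One quick way to test that guess is to look for the NCCL header and shared library under the prefix exported as HOROVOD_NCCL_HOME in the scripts above. This is a minimal sketch: the function name `find_nccl` is hypothetical, and the include/ and lib/ (or lib64/) layout under the prefix is an assumption about how the conda package lays out its files, not something confirmed in this thread.

```python
import glob
import os


def find_nccl(prefix):
    """Return the NCCL header and shared libraries found under `prefix`.

    Assumed layout: <prefix>/include/nccl.h and <prefix>/lib*/libnccl.so*.
    """
    header = os.path.join(prefix, "include", "nccl.h")
    libraries = sorted(
        path
        for libdir in ("lib", "lib64")
        for path in glob.glob(os.path.join(prefix, libdir, "libnccl.so*"))
    )
    return {
        "header": header if os.path.isfile(header) else None,
        "libraries": libraries,
    }


if __name__ == "__main__":
    # HOROVOD_NCCL_HOME is the variable exported in the install scripts above.
    nccl_home = os.environ.get("HOROVOD_NCCL_HOME")
    if nccl_home is None:
        print("HOROVOD_NCCL_HOME is not set")
    else:
        print(nccl_home, find_nccl(nccl_home))
```

If the header or the library list comes back empty, the prefix given to Horovod probably does not match where NCCL was actually installed.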

Just a guess in the dark.

fieldsa commented 4 years ago

As far as download issues are concerned: TLS/SSL transfers have been stalling this afternoon and yesterday, with downloads interrupted midway through.

xfer speed - on trixie hn2 and cn101 this afternoon:

Resolving files.wolframcdn.com (files.wolframcdn.com)... 152.195.19.5
Connecting to files.wolframcdn.com (files.wolframcdn.com)|152.195.19.5|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1634087483 (1.5G) [application/octet-stream]
Saving to: ‘files.wolframcdn.com/CUDA/12.1.0.0/CUDAResources-Lin64-12.1.0.paclet’

0% [ ] 12,632,055 241KB/s in 52s

2020-10-01 12:19:19 (239 KB/s) - Read error at byte 12632055/1634087483 (Connection reset by peer).

xfer speed - on another host around same time today:

--2020-10-01 16:26:04-- https://files.wolframcdn.com/CUDA/12.1.0.0/CUDAResources-Lin64-12.1.0.paclet
Resolving files.wolframcdn.com (files.wolframcdn.com)... 152.195.19.5
Connecting to files.wolframcdn.com (files.wolframcdn.com)|152.195.19.5|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1634087483 (1.5G) [application/octet-stream]
Saving to: ‘CUDAResources-Lin64-12.1.0.paclet’

CUDAResources-Lin64-12.1 100%[=================================>] 1.52G 110MB/s in 14s

2020-10-01 16:26:19 (109 MB/s) - ‘CUDAResources-Lin64-12.1.0.paclet’ saved [1634087483/1634087483]

fieldsa commented 4 years ago

Regular HTTP transfers do not appear to be impacted: large file downloads from a CentOS mirror site are successful.

[fieldsa@cn101 ~]$ wget http://distro.ibiblio.org/centos/7.8.2003/isos/x86_64/CentOS-7-x86_64-DVD-2003.iso
--2020-10-01 14:49:36-- http://distro.ibiblio.org/centos/7.8.2003/isos/x86_64/CentOS-7-x86_64-DVD-2003.iso
Resolving distro.ibiblio.org (distro.ibiblio.org)... 152.19.134.43
Connecting to distro.ibiblio.org (distro.ibiblio.org)|152.19.134.43|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4781506560 (4.5G) [application/octet-stream]
Saving to: ‘CentOS-7-x86_64-DVD-2003.iso’

100%[=======================================>] 4,781,506,560 7.11MB/s in 12m 33s

2020-10-01 15:02:09 (6.05 MB/s) - ‘CentOS-7-x86_64-DVD-2003.iso’ saved [4781506560/4781506560]

fieldsa commented 4 years ago

An HTTPS transfer test against a CentOS mirror presently fails as well, so the problem is not specific to certain external servers. A follow-up ticket will be sent to the firewall team.

[fieldsa@cn101 download-test]$ wget https://mirror.its.dal.ca/centos/7.8.2003/isos/x86_64/CentOS-7-x86_64-DVD-2003.iso
--2020-10-01 15:12:43-- https://mirror.its.dal.ca/centos/7.8.2003/isos/x86_64/CentOS-7-x86_64-DVD-2003.iso
Resolving mirror.its.dal.ca (mirror.its.dal.ca)... 192.75.96.254, 2001:410:a000:50::20
Connecting to mirror.its.dal.ca (mirror.its.dal.ca)|192.75.96.254|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4781506560 (4.5G) [application/octet-stream]
Saving to: ‘CentOS-7-x86_64-DVD-2003.iso’

25% [=========> ] 1,210,253,046 5.19MB/s in 3m 50s

2020-10-01 15:16:34 (5.01 MB/s) - Read error at byte 1210253046/4781506560 (Connection reset by peer).
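For comparing several such test runs, the "Read error" lines can be reduced to numbers. This is a minimal sketch with a hypothetical helper name, `parse_read_error`, and it assumes only the message format visible in the wget logs pasted in this thread.

```python
import re

# Matches e.g. "Read error at byte 1210253046/4781506560 (Connection reset by peer)."
READ_ERROR = re.compile(r"Read error at byte (\d+)/(\d+) \((.+?)\)")


def parse_read_error(line):
    """Return bytes received, expected total, fraction, and reason, or None."""
    match = READ_ERROR.search(line)
    if match is None:
        return None
    received, total = int(match.group(1)), int(match.group(2))
    return {
        "received": received,
        "total": total,
        "fraction": received / total,
        "reason": match.group(3),
    }


# Line copied from the HTTPS test above.
line = ("2020-10-01 15:16:34 (5.01 MB/s) - Read error at byte "
        "1210253046/4781506560 (Connection reset by peer).")
result = parse_read_error(line)
print(f"{result['fraction']:.1%} received before: {result['reason']}")
```

Collecting these values across hosts and times of day would show whether the resets cluster at a particular size or duration, which may be useful context for the firewall ticket.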

itamblyn commented 2 years ago

Is this issue still valid, or can this be closed?