EleutherAI / gpt-neox

An implementation of model parallel autoregressive transformers on GPUs, based on the Megatron and DeepSpeed libraries
https://www.eleuther.ai/
Apache License 2.0

Running through Dockerfile broken #419

Closed: VHellendoorn closed this issue 2 years ago

VHellendoorn commented 2 years ago

Describe the bug When using an image based on the provided Dockerfile and running the quick start steps (download the Enron data, run deepy.py), execution crashes before training begins.

To Reproduce Steps to reproduce the behavior:

  1. Build an image using the provided Dockerfile
  2. Run said image, mounting 8 Quadro RTX 8000 GPUs
  3. Fetch the Enron data using the prepare_data.py script
  4. Run ./deepy.py pretrain_gpt2.py -d configs small.yml local_configs.yml
  5. The code crashes with a nondescript NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled system error, NCCL version 2.7.8

Expected behavior Training starts or a specific error is provided.

Proposed solution The NCCL error is typically a stand-in for a real issue that is not relayed back through multiprocessing. As a first step, it would be helpful to know whether this setup works out of the box for others; if it does, the culprit is likely my hardware or CUDA version.

Environment (please complete the following information):


StellaAthena commented 2 years ago

Based on this issue I would try downgrading to PyTorch 1.6 and CUDA 10.2. It seems that the problem is GPU-specific, and unfortunately I do not have access to an RTX 8000 for testing.
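One way to try that downgrade inside the container, sketched here under the assumption that PyTorch is managed through pip:

# Sketch only: swap in PyTorch 1.6, whose default wheel is built against CUDA 10.2.
pip install torch==1.6.0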

EricHallahan commented 2 years ago

Can you tell me what you mean when you say that the Quadro RTX 8000s are running CUDA 11.2? Do you mean your driver supports CUDA 11.2? If so, what driver version? Did you update any packages? More detailed information about your environment would greatly assist us in pinpointing the issue.

The one thing that stands out to me is that the Quadro RTX 8000 is a Turing-based card (sm_75). I see the DeepSpeed install, the APEX install, and the installation of the Megatron kernels all as potential areas of interest. The last of the three is particularly relevant given your description: as far as I can tell, the Quick Start guide does not document the installation of the Megatron fused kernels. If you have not done so already, try running python ./megatron/fused_kernels/setup.py install from the repository base and check whether the issue persists.
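Spelled out as shell commands (sketch; run from wherever the repository is checked out):

# Build and install the Megatron fused CUDA kernels from the repository base.
cd gpt-neox
python ./megatron/fused_kernels/setup.py install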

VHellendoorn commented 2 years ago

Thanks for your quick responses. I did dig into that linked issue thread a bit, but no luck with the fixes proposed there.

@EricHallahan, I was referring to the output of nvcc --version within the container, which on a second look is 11.1 (not 11.2), my apologies. The specific build is cuda_11.1.TC455_06.29190527_0, so I'm guessing driver version 455.06. This is all run through an nvidia-docker backed container (version 2.6.0-1) on Linux; the driver/CUDA version in the host OS is 450.142 / 11.0, but I suspect that discrepancy doesn't play a role. No changes were made to the code and nothing else was installed in the container.
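For reference, the usual ways to collect these numbers (sketch; run inside the container and again on the host to compare):

# CUDA toolkit version seen by the compiler
nvcc --version
# Driver version and visible GPUs
nvidia-smi
# PyTorch build and the CUDA version it was compiled against
python -c "import torch; print(torch.__version__, torch.version.cuda)"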

The container did prompt me to run that python command the first time I ran deepy.py, and the install appeared to succeed. So the full chain of commands is:

git clone https://github.com/EleutherAI/gpt-neox && cd gpt-neox
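# Build the image from the repository's Dockerfile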
docker build -t gptneox -f Dockerfile .
docker run --rm -it --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 --mount type=bind,src=$PWD,dst=/gpt-neox gptneox
# Inside the container
cd /gpt-neox
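# Fetch the Enron data for the Quick Start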
sudo python3 prepare_data.py
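# Install the Megatron fused kernels (the container prompted for this on first run)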
sudo python /gpt-neox/megatron/fused_kernels/setup.py install
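# Launch training with the small model and local setup configs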
sudo ./deepy.py pretrain_gpt2.py -d configs small.yml local_setup.yml
EricHallahan commented 2 years ago

Can you run it again with NCCL_DEBUG=WARN?
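For example (sketch, reusing the launch command from above):

# Prefix the environment variable so NCCL reports its own warnings.
NCCL_DEBUG=WARN ./deepy.py pretrain_gpt2.py -d configs small.yml local_setup.yml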

VHellendoorn commented 2 years ago

No luck. It looks like the error sets in around the all_reduce call in PyTorch (stack trace excerpt below), though running with a single GPU doesn't solve it; it just slightly shortens the trace:

  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 1169, in all_reduce
    work = default_pg.allreduce([tensor], opts)
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled system error, NCCL version 2.7.8
ncclSystemError: System call (socket, malloc, munmap, etc) failed.
    work = default_pg.allreduce([tensor], opts)
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled system error, NCCL version 2.7.8
ncclSystemError: System call (socket, malloc, munmap, etc) failed

I'm sensing that this is a PyTorch + NCCL (+ maybe Docker) rabbit hole more than anything. E.g., this thread has some interesting ideas. When I have time, I'll pursue some leads in that direction.

EricHallahan commented 2 years ago

Both the NCCL troubleshooting guide and the PyTorch Lightning issue that @StellaAthena pointed to at the beginning of this thread echo the same potential fix: You may need to increase the shared memory size of the container.

If the fix proposed above does not resolve the problem, it would be a huge help if you could provide a more detailed stack trace (the warnings that NCCL throws are particularly valuable). I think we can agree that the issue you are experiencing lies somewhere in the interaction between NCCL and your environment (most likely Docker); we just need to find it.
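A quick way to confirm the current limit from inside the container (the Docker default is 64 MB):

# Size of the shared memory mount that NCCL uses for intra-node communication
df -h /dev/shm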

VHellendoorn commented 2 years ago

That worked! Adding –shm-size=1g –ulimit memlock=-1 to the nvidia-docker run (for others: note that this won't work with docker run --runtime=nvidia) command solved it. The default shared memory was 64MB, which is evidently far too little.

Thanks so much for your help!

VHellendoorn commented 2 years ago

Oh, and a very minor note in case you are updating the README anyway: the quick start section talks about a config file named local_configs.yml, which no longer exists. That should be local_setup.yml. It is referenced correctly further down.

VHellendoorn commented 2 years ago

Actually, would it help if I submitted a PR documenting the "Using Docker" steps? Happy to add something to the README.

StellaAthena commented 2 years ago

> Actually, would it help if I submitted a PR documenting the "Using Docker" steps? Happy to add something to the README.

Yes please!

PyxAI commented 2 years ago

> That worked! Adding –shm-size=1g –ulimit memlock=-1 to the nvidia-docker run (for others: note that this won't work with docker run --runtime=nvidia) command solved it. The default shared memory was 64MB, which is evidently far too little.
>
> Thanks so much for your help!

That worked for me too. Just want to mention that the dashes in the command quoted above need to be replaced with double hyphens, i.e. --shm-size=1g --ulimit memlock=-1 (– is not the same as --).
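For anyone copy-pasting, a sketch of the full corrected invocation, reusing the image name and bind mount from the earlier comment:

# 1 GB of shared memory and an unlimited memlock limit, as suggested above.
nvidia-docker run --rm -it \
  -e NVIDIA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
  --shm-size=1g --ulimit memlock=-1 \
  --mount type=bind,src=$PWD,dst=/gpt-neox gptneox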