facebookresearch / dlrm

An implementation of a deep learning recommendation model (DLRM)
MIT License

Segmentation fault when using 4 GPUs for training #42

Closed thumbe3 closed 4 years ago

thumbe3 commented 4 years ago

Specs:

Python version: 3.6.8
Pytorch version: 1.4.0
4 v100 GPUs
Cuda version: 10.1
Nvidia Driver Version: 418.87.00 

I added the following line in dlrm_s_pytorch.py:

import faulthandler; faulthandler.enable()

and used the following command to run the code

python3 -X faulthandler dlrm_s_pytorch.py --arch-sparse-feature-size=16 --arch-mlp-bot="13-512-256-64-16" --arch-mlp-top="512-256-1" --data-generation=dataset --data-set=kaggle --raw-data-file=/path-to-data --processed-data-file=/path-to-npz-file --loss-function=bce --round-targets=True --learning-rate=0.1 --mini-batch-size=64 --test-freq 0 --print-freq=1024 --print-time --use-gpu

It executes for some number of iterations (the count varies from run to run) and then fails with a segmentation fault. Here is a sample output:

Using 4 GPU(s)...
Reading pre-processed data=/users/ushmal/kaggleAdDisplayChallenge_processed.npz
Sparse features= 26, Dense features= 13
Reading pre-processed data=/users/ushmal/kaggleAdDisplayChallenge_processed.npz
Sparse features= 26, Dense features= 13
time/loss/accuracy (if enabled):
Finished training it 1024/613937 of epoch 0, 51.35 ms/it, loss 0.520202, accuracy 75.478 %
Finished training it 2048/613937 of epoch 0, 28.98 ms/it, loss 0.506464, accuracy 76.196 %
Finished training it 3072/613937 of epoch 0, 29.48 ms/it, loss 0.505029, accuracy 76.314 %
Finished training it 4096/613937 of epoch 0, 30.34 ms/it, loss 0.494111, accuracy 76.935 %
Finished training it 5120/613937 of epoch 0, 30.36 ms/it, loss 0.496054, accuracy 76.781 %
Finished training it 6144/613937 of epoch 0, 30.44 ms/it, loss 0.487835, accuracy 77.235 %
Finished training it 7168/613937 of epoch 0, 30.65 ms/it, loss 0.486214, accuracy 77.292 %
Fatal Python error: Segmentation fault

Thread 0x00007f64c1a25700 (most recent call first):

Thread 0x00007f64c2226700 (most recent call first):

Current thread 0x00007f64c2a27700 (most recent call first):

Thread 0x00007f64c3a29700 (most recent call first):
  File "/users/ushmal/.local/lib/python3.6/site-packages/torch/cuda/comm.py", line 165 in gather
  File "/users/ushmal/.local/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 68 in forward
  File "/users/ushmal/.local/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 101 in backward
  File "/users/ushmal/.local/lib/python3.6/site-packages/torch/autograd/function.py", line 77 in apply

Thread 0x00007f64c3228700 (most recent call first):

Thread 0x00007f65f70b2740 (most recent call first):
  File "/users/ushmal/.local/lib/python3.6/site-packages/torch/autograd/__init__.py", line 99 in backward
  File "/users/ushmal/.local/lib/python3.6/site-packages/torch/tensor.py", line 195 in backward
  File "dlrm_s_pytorch.py", line 814 in <module>
Segmentation fault (core dumped)

When using a single GPU or 2 GPUs, the segmentation fault does not arise even after 100,000 iterations. Thanks!
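
For reference, the faulthandler setup mentioned above is just the standard-library module enabled near the top of the script; a minimal sketch (the traceback dump goes to stderr by default):

import faulthandler

# On a fatal signal such as SIGSEGV, dump the Python tracebacks of all
# threads to stderr before the process dies.
faulthandler.enable(all_threads=True)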

mnaumovfb commented 4 years ago

Thank you for reporting this issue. We will take a look at it.

I wanted to clarify a couple of things:

  1. Do you see the crash without the faulthandler (i.e., without the added import line and without the -X faulthandler prefix on the command line)?
  2. I wanted to confirm that you see no crash on a single GPU or on two GPUs, as mentioned above.

thumbe3 commented 4 years ago

Thanks for the response

1) Yes, I see the crash without the fault handler as well.
2) Yes, I don't see a crash when using a single GPU or two GPUs.
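
In the meantime I can keep training on at most two GPUs, e.g. by hiding two of the four devices before CUDA is initialized. A rough sketch (this is not part of dlrm_s_pytorch.py):

import os

# Expose only the first two GPUs to this process; this must run before
# anything initializes CUDA (i.e. before torch.cuda is touched).
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

import torch
print(torch.cuda.device_count())  # expected: 2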

mnaumovfb commented 4 years ago

We are still investigating what is happening with the newer version of the framework. However, I wanted to mention that I ran the latest version of the code without a crash using the following older versions of PyTorch and CUDA:

import torch
print(torch.__version__)   # prints 1.1.0
print(torch.version.cuda)  # prints 9.0.176

albanD commented 4 years ago

This might be caused by this issue in PyTorch: https://github.com/pytorch/pytorch/issues/31906
Could you please try installing the nightly build and check whether that fixes it?

mnaumovfb commented 4 years ago

I believe that the above PyTorch issue was indeed the source of this bug. The back trace from the core file generated during the crash is very similar to the one reported above.

I have rerun both Criteo Kaggle and Terabyte datasets on 8 GPUs with the nightly PyTorch build, where the above issue is fixed, and I do not see the crash any more.

Please give it a try and let me know if it is resolved for you as well. You can obtain the nightly PyTorch build in a conda environment with, for example, the following command:

conda install pytorch torchvision cudatoolkit=10.1 -c pytorch-nightly
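
After installing, a quick way to confirm the build being imported is the same check as above; assuming the nightly install succeeded, the version string should be a dev build and the reported CUDA version should match the chosen cudatoolkit:

import torch

# Sanity check that the freshly installed nightly build is the one in use.
print(torch.__version__)   # a nightly build reports a dev/date-stamped version
print(torch.version.cuda)  # should report 10.1 for the command above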

thumbe3 commented 4 years ago

Thanks for solving the issue! I currently don't have the resources to rerun the experiment. I am closing the issue since it seems to be fixed.

Adamits commented 4 years ago

Is this fix in the PyTorch 1.4.0 PyPI distribution? I am getting pretty much the same backtrace when running backward() on 4 GPUs with DataParallel. Is there a build that I should be using? I am having an unrelated issue training my model with PyTorch 1.5.0.
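
For context, the pattern that fails for me is essentially the standard DataParallel training step; here is a rough sketch with a placeholder model and random data (not my actual HuggingFace setup):

import torch
import torch.nn as nn

# Placeholder model and random batch standing in for the real experiment;
# the crash surfaces in backward() through DataParallel's gather on 4 GPUs.
model = nn.DataParallel(nn.Linear(16, 1), device_ids=[0, 1, 2, 3]).cuda()
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(64, 16).cuda()
y = torch.randint(0, 2, (64, 1)).float().cuda()

for _ in range(10000):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()   # the segfault appears here after a variable number of steps
    optimizer.step()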

mnaumovfb commented 4 years ago

I don't think the fix made it into PyTorch 1.4.0. Can you please try 1.5 or the nightly version? Let me know if it works for you.

Adamits commented 4 years ago

Hi,

OK, I was trying 1.4.0 because of https://github.com/huggingface/transformers/issues/3936. My experiment uses HuggingFace. I will try the nightly version (it errored the last time I tried, though).

Davidrjx commented 3 years ago

I ran into the same issue:

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/usr/bin/python3 trains/conditional_layout_gan_train.py -c options/conditional_'.
Program terminated with signal SIGABRT, Aborted.
#0  0x00007ffff7a22e97 in raise () from /lib/x86_64-linux-gnu/libc.so.6
[Current thread is 1 (Thread 0x7ffbde7fc700 (LWP 108955))]
(gdb) bt
#0  0x00007ffff7a22e97 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007ffff7a24801 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007ffff7a6d897 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x00007ffff7a7490a in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#4  0x00007ffff7a7be1c in free () from /lib/x86_64-linux-gnu/libc.so.6
#5  0x00007fff1b8bf384 in at::TensorIterator::compute_shape() () from /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch.so
#6  0x00007fff1b8bfa7c in at::TensorIterator::build() () from /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch.so
#7  0x00007fff1b8c02c8 in at::TensorIterator::binary_op(at::Tensor&, at::Tensor const&, at::Tensor const&, bool) () from /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch.so
#8  0x00007fff1b5b263a in at::native::mul(at::Tensor const&, at::Tensor const&) () from /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch.so
#9  0x00007fff1e1def33 in at::CUDAType::(anonymous namespace)::mul(at::Tensor const&, at::Tensor const&) () from /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch.so
#10 0x00007fff1bb28960 in c10::detail::wrap_kernel_functor_unboxed_<c10::detail::WrapRuntimeKernelFunctor_<at::Tensor (*)(at::Tensor const&, at::Tensor const&), at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, at::Tensor const&> >, at::Tensor (at::Tensor const&, at::Tensor const&)>::call(c10::OperatorKernel*, at::Tensor const&, at::Tensor const&) ()
   from /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch.so
#11 0x00007fff1b62ab81 in std::result_of<c10::Dispatcher::callUnboxed<at::Tensor, at::Tensor const&, at::Tensor const&>(c10::OperatorHandle const&, at::Tensor const&, at::Tensor const&) const::{lambda(c10::DispatchTable const&)#1} (c10::DispatchTable const&)>::type c10::LeftRight<c10::DispatchTable>::read<c10::Dispatcher::callUnboxed<at::Tensor, at::Tensor const&, at::Tensor const&>(c10::OperatorHandle const&, at::Tensor const&, at::Tensor const&) const::{lambda(c10::DispatchTable const&)#1}>(c10::Dispatcher::callUnboxed<at::Tensor, at::Tensor const&, at::Tensor const&>(c10::OperatorHandle const&, at::Tensor const&, at::Tensor const&) const::{lambda(c10::DispatchTable const&)#1}&&) const () from /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch.so
#12 0x00007fff1dbbda28 in torch::autograd::VariableType::(anonymous namespace)::mul(at::Tensor const&, at::Tensor const&) () from /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch.so
#13 0x00007fff1bb28960 in c10::detail::wrap_kernel_functor_unboxed_<c10::detail::WrapRuntimeKernelFunctor_<at::Tensor (*)(at::Tensor const&, at::Tensor const&), at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, at::Tensor const&> >, at::Tensor (at::Tensor const&, at::Tensor const&)>::call(c10::OperatorKernel*, at::Tensor const&, at::Tensor const&) ()
   from /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch.so
#14 0x00007fff64a28e78 in at::Tensor::mul(at::Tensor const&) const () from /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so
#15 0x00007fff1d43786b in torch::autograd::generated::MulBackward0::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) () from /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch.so
#16 0x00007fff1dbedff6 in torch::autograd::Node::operator()(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) () from /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch.so
#17 0x00007fff1dbea453 in torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&) ()
   from /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch.so
#18 0x00007fff1dbeb082 in torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&, bool) () from /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch.so
#19 0x00007fff1dbe4979 in torch::autograd::Engine::thread_init(int) () from /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch.so
#20 0x00007fff647d308a in torch::autograd::python::PythonEngine::thread_init(int) () from /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so
#21 0x00007fffeb06adef in execute_native_thread_routine () from /usr/local/lib/python3.6/dist-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so
#22 0x00007ffff77cc6db in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#23 0x00007ffff7b0588f in clone () from /lib/x86_64-linux-gnu/libc.so.6

My local environment is as follows:

PyTorch version: 1.4.0
Is debug build: No
CUDA used to build PyTorch: 10.1

OS: Ubuntu 18.04.2 LTS
GCC version: (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
CMake version: version 3.14.5

Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: GeForce RTX 2080 Ti
GPU 1: GeForce RTX 2080 Ti
GPU 2: GeForce RTX 2080 Ti
GPU 3: GeForce RTX 2080 Ti

Nvidia driver version: 418.43
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.0

Versions of relevant libraries:
[pip3] numpy==1.17.1
[pip3] torch==1.4.0
[pip3] torchvision==0.5.0
[conda] Could not collect

Is there any result from the nightly-version verification discussed above?

mnaumovfb commented 3 years ago

As we were discussing above, can you please try using 1.5 or the nightly version of PyTorch?

I'm assuming it has worked in the past because there was no follow-up, but let me know if it works for you.

Davidrjx commented 3 years ago

> As we were discussing above, can you please try using 1.5 or the nightly version of PyTorch?
>
> I'm assuming it has worked in the past because there was no follow-up, but let me know if it works for you.

Very good, thank you for your reply @mnaumovfb. It's a pity that no one has replied back, but my colleague and I will do some work on this later, following your suggestion.