deephealthproject / eddl

European Distributed Deep Learning (EDDL) library. A general-purpose library initially developed to cover deep learning needs in healthcare use cases within the DeepHealth project.
https://deephealthproject.github.io/eddl/
MIT License

LSTM training fail on single GPU, but not with multiple GPUs #338

Open thistlillo opened 2 years ago

thistlillo commented 2 years ago

With the latest versions of EDDL (1.2.0) and ECVL (1.1.0), I get a CUDA error when training the model on a single GPU. I have no problems when using 2 or 4 GPUs. The error occurs systematically at the beginning of the third epoch and does not seem to depend on the batch size. It also does not depend on the memory-consumption parameter ("full_mem", "mid_mem", or "low_mem"); I tried all of them. The GPU is an NVIDIA V100. With previous versions of the libraries this error did not occur (but I was using a different GPU).

Traceback (most recent call last):
  File "C01_2_rec_mod_edll.py", line 98, in <module>
    fire.Fire({
  File "/root/miniconda3/envs/eddl/lib/python3.8/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/root/miniconda3/envs/eddl/lib/python3.8/site-packages/fire/core.py", line 466, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/root/miniconda3/envs/eddl/lib/python3.8/site-packages/fire/core.py", line 681, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "C01_2_rec_mod_edll.py", line 46, in train
    rec_mod.train()
  File "/mnt/datasets/uc5/UC5_pipeline_forked/src/eddl_lib/recurrent_module.py", line 289, in train
    eddl.train_batch(rnn, [cnn_visual, thresholded], [Y])
  File "/root/miniconda3/envs/eddl/lib/python3.8/site-packages/pyeddl/eddl.py", line 435, in train_batch
    return _eddl.train_batch(net, in_, out)
RuntimeError: [CUDA ERROR]: invalid argument (1) raised in delete_tensor | (check_cuda)

The code is not yet available in the repository; please let me know what details I can add.

salvacarrion commented 2 years ago

Can you send a minimal script to debug it? With the information provided so far, I'm a bit lost.

bernia commented 2 years ago

Hello @thistlillo, we have been debugging this issue but have not been able to reproduce the problem. Our tests run past five epochs in both configurations, with 1 and 2 GPUs. Do you think a virtual meeting would help?

thistlillo commented 2 years ago

Hello @bernia, and sorry for this late reply, but I did not receive any notification from GitHub about your answer. I have now installed version 1.3 and will run some more tests next week. I will report back here.

The code published for UC5 is not up to date; it now also uses the ECVL dataloader. I work on a fork that I periodically merge back after cleaning up the code. I will also try to update the repository with the clean code.

thistlillo commented 2 years ago

Hello, I have found the cause of the issue. It is related to the size of the last batch: when the last batch contains fewer than "batch size" items, training of an LSTM-based network fails. Training does not fail when the same smaller last batch is kept while training a convolutional neural network (ResNet-18 in my case).

Contrary to what I said earlier, the LSTM training fails both on a single GPU and on multiple GPUs. I was able to replicate the issue with the latest versions of ECVL and EDDL, both cuDNN-enabled and not.
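Until the underlying bug is fixed, one possible workaround on the data side is to simply skip the incomplete final batch before calling `train_batch`. Below is a minimal, hypothetical sketch in plain Python/NumPy (the helper name `full_batches` and the array shapes are illustrative, not part of the EDDL/pyeddl API):

```python
import numpy as np

def full_batches(n_samples, batch_size, drop_last=True):
    """Yield (start, end) index ranges over the dataset.

    With drop_last=True, a final batch smaller than batch_size is
    skipped, avoiding the partial batch that triggers the CUDA error.
    """
    for start in range(0, n_samples, batch_size):
        end = min(start + batch_size, n_samples)
        if drop_last and end - start < batch_size:
            break  # skip the incomplete last batch
        yield start, end

# Hypothetical dataset: 100 samples with batch size 32.
# The 4-sample remainder (indices 96..99) is dropped.
X = np.random.rand(100, 10).astype(np.float32)
for start, end in full_batches(len(X), 32):
    batch = X[start:end]
    # ... feed `batch` to eddl.train_batch(...) here ...
    assert batch.shape[0] == 32
```

This trades a few samples per epoch for stable training; shuffling the dataset each epoch ensures different samples are dropped each time.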