Deci-AI / super-gradients

Easily train or fine-tune SOTA computer vision models with one open source training library. The home of Yolo-NAS.
https://www.supergradients.com
Apache License 2.0

Training crashes on nth item #1342

Closed · mmoollllee closed this issue 1 year ago

mmoollllee commented 1 year ago

💡 Your Question

I'm trying to train on a custom dataset in Google Colab. The runtime always crashes on the 43rd item in epoch 0. Is there a way to tell which image is being processed at that moment?

The console stream is now moved to /content/drive/MyDrive/checkpoints/project2/console_Aug04_11_20_55.txt

[2023-08-04 11:20:56] INFO - sg_trainer.py - Using EMA with params {'decay': 0.9, 'decay_type': 'threshold'}
[2023-08-04 11:21:03] INFO - sg_trainer_utils.py - TRAINING PARAMETERS:
    - Mode:                         Single GPU
    - Number of GPUs:               1          (1 available on the machine)
    - Dataset size:                 10138      (len(train_set))
    - Batch size per GPU:           7          (batch_size)
    - Batch Accumulate:             1          (batch_accumulate)
    - Total batch size:             7          (num_gpus * batch_size)
    - Effective Batch size:         7          (num_gpus * batch_size * batch_accumulate)
    - Iterations per epoch:         1448       (len(train_loader))
    - Gradient updates per epoch:   1448       (len(train_loader) / batch_accumulate)

[2023-08-04 11:21:03] INFO - sg_trainer.py - Started training for 50 epochs (0/49)

Train epoch 0:   3%|▎         | 43/1448 [00:58<22:03,  1.06it/s, PPYoloELoss/loss=3.89, PPYoloELoss/loss_cls=2.43, PPYoloELoss/loss_dfl=1.14, PPYoloELoss/loss_iou=0.354, gpu_mem=10.5]

Versions

No response
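On the "which image" question, one way to narrow it down is to walk the dataset outside the trainer and report the first sample that fails to load or parse. This is a minimal sketch, not from the original notebook; `train_dataset` is a placeholder for whatever dataset object feeds the train dataloader:

```python
# Minimal sketch: iterate the dataset directly (outside the Trainer) and report
# any sample whose image loading or label parsing raises.
# `train_dataset` is a placeholder name for the dataset passed to the dataloader.
for idx in range(len(train_dataset)):
    try:
        _ = train_dataset[idx]  # triggers the image read + target transforms
    except Exception as exc:
        print(f"Sample {idx} failed: {exc}")
        break
```

If every sample loads cleanly, the crash is happening on the GPU side instead, which points at the label values rather than at reading the files (see the assertions in the next comment).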

mmoollllee commented 1 year ago

By running everything on the local system I now get hundreds of lines of the following error message:

[...]
/opt/conda/conda-bld/pytorch_1682343967769/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [7024,0,0], thread: [60,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1682343967769/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [7024,0,0], thread: [61,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1682343967769/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [7024,0,0], thread: [62,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1682343967769/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [7024,0,0], thread: [63,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.

Here is the Jupyter notebook I'm using via JarvisLabs.ai: https://github.com/mmoollllee/yolo-nas-object-blurring/blob/main/notebooks/train-yolo-nas.ipynb
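These device-side asserts are reported asynchronously, so the stack location shown is usually not the real call site. A quick sketch of standard PyTorch debugging (not specific to this notebook): set `CUDA_LAUNCH_BLOCKING=1` before CUDA is initialized so the failing operation surfaces with a synchronous, readable traceback:

```python
# Sketch: force synchronous CUDA kernel launches so the assert points at the
# actual failing op. Must be set before the first CUDA call in the process.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
```

Alternatively, running a single batch on the CPU turns the device-side assert into a plain Python IndexError with a full traceback.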

mmoollllee commented 1 year ago

Got it. The error occurs when labels extend beyond the edges of the images, which happens in Label Studio quite often. I imported the dataset into Roboflow (which already flagged the problem on upload), re-exported it, and the problem is gone :)
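For anyone hitting the same thing, here is a minimal pre-training sanity check. It assumes YOLO-format `.txt` labels with normalized `class cx cy w h` lines and a hypothetical label folder path; adjust the parsing if your export uses a different format:

```python
# Sketch: flag YOLO-format label files whose boxes fall outside the [0, 1] image range.
from pathlib import Path

LABEL_DIR = Path("dataset/labels/train")  # hypothetical path; point at your label folder

for label_file in sorted(LABEL_DIR.glob("*.txt")):
    for line_no, line in enumerate(label_file.read_text().splitlines(), start=1):
        parts = line.split()
        if len(parts) != 5:
            continue  # skip empty or malformed lines
        _cls, cx, cy, w, h = parts
        cx, cy, w, h = float(cx), float(cy), float(w), float(h)
        x1, y1, x2, y2 = cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2
        if x1 < 0 or y1 < 0 or x2 > 1 or y2 > 1:
            print(f"{label_file.name}:{line_no} box extends past the image: {line}")
```

Boxes flagged here can be clipped to the [0, 1] range or fixed at the annotation source before re-exporting.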