Deci-AI / super-gradients

Easily train or fine-tune SOTA computer vision models with one open source training library. The home of Yolo-NAS.
https://www.supergradients.com
Apache License 2.0

Error training yolo_nas_l #1997

Open Vdol22 opened 1 month ago

Vdol22 commented 1 month ago

πŸ’‘ Your Question

Hi! I'm stuck trying to train yolo_nas_l on custom data. I've followed several guides and notebooks, yet I constantly run into the same error: "You can use sliding window validation callback, but your model does not support sliding window inference. Please either remove the callback or use the model that supports sliding inference: "Segformer". Here's the code:

from super_gradients.common.object_names import Models
from super_gradients.training import Trainer, models
from super_gradients.training.losses import PPYoloELoss
from super_gradients.training.models.detection_models.pp_yolo_e import PPYoloEPostPredictionCallback
from super_gradients import init_trainer
from super_gradients.training.dataloaders.dataloaders import (
    coco_detection_yolo_format_train, 
    coco_detection_yolo_format_val
)
from super_gradients.training.metrics import DetectionMetrics

trainer = Trainer(experiment_name="YOLO_LEARN", ckpt_root_dir="checkpoints")
model = models.get(model_name=Models.YOLO_NAS_L, num_classes=1, pretrained_weights="coco")

dataset_params = {
    'data_dir': '.',
    'train_images_dir': 'images/train',
    'train_labels_dir': 'labels/train',
    'val_images_dir': 'images/val',
    'val_labels_dir': 'labels/val',
    'classes': ['person']
}

BATCH_SIZE = 8
WORKERS = 1

train_loader = coco_detection_yolo_format_train(
    dataset_params={
        'data_dir': dataset_params['data_dir'],
        'images_dir': dataset_params['train_images_dir'],
        'labels_dir': dataset_params['train_labels_dir'],
        'classes': dataset_params['classes']
    },
    dataloader_params={
        'batch_size': BATCH_SIZE,
        'num_workers': WORKERS
    }
)

valid_loader = coco_detection_yolo_format_val(
    dataset_params={
        'data_dir': dataset_params['data_dir'],
        'images_dir': dataset_params['val_images_dir'],
        'labels_dir': dataset_params['val_labels_dir'],
        'classes': dataset_params['classes']
    },
    dataloader_params={
        'batch_size': BATCH_SIZE,
        'num_workers': WORKERS
    }
)

training_params = {
    "max_epochs": 300,
    "warmup_mode": "LinearBatchLRWarmup",
    "warmup_initial_lr": 1e-6,
    "lr_warmup_epochs": 3,
    "initial_lr": 5e-4,
    "lr_mode": "CosineLRScheduler",
    "cosine_final_lr_ratio": 0.1,
    "loss": PPYoloELoss(
        use_static_assigner=False,
        num_classes=1
    ),
    "optimizer": "AdamW",
    "optimizer_params": {"weight_decay": 0.0001},
    "ema": True,
    "ema_params": {"decay": 0.9997, "decay_type": "threshold"},
    "valid_metrics_list": [
        DetectionMetrics(
            score_thres=0.1,
            top_k_predictions=300,
            num_cls=1,
            normalize_targets=True,
            post_prediction_callback=PPYoloEPostPredictionCallback(
                score_threshold=0.01,
                nms_top_k=1000,
                max_predictions=300,
                nms_threshold=0.7
            )
        )
    ],
    "metric_to_watch": 'mAP@0.50:0.95',
    "greater_metric_to_watch_is_better": True
}

trainer.train(model=model, training_params=training_params, train_loader=train_loader, valid_loader=valid_loader)

Here's the output:

Indexing dataset annotations: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 4/4 [00:00<00:00, 1999.91it/s]
Indexing dataset annotations: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 3/3 [00:00<00:00, 2981.73it/s]

StopIteration                             Traceback (most recent call last)
Cell In[9], line 1
----> 1 trainer.train(model=model, training_params=training_params, train_loader=train_loader, valid_loader=valid_loader)

File ~\AppData\Roaming\Python\Python39\site-packages\super_gradients\training\sg_trainer\sg_trainer.py:1482, in Trainer.train(self, model, training_params, train_loader, valid_loader, test_loaders, additional_configs_to_log)
   1475     raise ValueError(
   1476         "You can use sliding window validation callback, but your model does not support sliding window "
   1477         "inference. Please either remove the callback or use the model that supports sliding inference: "
   1478         "Segformer"
   1479     )
   1481 if isinstance(model, SupportsInputShapeCheck):
-> 1482     first_train_batch = next(iter(self.train_loader))
   1483     inputs, _, _ = sg_trainer_utils.unpack_batch_items(first_train_batch)
   1484     model.validate_input_shape(inputs.size())

File C:\utils\anaconda3\envs\py39\lib\site-packages\torch\utils\data\dataloader.py:631, in _BaseDataLoaderIter.__next__(self)
    628 if self._sampler_iter is None:
    629     # TODO(https://github.com/pytorch/pytorch/issues/76750)
    630     self._reset()  # type: ignore[call-arg]
--> 631 data = self._next_data()
    632 self._num_yielded += 1
    633 if self._dataset_kind == _DatasetKind.Iterable and \
    634         self._IterableDataset_len_called is not None and \
    635         self._num_yielded > self._IterableDataset_len_called:

File C:\utils\anaconda3\envs\py39\lib\site-packages\torch\utils\data\dataloader.py:1319, in _MultiProcessingDataLoaderIter._next_data(self)
   1317     if not self._persistent_workers:
   1318         self._shutdown_workers()
-> 1319     raise StopIteration
   1321 # Now `self._rcvd_idx` is the batch index we want to fetch
   1322 
   1323 # Check if the next sample has already been generated
   1324 if len(self._task_info[self._rcvd_idx]) == 2:

Please help, your lib looks so promising, yet I don't understand what I'm doing wrong.

Versions

PyTorch version: 2.3.0
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Windows 11 Pro
GCC version: Could not collect
Clang version: Could not collect
CMake version: Could not collect
Libc version: N/A

Python version: 3.9.19 (main, May 6 2024, 20:12:36) [MSC v.1916 64 bit (AMD64)] (64-bit runtime)
Python platform: Windows-10-10.0.22631-SP0
Is CUDA available: True
CUDA runtime version: 12.1.66
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 2050
Nvidia driver version: 552.22
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU: Revision=

Versions of relevant libraries:
[pip3] numpy==1.23.0
[pip3] onnx==1.15.0
[pip3] onnx-simplifier==0.4.36
[pip3] onnxruntime==1.15.0
[pip3] onnxsim==0.4.36
[pip3] torch==2.3.0
[pip3] torchaudio==2.3.0
[pip3] torchmetrics==0.8.0
[pip3] torchvision==0.18.0
[conda] blas 1.0 mkl
[conda] mkl 2021.4.0 pypi_0 pypi
[conda] mkl-service 2.4.0 py39h2bbff1b_0
[conda] mkl_fft 1.3.1 py39h277e83a_0
[conda] mkl_random 1.2.2 py39hf11a4ad_0
[conda] numpy 1.23.0 pypi_0 pypi
[conda] numpy-base 1.24.3 py39h005ec55_0
[conda] pytorch 2.3.0 py3.9_cuda12.1_cudnn8_0 pytorch
[conda] pytorch-cuda 12.1 hde6ce7c_5 pytorch
[conda] pytorch-mutex 1.0 cuda pytorch
[conda] torch 2.3.0 pypi_0 pypi
[conda] torchaudio 2.3.0 pypi_0 pypi
[conda] torchmetrics 0.8.0 pypi_0 pypi
[conda] torchvision 0.18.0 pypi_0 pypi

BloodAxe commented 1 month ago

I don't think it has anything to do with sliding window inference. That message just happens to sit nearby in the file the stack trace points into. If you look closely at the stack trace, you will see "-->" markers indicating where the error is coming from. Overall it looks like you have some exception happening in the DataLoader while it is trying to assemble a batch.

To narrow this down, I suggest testing whether you can get a single batch or sample from the dataset. To simplify debugging, it's better to turn off all workers (num_workers: 0) when creating the DataLoader. That way you will get the exception in the main process with a better error message that should hopefully give you a clear picture of what is happening. Looking forward to seeing that error message.
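
A minimal sketch of that check, reusing the loader factory and dataset_params from the code above (the only change is num_workers set to 0):

debug_loader = coco_detection_yolo_format_train(
    dataset_params={
        'data_dir': dataset_params['data_dir'],
        'images_dir': dataset_params['train_images_dir'],
        'labels_dir': dataset_params['train_labels_dir'],
        'classes': dataset_params['classes']
    },
    dataloader_params={'batch_size': BATCH_SIZE, 'num_workers': 0}
)

# Fetch a single sample straight from the dataset, bypassing the sampler.
sample = debug_loader.dataset[0]

# Fetch a single collated batch; with num_workers=0 any exception is raised
# in the main process with its real traceback instead of being swallowed
# inside a worker process.
batch = next(iter(debug_loader))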

Vdol22 commented 1 month ago

Thank you kindly for the quick reply.

I turned off all workers (num_workers: 0) when creating the DataLoader. There it is:

StopIteration                             Traceback (most recent call last)
Cell In[9], line 1
----> 1 trainer.train(model=model, training_params=training_params, train_loader=train_loader, valid_loader=valid_loader)

File ~\AppData\Roaming\Python\Python39\site-packages\super_gradients\training\sg_trainer\sg_trainer.py:1482, in Trainer.train(self, model, training_params, train_loader, valid_loader, test_loaders, additional_configs_to_log)
   1475     raise ValueError(
   1476         "You can use sliding window validation callback, but your model does not support sliding window "
   1477         "inference. Please either remove the callback or use the model that supports sliding inference: "
   1478         "Segformer"
   1479     )
   1481 if isinstance(model, SupportsInputShapeCheck):
-> 1482     first_train_batch = next(iter(self.train_loader))
   1483     inputs, _, _ = sg_trainer_utils.unpack_batch_items(first_train_batch)
   1484     model.validate_input_shape(inputs.size())

File C:\utils\anaconda3\envs\py39\lib\site-packages\torch\utils\data\dataloader.py:631, in _BaseDataLoaderIter.__next__(self)
    628 if self._sampler_iter is None:
    629     # TODO(https://github.com/pytorch/pytorch/issues/76750)
    630     self._reset()  # type: ignore[call-arg]
--> 631 data = self._next_data()
    632 self._num_yielded += 1
    633 if self._dataset_kind == _DatasetKind.Iterable and \
    634         self._IterableDataset_len_called is not None and \
    635         self._num_yielded > self._IterableDataset_len_called:

File C:\utils\anaconda3\envs\py39\lib\site-packages\torch\utils\data\dataloader.py:674, in _SingleProcessDataLoaderIter._next_data(self)
    673 def _next_data(self):
--> 674     index = self._next_index()  # may raise StopIteration
    675     data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
    676     if self._pin_memory:

File C:\utils\anaconda3\envs\py39\lib\site-packages\torch\utils\data\dataloader.py:621, in _BaseDataLoaderIter._next_index(self)
    620 def _next_index(self):
--> 621     return next(self._sampler_iter)

Vdol22 commented 1 month ago

After some debugging I found out that running this:

train_images_dir = dataset_params['train_images_dir']
train_labels_dir = dataset_params['train_labels_dir']
val_images_dir = dataset_params['val_images_dir']
val_labels_dir = dataset_params['val_labels_dir']

train_loader_iter = iter(train_loader)
try:
    train_batch = next(train_loader_iter)
    display("Train Batch:", train_batch)
except StopIteration:
    display("No data fetched from train_loader")

valid_loader_iter = iter(valid_loader)
try:
    valid_batch = next(valid_loader_iter)
    display("Valid Batch:", valid_batch)
except StopIteration:
    display("No data fetched from valid_loader")

results in 'No data fetched from train_loader'. However, the valid_loader works just fine.
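
For what it's worth, a quick way to tell an empty sampler apart from a broken sample (a sketch using the loaders above, not from the original debugging session):

# If len(loader.dataset) > 0 but len(loader) == 0, the sampler yields no
# batches at all (for example drop_last=True with fewer samples than the
# batch size), so next(iter(loader)) raises StopIteration without ever
# touching a sample.
print(len(train_loader.dataset), len(train_loader))
print(len(valid_loader.dataset), len(valid_loader))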

Vdol22 commented 1 month ago

UPD: removing worker_init_fn from the training dataloader's params seems to have gotten it running:

train_loader = coco_detection_yolo_format_train(
    dataset_params={
        'data_dir': dataset_params['data_dir'],
        'images_dir': dataset_params['train_images_dir'],
        'labels_dir': dataset_params['train_labels_dir'],
        'classes': dataset_params['classes']
    },
    dataloader_params={
        'batch_size': BATCH_SIZE,
        'num_workers': WORKERS,
        'shuffle': True,
        'drop_last': False,
        'pin_memory': True,
        # 'worker_init_fn': {
        #     '_target_': 'super_gradients.training.utils.utils.load_func',
        #     'dotpath': 'super_gradients.training.datasets.datasets_utils.worker_init_reset_seed'
        # },
        'collate_fn': 'DetectionCollateFN'
    }
)

It is strange, though, that the epoch progress bar now shows 1/1. There are only 4 photos in my dataset (since I was just trying to get training to run), so maybe that's the cause.
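
The 1/1 bar is consistent with that arithmetic (a sketch using the numbers from this thread):

import math

NUM_IMAGES = 4   # train images indexed in the log above
BATCH_SIZE = 8

# With drop_last=False the sampler emits one partial batch per epoch:
print(math.ceil(NUM_IMAGES / BATCH_SIZE))   # -> 1, hence the 1/1 progress bar

# If drop_last had been True before (an assumption about the previous
# configuration, not verified here), the same numbers would give an empty
# loader, and next(iter(loader)) would raise StopIteration immediately:
print(NUM_IMAGES // BATCH_SIZE)             # -> 0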