Open Vdol22 opened 1 month ago
I don't think it has anything to do with sliding window inference. That check just happens to be nearby in the file where the stack trace is printed. If you look closely at the stack trace, you will see "->" symbols indicating where the error is coming from. Overall, it looks like an exception is being raised in the DataLoader while it is trying to make a batch.
To track this down, I suggest testing whether you can get a single batch or sample from the dataset. To simplify debugging, it's better to turn off all workers (num_workers: 0) when creating the DataLoader. This way you will get the exception in the main process, with a clearer exception message that should hopefully give you a clear picture of what is happening. Looking forward to seeing this error message.
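A minimal sketch of that check, assuming the loader exposes a `.dataset` attribute (as `torch.utils.data.DataLoader` does) and that the dataset is map-style, so it supports `len()` and indexing. The helper name `probe_loader` is hypothetical and is not part of super_gradients or PyTorch:

```python
def probe_loader(loader):
    """Try to fetch one sample and one batch; return a short report dict."""
    report = {}
    dataset = loader.dataset

    # How many samples does the dataset think it has?
    try:
        report["num_samples"] = len(dataset)
    except TypeError:
        # Iterable-style datasets may not define __len__.
        report["num_samples"] = None

    # Can a single sample be loaded directly, bypassing batching?
    try:
        dataset[0]
        report["first_sample"] = "ok"
    except Exception as exc:
        report["first_sample"] = f"{type(exc).__name__}: {exc}"

    # Can the loader produce at least one batch?
    try:
        next(iter(loader))
        report["first_batch"] = "ok"
    except StopIteration:
        report["first_batch"] = "loader yielded no batches"
    except Exception as exc:
        report["first_batch"] = f"{type(exc).__name__}: {exc}"

    return report


# Usage, with a loader created with num_workers set to 0:
# print(probe_loader(train_loader))
```

If the sample loads but the loader yields no batches, the problem is more likely in the batching or sampling settings than in the image and label files themselves.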
Thank you kindly for a brief reply.
> turn off all workers (num_workers: 0) when creating a DataLoader.
Here it is:
StopIteration Traceback (most recent call last)
Cell In[9], line 1
----> 1 trainer.train(model=model, training_params=training_params, train_loader=train_loader, valid_loader=valid_loader)
File ~\AppData\Roaming\Python\Python39\site-packages\super_gradients\training\sg_trainer\sg_trainer.py:1482, in Trainer.train(self, model, training_params, train_loader, valid_loader, test_loaders, additional_configs_to_log)
1475 raise ValueError(
1476 "You can use sliding window validation callback, but your model does not support sliding window "
1477 "inference. Please either remove the callback or use the model that supports sliding inference: "
1478 "Segformer"
1479 )
1481 if isinstance(model, SupportsInputShapeCheck):
-> 1482 first_train_batch = next(iter(self.train_loader))
1483 inputs, _, _ = sg_trainer_utils.unpack_batch_items(first_train_batch)
1484 model.validate_input_shape(inputs.size())
File C:\utils\anaconda3\envs\py39\lib\site-packages\torch\utils\data\dataloader.py:631, in _BaseDataLoaderIter.__next__(self)
628 if self._sampler_iter is None:
629 # TODO(https://github.com/pytorch/pytorch/issues/76750)
630 self._reset() # type: ignore[call-arg]
--> 631 data = self._next_data()
632 self._num_yielded += 1
633 if self._dataset_kind == _DatasetKind.Iterable and \
634 self._IterableDataset_len_called is not None and \
635 self._num_yielded > self._IterableDataset_len_called:
File C:\utils\anaconda3\envs\py39\lib\site-packages\torch\utils\data\dataloader.py:674, in _SingleProcessDataLoaderIter._next_data(self)
673 def _next_data(self):
--> 674 index = self._next_index() # may raise StopIteration
675 data = self._dataset_fetcher.fetch(index) # may raise StopIteration
676 if self._pin_memory:
File C:\utils\anaconda3\envs\py39\lib\site-packages\torch\utils\data\dataloader.py:621, in _BaseDataLoaderIter._next_index(self)
620 def _next_index(self):
--> 621 return next(self._sampler_iter)
After some debugging I found that running this check:
train_images_dir = dataset_params['train_images_dir']
train_labels_dir = dataset_params['train_labels_dir']
val_images_dir = dataset_params['val_images_dir']
val_labels_dir = dataset_params['val_labels_dir']

# Try to pull one batch from the training loader
train_loader_iter = iter(train_loader)
try:
    train_batch = next(train_loader_iter)
    display("Train Batch:", train_batch)
except StopIteration:
    display("No data fetched from train_loader")

# Try to pull one batch from the validation loader
valid_loader_iter = iter(valid_loader)
try:
    valid_batch = next(valid_loader_iter)
    display("Valid Batch:", valid_batch)
except StopIteration:
    display("No data fetched from valid_loader")
results in 'No data fetched from train_loader'. However, the valid_loader works just fine.
UPD: removing worker_init_fn from the training dataloader seems to have gotten training to start:
train_loader = coco_detection_yolo_format_train(
    dataset_params={
        'data_dir': dataset_params['data_dir'],
        'images_dir': dataset_params['train_images_dir'],
        'labels_dir': dataset_params['train_labels_dir'],
        'classes': dataset_params['classes']
    },
    dataloader_params={
        'batch_size': BATCH_SIZE,
        'num_workers': WORKERS,
        'shuffle': True,
        'drop_last': False,
        'pin_memory': True,
        # 'worker_init_fn': {
        #     '_target_': 'super_gradients.training.utils.utils.load_func',
        #     'dotpath': 'super_gradients.training.datasets.datasets_utils.worker_init_reset_seed'
        # },
        'collate_fn': 'DetectionCollateFN'
    }
)
It is strange, though, that the progress bar for an epoch now shows 1/1. There are only 4 photos in my dataset (I was just trying to get training to run), so maybe that's the reason.
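A 1/1 progress bar is what one would expect from 4 images when the batch size is 4 or larger and drop_last is False. The sketch below reproduces the batch-count arithmetic that PyTorch uses for a map-style dataset with the default batch sampler. The value of BATCH_SIZE is not shown in this thread, so the sizes below are illustrative assumptions, and `num_batches` is a hypothetical helper name:

```python
import math


def num_batches(num_samples: int, batch_size: int, drop_last: bool) -> int:
    """Number of batches per epoch for a map-style dataset."""
    if drop_last:
        # An incomplete final batch is discarded.
        return num_samples // batch_size
    # An incomplete final batch is kept.
    return math.ceil(num_samples / batch_size)


# 4 images, as in this thread; the batch sizes are examples only.
for batch_size in (2, 4, 16):
    for drop_last in (False, True):
        count = num_batches(4, batch_size, drop_last)
        print(f"batch_size={batch_size}, drop_last={drop_last}: {count} batch(es)")
```

Note the case where the batch size exceeds the dataset size and drop_last is True: the loader then has zero batches, and `next(iter(loader))` raises StopIteration. Whether drop_last was True in the failing run is not shown here, so this is only one possible explanation for the earlier error.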
💡 Your Question
Hi! I'm stuck trying to train yolo_nas_l on custom data. I have followed several guides and notebooks, yet I constantly run into one error: "You can use sliding window validation callback, but your model does not support sliding window inference. Please either remove the callback or use the model that supports sliding inference: Segformer". Here's the code:
Here's the output:
Please help, your lib looks so promising, yet I don't understand what I'm doing wrong.
Versions
PyTorch version: 2.3.0
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Windows 11 Pro
GCC version: Could not collect
Clang version: Could not collect
CMake version: Could not collect
Libc version: N/A
Python version: 3.9.19 (main, May 6 2024, 20:12:36) [MSC v.1916 64 bit (AMD64)] (64-bit runtime)
Python platform: Windows-10-10.0.22631-SP0
Is CUDA available: True
CUDA runtime version: 12.1.66
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 2050
Nvidia driver version: 552.22
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU: Revision=
Versions of relevant libraries:
[pip3] numpy==1.23.0
[pip3] onnx==1.15.0
[pip3] onnx-simplifier==0.4.36
[pip3] onnxruntime==1.15.0
[pip3] onnxsim==0.4.36
[pip3] torch==2.3.0
[pip3] torchaudio==2.3.0
[pip3] torchmetrics==0.8.0
[pip3] torchvision==0.18.0
[conda] blas 1.0 mkl
[conda] mkl 2021.4.0 pypi_0 pypi
[conda] mkl-service 2.4.0 py39h2bbff1b_0
[conda] mkl_fft 1.3.1 py39h277e83a_0
[conda] mkl_random 1.2.2 py39hf11a4ad_0
[conda] numpy 1.23.0 pypi_0 pypi
[conda] numpy-base 1.24.3 py39h005ec55_0
[conda] pytorch 2.3.0 py3.9_cuda12.1_cudnn8_0 pytorch
[conda] pytorch-cuda 12.1 hde6ce7c_5 pytorch
[conda] pytorch-mutex 1.0 cuda pytorch
[conda] torch 2.3.0 pypi_0 pypi
[conda] torchaudio 2.3.0 pypi_0 pypi
[conda] torchmetrics 0.8.0 pypi_0 pypi
[conda] torchvision 0.18.0 pypi_0 pypi