Deci-AI / super-gradients

Easily train or fine-tune SOTA computer vision models with one open source training library. The home of Yolo-NAS.
https://www.supergradients.com
Apache License 2.0
4.51k stars 489 forks

'Trainer' object has no attribute 'train_loader' error #967

Closed marvlyngkhoi closed 1 year ago

marvlyngkhoi commented 1 year ago

I'm trying to train YOLO-NAS on a custom dataset, and I get the error below when running the training on Google Colab using:

trainer.train(model=model, 
              training_params=train_params, 
              train_loader=train_data, 
              valid_loader=val_data)

image

Louis-Dupont commented 1 year ago

Hi @fleventy-5, I think the train_loader you passed is not a Dataloader but instead None, [], False, or some other "0" (falsy) value. Can you please check it?

momonoki3nenn commented 1 year ago

My version of super-gradients is 3.1.0

This is part of my code. train_data

train_data looks like a Dataloader, but I get the error "'Trainer' object has no attribute 'train_loader'".

No error occurs in YOLONAS Starter Notebook.

I get an error in my local environment. Why is that?

Tried and tested code:

trainer.train(model=model,
              train_params=train_params,
              train_loader=train_data,
              valid_loader=val_data)

I'll put up some of my code now. one_part

Louis-Dupont commented 1 year ago

@momonoki3nenn, @fleventy-5 , I did not manage to reproduce it, but we pushed a change that should fix it. It will be in the next release, but meanwhile, you can install SG from our repo directly :)

pip install git+https://github.com/Deci-AI/super-gradients

momonoki3nenn commented 1 year ago

@Louis-Dupont Thanks for responding! I installed it and tried it out. The error "'Trainer' object has no attribute 'train_loader'" no longer occurs, but I now get the following error in my environment. error

I wonder if sg_trainer.py, line 1211, is the cause.

I updated the Python version from 3.9.13 to 3.10.11 and tried again with the same results.

My PC GPU is RTX-3080 with 40GB memory. Is it difficult to run the YOLO-NAS train in my environment?

marvlyngkhoi commented 1 year ago

@Louis-Dupont I'm able to train now using a custom dataset on Colab notebooks. Thanks!

Louis-Dupont commented 1 year ago

@momonoki3nenn The StopIteration appears because we are trying to iterate over a Dataloader that is apparently empty (you can try next(iter(train_data)) and you will get the same error)

Another thing that supports this hypothesis is the fact that on the right of the "Caching annotation" line, there is "1/1" written, which shows that your dataset doesn't include multiple images/labels.

So now the question is: why is the dataloader empty? I would guess that either you don't point to the right path of your dataset, or your dataset doesn't have the right structure. Feel free to check this documentation page to see how your data should be structured to use this (or another) dataset. To test it simply, you can check its length or iterate over it: len(train_data) or next(iter(train_data)).
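The check suggested above can be wrapped in a small helper. This is only a minimal sketch and not part of super-gradients: `describe_loader` is a hypothetical name, and it assumes nothing beyond the object being iterable, so plain lists stand in for real dataloaders in the usage lines.

```python
from typing import Any, Iterable


def describe_loader(loader: Iterable[Any], name: str = "loader") -> str:
    """Report whether an iterable of batches yields at least one batch.

    `describe_loader` is a hypothetical helper used for illustration only.
    """
    # len() is optional: some iterables (e.g. plain iterators) do not define it.
    try:
        size = str(len(loader))  # type: ignore[arg-type]
    except TypeError:
        size = "unknown"

    # next(..., sentinel) avoids the StopIteration that a bare next() raises
    # when the iterable is empty.
    sentinel = object()
    first = next(iter(loader), sentinel)
    if first is sentinel:
        return f"{name}: EMPTY (len={size})"
    return f"{name}: ok (len={size})"


# Plain lists stand in for dataloaders here.
print(describe_loader([], "train_data"))
print(describe_loader([("images", "targets")], "val_data"))
```

Note that calling this on a real dataloader loads one batch, so it is a quick sanity check to run before trainer.train(...) rather than something to leave inside a training loop.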

Vikram12301 commented 1 year ago

Is there any fix found for the StopIteration issue? I got the same error

momonoki3nenn commented 1 year ago

@Louis-Dupont StopIteration occurred when the train and val sets each contained a single sample. When multiple train and val samples were used, StopIteration did not occur. There are times when I want to train on a single sample, so I hope you can handle this in the future!

As it turned out, a different error occurred once I used multiple samples:

Caching annotations: 100%|██████████| 23/23 [00:00<00:00, 2874.87it/s]
Caching annotations: 100%|██████████| 11/11 [00:00<00:00, 2628.01it/s]
Train epoch 0: 0%| | 0/2 [00:19<?, ?it/s]
[2023-05-13 18:30:35] ERROR - sg_trainer_utils.py - Uncaught exception
Traceback (most recent call last):
  File "C:\Users\zynas\AppData\Local\Programs\Python\Python39\lib\runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\zynas\AppData\Local\Programs\Python\Python39\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "c:\Users\zynas\.vscode\extensions\ms-python.python-2023.8.0\pythonFiles\lib\python\debugpy\__main__.py", line 39, in <module>
    cli.main()
  File "c:\Users\zynas\.vscode\extensions\ms-python.python-2023.8.0\pythonFiles\lib\python\debugpy/..\debugpy\server\cli.py", line 430, in main
    run()
  File "c:\Users\zynas\.vscode\extensions\ms-python.python-2023.8.0\pythonFiles\lib\python\debugpy/..\debugpy\server\cli.py", line 284, in run_file
    runpy.run_path(target, run_name="__main__")
  File "c:\Users\zynas\.vscode\extensions\ms-python.python-2023.8.0\pythonFiles\lib\python\debugpy\_vendored\pydevd\_pydevd_bundle\pydevd_runpy.py", line 321, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "c:\Users\zynas\.vscode\extensions\ms-python.python-2023.8.0\pythonFiles\lib\python\debugpy\_vendored\pydevd\_pydevd_bundle\pydevd_runpy.py", line 135, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "c:\Users\zynas\.vscode\extensions\ms-python.python-2023.8.0\pythonFiles\lib\python\debugpy\_vendored\pydevd\_pydevd_bundle\pydevd_runpy.py", line 124, in _run_code
    exec(code, run_globals)
  File "C:\projects\evaluate_yolo_nas\train.py", line 62, in <module>
    trainer.train(
  File "c:\projects\evaluate_yolo_nas\venv\lib\site-packages\super_gradients\training\sg_trainer\sg_trainer.py", line 1247, in train
    train_metrics_tuple = self._train_epoch(epoch=epoch, silent_mode=silent_mode)
  File "c:\projects\evaluate_yolo_nas\venv\lib\site-packages\super_gradients\training\sg_trainer\sg_trainer.py", line 442, in _train_epoch
    loss, loss_log_items = self._get_losses(outputs, targets)
  File "c:\projects\evaluate_yolo_nas\venv\lib\site-packages\super_gradients\training\sg_trainer\sg_trainer.py", line 475, in _get_losses
    loss = self.criterion(outputs, targets)
  File "c:\projects\evaluate_yolo_nas\venv\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "c:\projects\evaluate_yolo_nas\venv\lib\site-packages\super_gradients\training\losses\yolox_loss.py", line 155, in forward
    return self._compute_loss(predictions, targets)
  File "c:\projects\evaluate_yolo_nas\venv\lib\site-packages\super_gradients\training\losses\yolox_loss.py", line 181, in _compute_loss
    x_shifts, y_shifts, expanded_strides, transformed_outputs, raw_outputs = self.prepare_predictions(predictions)
  File "c:\projects\evaluate_yolo_nas\venv\lib\site-packages\super_gradients\training\losses\yolox_loss.py", line 335, in prepare_predictions
    batch_size, num_anchors, h, w, num_outputs = output.shape
ValueError: not enough values to unpack (expected 5, got 3)

I have verified that the correct path is specified. The format of each line in the label text files is: object (class) number, object center X coordinate, object center Y coordinate, object width, object height. (I use LabelImg for annotation.) Is there a problem with the train data?

Louis-Dupont commented 1 year ago

@Vikram12301, did you install the nightly?

pip install git+https://github.com/Deci-AI/super-gradients

If yes, do you have a single sample in your train set? Please also provide the full snippet of code you are using, including the dataset instantiation.

Louis-Dupont commented 1 year ago

@momonoki3nenn Yeah, it looks like it. Did you try with batch_size=1? My guess is that maybe you have batch_size > len(dataset), which means that the dataloader cannot prepare any full batch.
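The arithmetic behind this guess can be sketched as follows. The function mirrors how a PyTorch DataLoader over a map-style dataset counts its batches; `expected_num_batches` is a hypothetical name, and whether the loaders in this thread were created with drop_last=True is an assumption that the thread does not confirm.

```python
import math


def expected_num_batches(num_samples: int, batch_size: int, drop_last: bool) -> int:
    """Number of batches per epoch for a map-style dataset (hypothetical helper)."""
    if drop_last:
        # Incomplete batches are discarded, so fewer samples than one batch gives 0.
        return num_samples // batch_size
    # Otherwise the last, smaller batch is kept.
    return math.ceil(num_samples / batch_size)


# A single sample with a larger batch size: empty loader only if drop_last is set.
print(expected_num_batches(num_samples=1, batch_size=16, drop_last=True))   # 0
print(expected_num_batches(num_samples=1, batch_size=16, drop_last=False))  # 1
print(expected_num_batches(num_samples=1, batch_size=1, drop_last=True))    # 1
```

If the first case matches the configuration, either lowering batch_size or disabling drop_last would make the loader yield a batch.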

Concerning your other error: it comes from the code of YoloXDetectionLoss, but YoloNAS expects PPYoloELoss (as in the notebooks). If you are still working on YoloNAS, this could definitely explain your error. In that case, switch to PPYoloELoss as in the notebook. If not, could you please share your complete code snippet?
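The ValueError itself is plain Python tuple unpacking: the loss code unpacks the output shape into five names, but received a shape with three entries. A minimal sketch of that failure mode, using made-up shapes (the real tensor sizes are not given in the thread) and a hypothetical function name:

```python
def unpack_five_dims(shape):
    # Same unpacking pattern as the line shown in the traceback.
    batch_size, num_anchors, h, w, num_outputs = shape
    return batch_size, num_anchors, h, w, num_outputs


five_dim_shape = (2, 1, 80, 80, 85)  # made-up 5-D shape, the kind this unpacking expects
three_dim_shape = (2, 8400, 85)      # made-up 3-D shape

print(unpack_five_dims(five_dim_shape))
try:
    unpack_five_dims(three_dim_shape)
except ValueError as err:
    print(err)  # not enough values to unpack (expected 5, got 3)
```

This illustrates why pairing a model with a loss written for a different output layout fails at the first unpacking step rather than producing a wrong loss value.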

momonoki3nenn commented 1 year ago

@Louis-Dupont I tried two patterns and got errors in both. However, after modifying one of them, the error no longer occurred and training seems to have completed.

What I tried: commented out lines 853-856 in site-packages/treelib/tree.py. I take it as a success because training ran to completion.

I don't know if this modification is correct, but I will keep training this way for a while!

Louis-Dupont commented 1 year ago

@momonoki3nenn , can you try:

It looks like tree doesn't always encode to utf8 on windows even though it is supposed to

momonoki3nenn commented 1 year ago

@Louis-Dupont Thanks for the reply. My environment defaulted to utf-8. I checked and it seems to be a bug in treelib. Reference Site

def write(line):
    self._reader += line.decode("utf-8") + "\n"

Revision:

def write(line):
    self._reader += line + "\n"

The following error occurred after the correction.

File "c:\projects\evaluate_yolo_nas\venv\lib\site-packages\super_gradients\training\utils\sg_trainer_utils.py", line 257, in display_epoch_summary
  summary_tree.show()
File "c:\projects\evaluate_yolo_nas\venv\lib\site-packages\treelib\tree.py", line 848, in show
  self.__print_backend(nid, level, idhidden, filter,
File "c:\projects\evaluate_yolo_nas\venv\lib\site-packages\treelib\tree.py", line 222, in __print_backend
  func('{0}{1}'.format(pre, label).encode('utf-8'))
File "c:\projects\evaluate_yolo_nas\venv\lib\site-packages\treelib\tree.py", line 845, in write
  self._reader += line + "\n"
TypeError: can't concat str to bytes

So I fixed that as well.

func('{0}{1}'.format(pre, label).encode('utf-8'))

Revision:

func('{0}{1}'.format(pre, label))

Errors no longer occur. I don't know if the fix is right. What is certain is that it is not a YOLO-NAS bug.
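The TypeError above comes from adding a str to a bytes object, which Python 3 does not allow. The sketch below reproduces the message and shows one tolerant way to write such a method; `LineCollector` is a hypothetical stand-in written for illustration, not treelib's actual class.

```python
class LineCollector:
    """Hypothetical stand-in for an object that accumulates printed lines."""

    def __init__(self) -> None:
        self._reader = ""

    def write(self, line) -> None:
        # Accept both bytes and str so callers may pass either form.
        if isinstance(line, bytes):
            line = line.decode("utf-8")
        self._reader += line + "\n"


# Mixing the two types directly reproduces the error from the traceback.
encoded_line = "root".encode("utf-8")
newline = "\n"
try:
    encoded_line + newline
except TypeError as err:
    print(err)  # can't concat str to bytes

collector = LineCollector()
collector.write(encoded_line)  # bytes input
collector.write("child")       # str input
print(repr(collector._reader))
```

Handling both types in one place avoids having to patch the caller and the callee separately, which is what the two manual revisions above ended up doing.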

Louis-Dupont commented 1 year ago

@momonoki3nenn, thanks for the investigation! I don't understand why it fails in only very specific environments, so it's a bit hard for us to fix what we don't manage to reproduce ...

We might be able to fix it in SG if we understand exactly what leads to this encoding error (even if it is due to treelib's bad implementation of encoding).

First idea

The arrows might lead to this error. You can try to replace: https://github.com/Deci-AI/super-gradients/blob/a30fa8fdc623533df785831f7457967066fb2ebe/src/super_gradients/training/utils/sg_trainer_utils.py#L41-L50 with

 def to_symbol(self) -> str: 
     """Get the symbol representing the current increase type""" 
     if self == IncreaseType.NONE: 
         return "" 
     elif self == IncreaseType.IS_GREATER: 
         return "[UP]" 
     elif self == IncreaseType.IS_SMALLER: 
         return "[DOWN]" 
     else: 
         return "=" 
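For trying the replacement outside of super-gradients, here is a self-contained sketch of the same method. The member names NONE, IS_GREATER and IS_SMALLER come from the snippet above; the fourth member (IS_EQUAL) and all member values are assumptions made only so that the example runs.

```python
from enum import Enum


class IncreaseType(Enum):
    # NONE, IS_GREATER and IS_SMALLER appear in the snippet above;
    # IS_EQUAL and every value here are assumptions for this sketch.
    NONE = "none"
    IS_GREATER = "greater"
    IS_SMALLER = "smaller"
    IS_EQUAL = "equal"

    def to_symbol(self) -> str:
        """Get an ASCII-only symbol representing the current increase type"""
        if self == IncreaseType.NONE:
            return ""
        elif self == IncreaseType.IS_GREATER:
            return "[UP]"
        elif self == IncreaseType.IS_SMALLER:
            return "[DOWN]"
        else:
            return "="


symbols = [member.to_symbol() for member in IncreaseType]
print(symbols)
# Every symbol is plain ASCII, so it can be encoded by any common console codec.
print(all(symbol.isascii() for symbol in symbols))
```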

Second idea

The colored() function might lead to this error. You can try to replace: https://github.com/Deci-AI/super-gradients/blob/a30fa8fdc623533df785831f7457967066fb2ebe/src/super_gradients/training/utils/sg_trainer_utils.py#L231-L237 with

diff_with_prev_colored = f"{monitored_value.has_increased_from_previous.to_symbol()} {change_from_previous}"
diff_with_best_colored = f"{monitored_value.has_increased_from_best.to_symbol()} {change_from_best}"

Third idea

My third guess is to do both at the same time.

If none of these works, then it's probably just that treelib doesn't even work with plain text in your case, which means that we need an alternative.

momonoki3nenn commented 1 year ago

@Louis-Dupont First idea no longer causes errors. The arrows were the cause. Thanks for letting me know.
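The thread does not establish exactly why the arrows failed in this environment. One way to probe it is to check which codecs can represent a given symbol. This is a minimal sketch: `can_encode` is a hypothetical helper, and the arrow character used below is an assumed example, not necessarily the exact symbol super-gradients prints.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if `text` can be represented in `encoding` (hypothetical helper)."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


arrow = "\u2197"  # NORTH EAST ARROW, an assumed example of a non-ASCII symbol
for codec in ("ascii", "cp1252", "utf-8"):
    print(codec, can_encode(arrow, codec), can_encode("[UP]", codec))
```

Running this with the encoding reported by sys.stdout.encoding on the affected machine would show whether the console codec can represent the original symbols.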