Chris-hughes10 / Yolov7-training

A clean, modular implementation of the Yolov7 model family, which uses the official pretrained weights, with utilities for training the model on custom (non-COCO) tasks.

AttributeError: 'DistributedDataParallel' object has no attribute 'postprocess' #18

Closed ross-Hr closed 1 year ago

ross-Hr commented 1 year ago

When I run the code on 2 GPUs (`torchrun --nproc_per_node=2 train_cars.py`, or `notebook_launcher(main, num_processes=2)` in a .py file), the following error occurs:

```
  File "/data/code/Yolov7-training/yolov7/trainer.py", line 132, in calculate_eval_batch_loss
    preds = self.model.postprocess(fpn_heads_outputs, conf_thres=0.001)
  File "/home/xiaoxiong/anaconda3/envs/yolov7/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1207, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'DistributedDataParallel' object has no attribute 'postprocess'
```

How can I fix it? (I also asked in https://github.com/Chris-hughes10/pytorch-accelerated/issues/56.)

ross-Hr commented 1 year ago

I used `preds = self.model.module.postprocess(fpn_heads_outputs, conf_thres=0.001)` to work around the bug, and that fixed the AttributeError.

But another error occurs:

```
Traceback (most recent call last):
  File "/data/code/Yolov7-training/train_opiray.py", line 326, in main
    trainer.train(
  File "/home/xiaoxiong/anaconda3/envs/yolov7/lib/python3.9/site-packages/pytorch_accelerated/trainer.py", line 467, in train
    self._run_training()
  File "/home/xiaoxiong/anaconda3/envs/yolov7/lib/python3.9/site-packages/pytorch_accelerated/trainer.py", line 678, in _run_training
    self._run_eval_epoch(self._eval_dataloader)
  File "/home/xiaoxiong/anaconda3/envs/yolov7/lib/python3.9/site-packages/pytorch_accelerated/trainer.py", line 830, in _run_eval_epoch
    self.callback_handler.call_event(
  File "/home/xiaoxiong/anaconda3/envs/yolov7/lib/python3.9/site-packages/pytorch_accelerated/callbacks.py", line 217, in call_event
    getattr(callback, event)(
  File "/data/code/Yolov7-training/yolov7/evaluation/calculate_map_callback.py", line 79, in on_eval_epoch_end
    map_ = self.evaluator.compute(self.targets_json, predictions_json)
  File "/data/code/Yolov7-training/yolov7/evaluation/coco_evaluator.py", line 64, in compute
    coco_eval = self._build_coco_eval(targets_json, predictions_json)
  File "/data/code/Yolov7-training/yolov7/evaluation/coco_evaluator.py", line 185, in _build_coco_eval
    coco_preds = coco_targets.loadRes(preds)
  File "/home/xiaoxiong/anaconda3/envs/yolov7/lib/python3.9/site-packages/pycocotools/coco.py", line 327, in loadRes
    assert set(annsImgIds) == (set(annsImgIds) & set(self.getImgIds())), \
AssertionError: Results do not correspond to current coco set
```

My point is that having to go through `.module` can affect subsequent operations...

I guess that after the module is wrapped with DistributedDataParallel, its attributes (such as custom methods) become inaccessible. I'm trying to make some changes using the following code template, but I'm not sure whether it works:

(screenshot: proposed code template)
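
For reference, a minimal sketch of what I have in mind (`unwrap_model` is a hypothetical helper of my own, not part of the repo):

```python
import torch.nn as nn

def unwrap_model(model: nn.Module) -> nn.Module:
    """Return the underlying module if `model` is wrapped by
    (Distributed)DataParallel, otherwise return it unchanged."""
    if isinstance(model, (nn.DataParallel, nn.parallel.DistributedDataParallel)):
        return model.module
    return model

# calculate_eval_batch_loss could then call, e.g.:
# preds = unwrap_model(self.model).postprocess(fpn_heads_outputs, conf_thres=0.001)
```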

Chris-hughes10 commented 1 year ago

This looks like a bug in the trainer when using multiple GPUs. I have pushed a fix for it now; can you check whether it resolves things?

ross-Hr commented 1 year ago

This is the new error with `self.get_model().postprocess(...)`:

```
Traceback (most recent call last):
  File "/home/xiaoxiong/anaconda3/envs/yolov7/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/home/xiaoxiong/anaconda3/envs/yolov7/lib/python3.9/site-packages/accelerate/utils/launch.py", line 72, in __call__
    self.launcher(*args)
  File "/home/xiaoxiong/anaconda3/envs/yolov7/lib/python3.9/site-packages/func_to_script/core.py", line 108, in scripted_function
    return func(**args)
  File "/data/code/Yolov7-training/train_opiray.py", line 327, in main
    trainer.train(
  File "/home/xiaoxiong/anaconda3/envs/yolov7/lib/python3.9/site-packages/pytorch_accelerated/trainer.py", line 467, in train
    self._run_training()
  File "/home/xiaoxiong/anaconda3/envs/yolov7/lib/python3.9/site-packages/pytorch_accelerated/trainer.py", line 678, in _run_training
    self._run_eval_epoch(self._eval_dataloader)
  File "/home/xiaoxiong/anaconda3/envs/yolov7/lib/python3.9/site-packages/pytorch_accelerated/trainer.py", line 830, in _run_eval_epoch
    self.callback_handler.call_event(
  File "/home/xiaoxiong/anaconda3/envs/yolov7/lib/python3.9/site-packages/pytorch_accelerated/callbacks.py", line 217, in call_event
    getattr(callback, event)(
  File "/data/code/Yolov7-training/yolov7/evaluation/calculate_map_callback.py", line 79, in on_eval_epoch_end
    map_ = self.evaluator.compute(self.targets_json, predictions_json)
  File "/data/code/Yolov7-training/yolov7/evaluation/coco_evaluator.py", line 64, in compute
    coco_eval = self._build_coco_eval(targets_json, predictions_json)
  File "/data/code/Yolov7-training/yolov7/evaluation/coco_evaluator.py", line 185, in _build_coco_eval
    coco_preds = coco_targets.loadRes(preds)
  File "/home/xiaoxiong/anaconda3/envs/yolov7/lib/python3.9/site-packages/pycocotools/coco.py", line 327, in loadRes
    assert set(annsImgIds) == (set(annsImgIds) & set(self.getImgIds())), \
AssertionError: Results do not correspond to current coco set

Process finished with exit code 1
```

My suggestion is to merge postprocess into the forward function, like:

(screenshots: proposed changes to the model's forward method)
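
Roughly along these lines (a sketch only; `compute_fpn_outputs` is a stand-in name for the existing forward logic, and the flag names are illustrative):

```python
def forward(self, x, postprocess=False, conf_thres=0.001):
    # Existing forward logic (compute_fpn_outputs is a stand-in name).
    fpn_heads_outputs = self.compute_fpn_outputs(x)
    if postprocess:
        # Because this path goes through forward, the DistributedDataParallel
        # wrapper (which only dispatches forward) can reach it.
        return self.postprocess(fpn_heads_outputs, conf_thres=conf_thres)
    return fpn_heads_outputs
```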

So, in trainer.py, the call would look like:

(screenshot: updated call in trainer.py)
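
Something like this (sketch, using the hypothetical flag from above):

```python
# Training step: plain forward pass, unchanged.
fpn_heads_outputs = self.model(images)

# Eval step: forward pass plus postprocessing in a single call.
preds = self.model(images, postprocess=True, conf_thres=0.001)
```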

Chris-hughes10 commented 1 year ago

The error you are getting occurs when there is an inconsistency between the image ids in the prediction set and the ground truth set; moving the postprocess function inside forward won't fix that. I will try to reproduce it and see if I can find what the issue is.
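
In the meantime, a quick way to see the mismatch pycocotools is asserting on (a sketch, assuming `targets_json` is a COCO-format dict and `predictions_json` is a list of prediction dicts, which is what `loadRes` expects):

```python
# Compare the image id sets directly before evaluation.
pred_ids = {p["image_id"] for p in predictions_json}
target_ids = {img["id"] for img in targets_json["images"]}
print("ids in predictions but missing from targets:", pred_ids - target_ids)
```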

Chris-hughes10 commented 1 year ago

Can you provide more detail on what you are trying to do? I am unable to reproduce your error running the train_cars.py script on 2 V100 GPUs.

ross-Hr commented 1 year ago

I conducted this experiment on a custom dataset (X-ray security images), mainly analyzing the relationships between the various layers of YOLOv7 (including the backbone, head, and feature maps).

In addition to using my own dataset, I also applied some data augmentation operations, but these should not affect how the framework operates.

I am using a multi-GPU environment (2x RTX 3090). The initial error occurred on the very first run. I suspect that once the model is wrapped for distributed execution, no entry points are left for functions other than forward.
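
A minimal illustration of what I mean (sketch; actually running the wrapped calls would require an initialised process group):

```python
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

class Toy(nn.Module):
    def forward(self, x):
        return x

    def postprocess(self, x):
        return x * 2

# With a process group initialised:
# model = DDP(Toy().cuda())
# model(x)                     # works: DDP dispatches __call__ to Toy.forward
# model.postprocess(x)         # AttributeError: DDP does not proxy attributes
# model.module.postprocess(x)  # works: unwrap first
```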

I ran the command (`torchrun --nproc_per_node=2 train_cars.py`) directly in a shell environment, and the first error occurred. I moved postprocess into forward and it works fine!

> The error you are getting occurs when there is an inconsistency between the image ids in the prediction set and the ground truth set; moving the postprocess function inside forward won't fix that. I will try to reproduce it and see if I can find what the issue is.

Intuitively (without having checked the source code), I think this error is due to a fresh copy of the model being opened on CUDA after a call to `self.get_model()`, instead of the current model whose weights were just computed.

My torch version is 1.12.1+cu113.

Chris-hughes10 commented 1 year ago

Ah, OK, you are using a custom dataset. I have a couple of thoughts:

As I am unable to replicate the issue on the dataset that the example is designed for, I don't think this is a code issue, so I am going to close it. Hopefully you get to the bottom of it.