Chris-hughes10 / Yolov7-training

A clean, modular implementation of the Yolov7 model family, which uses the official pretrained weights, with utilities for training the model on custom (non-COCO) tasks.

AttributeError: 'DistributedDataParallel' object has no attribute 'postprocess' #18

Closed ross-Hr closed 1 year ago

ross-Hr commented 1 year ago

When I run the code on 2 GPUs (`torchrun --nproc_per_node=2 train_cars.py`, or `notebook_launcher(main, num_processes=2)` in a .py file), the following error occurs:

```
  File "/data/code/Yolov7-training/yolov7/trainer.py", line 132, in calculate_eval_batch_loss
    preds = self.model.postprocess(fpn_heads_outputs, conf_thres=0.001)
  File "/home/xiaoxiong/anaconda3/envs/yolov7/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1207, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'DistributedDataParallel' object has no attribute 'postprocess'
```

How can I fix it? (I also asked in https://github.com/Chris-hughes10/pytorch-accelerated/issues/56.)

ross-Hr commented 1 year ago

I used `preds = self.model.module.postprocess(fpn_heads_outputs, conf_thres=0.001)` to work around the bug, and that fixed the AttributeError.

But another error occurs:

```
Traceback (most recent call last):
  File "/data/code/Yolov7-training/train_opiray.py", line 326, in main
    trainer.train(
  File "/home/xiaoxiong/anaconda3/envs/yolov7/lib/python3.9/site-packages/pytorch_accelerated/trainer.py", line 467, in train
    self._run_training()
  File "/home/xiaoxiong/anaconda3/envs/yolov7/lib/python3.9/site-packages/pytorch_accelerated/trainer.py", line 678, in _run_training
    self._run_eval_epoch(self._eval_dataloader)
  File "/home/xiaoxiong/anaconda3/envs/yolov7/lib/python3.9/site-packages/pytorch_accelerated/trainer.py", line 830, in _run_eval_epoch
    self.callback_handler.call_event(
  File "/home/xiaoxiong/anaconda3/envs/yolov7/lib/python3.9/site-packages/pytorch_accelerated/callbacks.py", line 217, in call_event
    getattr(callback, event)(
  File "/data/code/Yolov7-training/yolov7/evaluation/calculate_map_callback.py", line 79, in on_eval_epoch_end
    map_ = self.evaluator.compute(self.targets_json, predictions_json)
  File "/data/code/Yolov7-training/yolov7/evaluation/coco_evaluator.py", line 64, in compute
    coco_eval = self._build_coco_eval(targets_json, predictions_json)
  File "/data/code/Yolov7-training/yolov7/evaluation/coco_evaluator.py", line 185, in _build_coco_eval
    coco_preds = coco_targets.loadRes(preds)
  File "/home/xiaoxiong/anaconda3/envs/yolov7/lib/python3.9/site-packages/pycocotools/coco.py", line 327, in loadRes
    assert set(annsImgIds) == (set(annsImgIds) & set(self.getImgIds())), \
AssertionError: Results do not correspond to current coco set
```

My point is that having to go through `.module` can affect subsequent operations...

I guess that after the module is wrapped with DistributedDataParallel, its attributes (such as custom methods) become inaccessible. I'm trying to make some changes using the following code template, but I'm not sure whether it works:

(screenshot: proposed code template)
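
For reference, a minimal sketch of what I have in mind (`unwrap_model` is a hypothetical helper of my own, not part of the repo):

```python
import torch.nn as nn

def unwrap_model(model: nn.Module) -> nn.Module:
    """Return the underlying module if `model` is wrapped by
    (Distributed)DataParallel, otherwise return it unchanged."""
    if isinstance(model, (nn.DataParallel, nn.parallel.DistributedDataParallel)):
        return model.module
    return model

# calculate_eval_batch_loss could then call, e.g.:
# preds = unwrap_model(self.model).postprocess(fpn_heads_outputs, conf_thres=0.001)
```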

Chris-hughes10 commented 1 year ago

This looks like a bug in the trainer when using multiple GPUs. I have pushed a fix for it now; can you check whether it resolves things?

ross-Hr commented 1 year ago

This is the new error with `self.get_model().postprocess(...)`:

```
Traceback (most recent call last):
  File "/home/xiaoxiong/anaconda3/envs/yolov7/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/home/xiaoxiong/anaconda3/envs/yolov7/lib/python3.9/site-packages/accelerate/utils/launch.py", line 72, in __call__
    self.launcher(*args)
  File "/home/xiaoxiong/anaconda3/envs/yolov7/lib/python3.9/site-packages/func_to_script/core.py", line 108, in scripted_function
    return func(**args)
  File "/data/code/Yolov7-training/train_opiray.py", line 327, in main
    trainer.train(
  File "/home/xiaoxiong/anaconda3/envs/yolov7/lib/python3.9/site-packages/pytorch_accelerated/trainer.py", line 467, in train
    self._run_training()
  File "/home/xiaoxiong/anaconda3/envs/yolov7/lib/python3.9/site-packages/pytorch_accelerated/trainer.py", line 678, in _run_training
    self._run_eval_epoch(self._eval_dataloader)
  File "/home/xiaoxiong/anaconda3/envs/yolov7/lib/python3.9/site-packages/pytorch_accelerated/trainer.py", line 830, in _run_eval_epoch
    self.callback_handler.call_event(
  File "/home/xiaoxiong/anaconda3/envs/yolov7/lib/python3.9/site-packages/pytorch_accelerated/callbacks.py", line 217, in call_event
    getattr(callback, event)(
  File "/data/code/Yolov7-training/yolov7/evaluation/calculate_map_callback.py", line 79, in on_eval_epoch_end
    map_ = self.evaluator.compute(self.targets_json, predictions_json)
  File "/data/code/Yolov7-training/yolov7/evaluation/coco_evaluator.py", line 64, in compute
    coco_eval = self._build_coco_eval(targets_json, predictions_json)
  File "/data/code/Yolov7-training/yolov7/evaluation/coco_evaluator.py", line 185, in _build_coco_eval
    coco_preds = coco_targets.loadRes(preds)
  File "/home/xiaoxiong/anaconda3/envs/yolov7/lib/python3.9/site-packages/pycocotools/coco.py", line 327, in loadRes
    assert set(annsImgIds) == (set(annsImgIds) & set(self.getImgIds())), \
AssertionError: Results do not correspond to current coco set

Process finished with exit code 1
```

My suggestion is to merge postprocess into the forward function, like:

(screenshots: proposed changes to the model's forward method)
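
Roughly along these lines (a sketch only; `compute_fpn_outputs` is a stand-in name for the existing forward logic, and the flag names are illustrative):

```python
def forward(self, x, postprocess=False, conf_thres=0.001):
    # Existing forward logic (compute_fpn_outputs is a stand-in name).
    fpn_heads_outputs = self.compute_fpn_outputs(x)
    if postprocess:
        # Because this path goes through forward, the DistributedDataParallel
        # wrapper (which only dispatches forward) can reach it.
        return self.postprocess(fpn_heads_outputs, conf_thres=conf_thres)
    return fpn_heads_outputs
```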

So, in trainer.py, the call would look like:

(screenshot: updated call in trainer.py)
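
Something like this (sketch, using the hypothetical flag from above):

```python
# Training step: plain forward pass, unchanged.
fpn_heads_outputs = self.model(images)

# Eval step: forward pass plus postprocessing in a single call.
preds = self.model(images, postprocess=True, conf_thres=0.001)
```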

Chris-hughes10 commented 1 year ago

The error you are getting occurs when there is an inconsistency between the image ids in the prediction set and the ground truth set; moving the postprocess function inside forward won't fix that. I will try to reproduce it and see if I can find what the issue is.
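
In the meantime, a quick way to see the mismatch pycocotools is asserting on (a sketch, assuming `targets_json` is a COCO-format dict and `predictions_json` is a list of prediction dicts, which is what `loadRes` expects):

```python
# Compare the image id sets directly before evaluation.
pred_ids = {p["image_id"] for p in predictions_json}
target_ids = {img["id"] for img in targets_json["images"]}
print("ids in predictions but missing from targets:", pred_ids - target_ids)
```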

Chris-hughes10 commented 1 year ago

Can you provide more detail on what you are trying to do? I am unable to reproduce your error running the train_cars.py script on 2 V100 GPUs.

ross-Hr commented 1 year ago

I conducted this experiment on a custom dataset (X-ray security images), mainly analyzing the relationships between the various layers of YOLOv7 (including the backbone, head, and feature maps).

In addition to using my own dataset, I also applied some data augmentation operations, but these should not affect how the framework operates.

I am using a multi-GPU environment (2x RTX 3090). The initial error occurred on the very first run. I suspect that once the model is wrapped for distributed execution, no entry points are left for functions other than forward.
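
A minimal illustration of what I mean (sketch; actually running the wrapped calls would require an initialised process group):

```python
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

class Toy(nn.Module):
    def forward(self, x):
        return x

    def postprocess(self, x):
        return x * 2

# With a process group initialised:
# model = DDP(Toy().cuda())
# model(x)                     # works: DDP dispatches __call__ to Toy.forward
# model.postprocess(x)         # AttributeError: DDP does not proxy attributes
# model.module.postprocess(x)  # works: unwrap first
```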

I ran the command (`torchrun --nproc_per_node=2 train_cars.py`) directly in a shell environment, and the first error occurred. I moved postprocess into forward and it works fine!

> The error you are getting occurs when there is an inconsistency between the image ids in the prediction set and the ground truth set; moving the postprocess function inside forward won't fix that. I will try to reproduce it and see if I can find what the issue is.

Intuitively (without having checked the source code), I think this error is due to a fresh copy of the model being opened on CUDA after a call to `self.get_model()`, instead of the current model whose weights were just computed.

My torch version is 1.12.1+cu113.

Chris-hughes10 commented 1 year ago

Ah, OK, you are using a custom dataset. I have a couple of thoughts:

As I am unable to replicate the issue on the dataset that the example is designed for, I don't think this is a code issue, so I am going to close it. Hopefully you get to the bottom of it.