Closed ross-Hr closed 1 year ago
I used preds = self.model.module.postprocess(fpn_heads_outputs, conf_thres=0.001) to work around the bug, and that fixed it.
But another error occurs:
File "/data/code/Yolov7-training/train_opiray.py", line 326, in main
trainer.train(
File "/home/xiaoxiong/anaconda3/envs/yolov7/lib/python3.9/site-packages/pytorch_accelerated/trainer.py", line 467, in train
self._run_training()
File "/home/xiaoxiong/anaconda3/envs/yolov7/lib/python3.9/site-packages/pytorch_accelerated/trainer.py", line 678, in _run_training
self._run_eval_epoch(self._eval_dataloader)
File "/home/xiaoxiong/anaconda3/envs/yolov7/lib/python3.9/site-packages/pytorch_accelerated/trainer.py", line 830, in _run_eval_epoch
self.callback_handler.call_event(
File "/home/xiaoxiong/anaconda3/envs/yolov7/lib/python3.9/site-packages/pytorch_accelerated/callbacks.py", line 217, in call_event
getattr(callback, event)(
File "/data/code/Yolov7-training/yolov7/evaluation/calculate_map_callback.py", line 79, in on_eval_epoch_end
map_ = self.evaluator.compute(self.targets_json, predictions_json)
File "/data/code/Yolov7-training/yolov7/evaluation/coco_evaluator.py", line 64, in compute
coco_eval = self._build_coco_eval(targets_json, predictions_json)
File "/data/code/Yolov7-training/yolov7/evaluation/coco_evaluator.py", line 185, in _build_coco_eval
coco_preds = coco_targets.loadRes(preds)
File "/home/xiaoxiong/anaconda3/envs/yolov7/lib/python3.9/site-packages/pycocotools/coco.py", line 327, in loadRes
assert set(annsImgIds) == (set(annsImgIds) & set(self.getImgIds())),
AssertionError: Results do not correspond to current coco set
I mean the module can affect subsequent operations...
I guess after wrapping the module with DataParallel, its properties (such as custom methods) become inaccessible. I'm trying to make some changes using the following code template, but I'm not sure if it works:
This looks like a bug in the trainer when using multiple GPUs. I have pushed a fix for this now; can you check whether it fixes things?
This is the new error with self.get_model().post...
Traceback (most recent call last):
File "/home/xiaoxiong/anaconda3/envs/yolov7/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
fn(i, *args)
File "/home/xiaoxiong/anaconda3/envs/yolov7/lib/python3.9/site-packages/accelerate/utils/launch.py", line 72, in __call__
self.launcher(*args)
File "/home/xiaoxiong/anaconda3/envs/yolov7/lib/python3.9/site-packages/func_to_script/core.py", line 108, in scripted_function
return func(**args)
File "/data/code/Yolov7-training/train_opiray.py", line 327, in main
trainer.train(
File "/home/xiaoxiong/anaconda3/envs/yolov7/lib/python3.9/site-packages/pytorch_accelerated/trainer.py", line 467, in train
self._run_training()
File "/home/xiaoxiong/anaconda3/envs/yolov7/lib/python3.9/site-packages/pytorch_accelerated/trainer.py", line 678, in _run_training
self._run_eval_epoch(self._eval_dataloader)
File "/home/xiaoxiong/anaconda3/envs/yolov7/lib/python3.9/site-packages/pytorch_accelerated/trainer.py", line 830, in _run_eval_epoch
self.callback_handler.call_event(
File "/home/xiaoxiong/anaconda3/envs/yolov7/lib/python3.9/site-packages/pytorch_accelerated/callbacks.py", line 217, in call_event
getattr(callback, event)(
File "/data/code/Yolov7-training/yolov7/evaluation/calculate_map_callback.py", line 79, in on_eval_epoch_end
map_ = self.evaluator.compute(self.targets_json, predictions_json)
File "/data/code/Yolov7-training/yolov7/evaluation/coco_evaluator.py", line 64, in compute
coco_eval = self._build_coco_eval(targets_json, predictions_json)
File "/data/code/Yolov7-training/yolov7/evaluation/coco_evaluator.py", line 185, in _build_coco_eval
coco_preds = coco_targets.loadRes(preds)
File "/home/xiaoxiong/anaconda3/envs/yolov7/lib/python3.9/site-packages/pycocotools/coco.py", line 327, in loadRes
assert set(annsImgIds) == (set(annsImgIds) & set(self.getImgIds())), \
AssertionError: Results do not correspond to current coco set
Process finished with exit code 1
My suggestion is to merge postprocess with the forward function, like:
So, in trainer.py, the call looks like:
The error you are getting is when there is an inconsistency between the image ids in the prediction set and the ground truth set, moving the postprocess function inside forward won't fix that. I will try to reproduce and see if I can find what the issue is
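The mismatch described above can be checked before `loadRes` is called. The following is only a minimal sketch: it assumes the targets follow the COCO layout (an `"images"` list whose entries carry an `"id"`) and that the predictions are a list of dicts with an `"image_id"` key. The function name `find_unmatched_image_ids` and the sample data are hypothetical and not part of the repository.

```python
def find_unmatched_image_ids(targets, predictions):
    """Return prediction image ids that are absent from the ground-truth set.

    Assumes COCO-style inputs: targets["images"] is a list of {"id": ...}
    dicts and each prediction dict has an "image_id" key.
    """
    target_ids = {image["id"] for image in targets["images"]}
    prediction_ids = {pred["image_id"] for pred in predictions}
    # pycocotools' loadRes asserts that this difference is empty.
    return sorted(prediction_ids - target_ids)


# Hypothetical sample data: image id 2 exists only in the predictions.
targets = {"images": [{"id": 0}, {"id": 1}]}
predictions = [
    {"image_id": 0, "category_id": 1, "bbox": [0, 0, 10, 10], "score": 0.9},
    {"image_id": 2, "category_id": 1, "bbox": [5, 5, 10, 10], "score": 0.8},
]

unmatched = find_unmatched_image_ids(targets, predictions)
print(unmatched)
```

Any id reported here would trigger "Results do not correspond to current coco set", so printing the list just before the evaluator runs may help narrow down where the ids diverge.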
Can you provide more detail on what you are trying to do? I am unable to reproduce your error running the train_cars.py script on 2 V100 GPUs.
I conducted this experiment using a custom dataset (X-ray security), mainly to analyze the relationship between various layers of YOLOv7 (including the backbone, head, and feature maps).
In addition to using my own dataset, I also applied some data augmentation, but that does not affect how the framework operates.
I am using a multi-GPU (2 × RTX 3090) environment. The initial error occurred during the first run. I suspect that at CUDA runtime no entry points remain for functions other than the forward function.
I ran the command (torchrun --nproc_per_node=2 train_cars.py) directly in a shell, and the first error occurred.
I loaded postprocess into forward and it works fine!
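As an illustration of this approach, here is a minimal sketch of post-processing folded into `forward` behind a keyword flag. `TinyDetector`, the `postprocess` flag, and the toy thresholding are hypothetical stand-ins, not the repository's model. The idea relies on parallel wrappers dispatching calls to the wrapped module's `forward`, so a keyword argument passed to the wrapper should reach it. This only addresses the missing-attribute problem, not the image id mismatch raised by pycocotools.

```python
import torch
from torch import nn


class TinyDetector(nn.Module):
    """Hypothetical stand-in for the detection model discussed in the thread."""

    def __init__(self):
        super().__init__()
        self.head = nn.Linear(4, 2)
        # Fixed weights so the example is deterministic: every output is 2.0.
        nn.init.constant_(self.head.weight, 0.5)
        nn.init.zeros_(self.head.bias)

    def _postprocess(self, outputs, conf_thres):
        # Toy post-processing: keep rows whose best score exceeds the threshold.
        keep = outputs.max(dim=1).values > conf_thres
        return outputs[keep]

    def forward(self, x, postprocess=False, conf_thres=0.001):
        outputs = self.head(x)
        if postprocess:
            return self._postprocess(outputs, conf_thres)
        return outputs


model = TinyDetector()
with torch.no_grad():
    raw = model(torch.ones(3, 4))                        # training-style call
    preds = model(torch.ones(3, 4), postprocess=True)    # eval-style call
    none_kept = model(torch.ones(3, 4), postprocess=True, conf_thres=10.0)
```

The trade-off is that `forward` now has two return shapes depending on the flag, which callers need to keep track of.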
> The error you are getting is when there is an inconsistency between the image ids in the prediction set and the ground truth set, moving the postprocess function inside forward won't fix that. I will try to reproduce and see if I can find what the issue is
I think (intuitively, and without checking the source code) that this error occurs because a call to self.get_model() opens a new copy of the model in CUDA, instead of using the current model for which the weights are being calculated.
My torch version is 1.12.1+cu113
Ah Ok, you are using a custom dataset. I have a couple of thoughts:
As I am unable to replicate the issue on the dataset the example is designed for, I don't think that this is a code issue, so I am going to close this issue. Hopefully you get to the bottom of it.
When I run the code with 2 GPUs (''torchrun --nproc_per_node=2 train_cars.py'' or ''notebook_launcher(main, num_processes=2)'' in a .py file), the following error occurs:
` File "/data/code/Yolov7-training/yolov7/trainer.py", line 132, in calculate_eval_batch_loss
preds = self.model.postprocess(fpn_heads_outputs, conf_thres=0.001)
File "/home/xiaoxiong/anaconda3/envs/yolov7/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1207, in __getattr__
raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'DistributedDataParallel' object has no attribute 'postprocess'
` How can I fix it? (I also asked in https://github.com/Chris-hughes10/pytorch-accelerated/issues/56)
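The AttributeError above comes from the parallel wrapper not forwarding custom methods of the model it wraps. The following is a minimal sketch of that behaviour and of reaching the method through `.module`. `TinyDetector` and the `unwrap` helper are hypothetical, and `nn.DataParallel` is used here only because it can be constructed without a process group; it stores the wrapped model under `.module` in the same way as `DistributedDataParallel`.

```python
import torch
from torch import nn


class TinyDetector(nn.Module):
    """Hypothetical stand-in; only the method name mirrors the thread."""

    def __init__(self):
        super().__init__()
        self.head = nn.Linear(4, 2)

    def forward(self, x):
        return self.head(x)

    def postprocess(self, outputs, conf_thres=0.001):
        # Toy post-processing: keep rows whose best score exceeds the threshold.
        keep = outputs.max(dim=1).values > conf_thres
        return outputs[keep]


def unwrap(model):
    """Follow `.module` until the underlying user-defined model is reached."""
    while isinstance(model, (nn.DataParallel, nn.parallel.DistributedDataParallel)):
        model = model.module
    return model


model = TinyDetector()
wrapped = nn.DataParallel(model)

# The wrapper does not expose the custom method ...
wrapper_has_postprocess = hasattr(wrapped, "postprocess")

# ... but the method is still reachable on the wrapped module.
fake_outputs = torch.tensor([[0.9, 0.1], [0.0, 0.0]])
preds = unwrap(wrapped).postprocess(fake_outputs, conf_thres=0.001)
```

Calling `unwrap(self.model).postprocess(...)` instead of `self.model.postprocess(...)` would work for both wrapped and unwrapped models, which is why a helper is used here rather than a hard-coded `.module` access.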