VainF / Torch-Pruning

[CVPR 2023] Towards Any Structural Pruning; LLMs / SAM / Diffusion / Transformers / YOLOv8 / CNNs
https://arxiv.org/abs/2301.12900
MIT License

yolov8-pose cuda error #308

Open MrJoratos opened 9 months ago

MrJoratos commented 9 months ago

The error is as follows:

albumentations: Blur(p=0.01, blur_limit=(3, 7)), MedianBlur(p=0.01, blur_limit=(3, 7)), ToGray(p=0.01), CLAHE(p=0.01, clip_limit=(1, 4.0), tile_grid_size=(8, 8))
val: Scanning /media/hitcrt/6a071232-a52f-4f53-89ca-fdde738abfd8/assignment10_19/data_original/4kp_data/labeled/rgb/南航.cache... 704 images, 1
Plotting labels to runs/pose/step_0_finetune11/labels.jpg...
optimizer: AdamW(lr=0.000476, momentum=0.9) with parameter groups 63 weight(decay=0.0), 83 weight(decay=0.0005), 82 bias(decay=0.0)
Image sizes 928 train, 928 val
Using 8 dataloader workers
Logging results to runs/pose/step_0_finetune11
Starting training for 10 epochs...
Closing dataloader mosaic
albumentations: Blur(p=0.01, blur_limit=(3, 7)), MedianBlur(p=0.01, blur_limit=(3, 7)), ToGray(p=0.01), CLAHE(p=0.01, clip_limit=(1, 4.0), tile_grid_size=(8, 8))

  Epoch    GPU_mem   box_loss  pose_loss  kobj_loss   cls_loss   dfl_loss  Instances       Size

0%| | 0/1329 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "torch-Pruning.py", line 403, in <module>
    prune(args)
  File "torch-Pruning.py", line 359, in prune
    model.train_v2(pruning=True, **pruning_cfg)
  File "torch-Pruning.py", line 267, in train_v2
    self.trainer.train()
  File "/media/hitcrt/6a071232-a52f-4f53-89ca-fdde738abfd8/ultralytics-8.0.132/ultralytics-8.0.132/ultralytics/yolo/engine/trainer.py", line 192, in train
    self._do_train(world_size)
  File "/media/hitcrt/6a071232-a52f-4f53-89ca-fdde738abfd8/ultralytics-8.0.132/ultralytics-8.0.132/ultralytics/yolo/engine/trainer.py", line 332, in _do_train
    self.loss, self.loss_items = self.model(batch)
  File "/home/hitcrt/anaconda3/envs/py381/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/media/hitcrt/6a071232-a52f-4f53-89ca-fdde738abfd8/ultralytics-8.0.132/ultralytics-8.0.132/ultralytics/nn/tasks.py", line 44, in forward
    return self.loss(x, *args, **kwargs)
  File "/media/hitcrt/6a071232-a52f-4f53-89ca-fdde738abfd8/ultralytics-8.0.132/ultralytics-8.0.132/ultralytics/nn/tasks.py", line 215, in loss
    return self.criterion(preds, batch)
  File "/media/hitcrt/6a071232-a52f-4f53-89ca-fdde738abfd8/ultralytics-8.0.132/ultralytics-8.0.132/ultralytics/utils/loss.py", line 335, in __call__
    pred_bboxes = self.bbox_decode(anchor_points, pred_distri)  # xyxy, (b, h*w, 4)
  File "/media/hitcrt/6a071232-a52f-4f53-89ca-fdde738abfd8/ultralytics-8.0.132/ultralytics-8.0.132/ultralytics/utils/loss.py", line 150, in bbox_decode
    pred_dist = pred_dist.view(b, a, 4, c // 4).softmax(3).matmul(self.proj.type(pred_dist.dtype))
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat2 in method wrapper_mm)
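The failing line is the matmul in bbox_decode: pred_dist lives on cuda:0 while self.proj is still a CPU tensor, and .type() only casts the dtype, it never moves a tensor across devices. A minimal sketch of that mismatch with standalone toy tensors (not the ultralytics code itself; shapes and names are illustrative, and the error only reproduces when a CUDA device is available):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"  # mismatch only shows up with a GPU

pred_dist = torch.randn(2, 8400, 4, 16, device=device)   # DFL logits, e.g. on cuda:0
proj = torch.arange(16, dtype=torch.float)               # accidentally left on the CPU

try:
    # .type() only changes dtype; proj stays on the CPU, so the matmul raises
    pred_dist.softmax(3).matmul(proj.type(pred_dist.dtype))
except RuntimeError as err:
    print(err)  # Expected all tensors to be on the same device ...

# Moving proj onto pred_dist's device (and dtype) makes the same matmul work:
decoded = pred_dist.softmax(3).matmul(proj.to(pred_dist.device, pred_dist.dtype))
print(decoded.shape)  # torch.Size([2, 8400, 4])
```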

MrJoratos commented 9 months ago

And the YOLOv8 code I am using is an older version (ultralytics 8.0.132, as shown in the traceback above).

MrJoratos commented 9 months ago

Validation runs normally for the first two steps (even though no GPU memory is occupied by the Python process), but as soon as training starts, this error is raised.
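A quick diagnostic sketch (not from the thread) that can help narrow this down: after pruning and before training, list every parameter and buffer that is still on the CPU. Note that plain tensor attributes, such as the loss criterion's proj, are neither parameters nor buffers, so if this prints nothing, the stray tensor most likely lives on a loss criterion that was built while the model was still on the CPU. The names report_cpu_tensors and model.model below are assumptions for illustration.

```python
import torch

def report_cpu_tensors(module: torch.nn.Module) -> None:
    """Print every parameter and buffer that did not make it onto the GPU."""
    for name, p in module.named_parameters():
        if p.device.type != "cuda":
            print(f"parameter on {p.device}: {name}")
    for name, b in module.named_buffers():
        if b.device.type != "cuda":
            print(f"buffer on {b.device}: {name}")

# Hypothetical usage right before self.trainer.train() in train_v2:
# report_cpu_tensors(model.model)
```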

J0eky commented 9 months ago

@MrJoratos Hi, have you solved the problem?

Reaidu commented 8 months ago

I also encountered this problem. My task is detection: when I use the official yolov8n model and dataset, I can prune and post-train normally, but when I use my own trained model and dataset, pruning itself works, yet post-training fails with the error: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! I don't understand the logic behind this.
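One workaround that is consistent with the traceback, assuming ultralytics 8.0.132 builds the loss criterion lazily and pins self.proj to whatever device the model happened to be on at that moment: move the pruned model to the GPU and drop any cached criterion before training, so it gets rebuilt on the right device. This is a hedged sketch, not a confirmed fix; the checkpoint path, the data yaml, and the criterion attribute are assumptions that may differ in other versions, and the thread actually calls the pruning script's train_v2 rather than model.train, but the same idea would apply there.

```python
import torch
from ultralytics import YOLO

model = YOLO("pruned_yolov8n-pose.pt")   # hypothetical pruned checkpoint

device = "cuda:0" if torch.cuda.is_available() else "cpu"
model.model.to(device)                   # make sure every pruned weight is on the GPU

# If a loss criterion was already built while the model sat on the CPU, drop it so
# ultralytics re-creates it (and its proj tensor) on the model's current device.
if hasattr(model.model, "criterion"):
    del model.model.criterion

# epochs and imgsz taken from the log above; data yaml is a placeholder
model.train(data="pose.yaml", epochs=10, imgsz=928, device=0)
```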