QAT program killed every time

11061995 commented 4 months ago

Whenever I run this command:

python scripts/qat.py quantize yolov5s.pt --ptq=ptq.pt --qat=qat.pt --cocodir=datasets/coco --eval-ptq --eval-origin --all-node-with-qdq

the program gets killed after 5 epochs. System configuration: CUDA 12.2, Python 3.10, torch 2.3.

liuanqi-libra7 commented 1 month ago

Is there any error log when the program is killed?

11061995 commented 3 weeks ago

Namespace(cmd='quantize', weight='yolov5s.pt', cocodir='datasets/coco', device='cuda:0', ignore_policy='None', ptq='ptq.pt', qat='qat.pt', supervision_stride=1, iters=200, eval_origin=True, eval_ptq=True, all_node_with_qdq=True)

                 from  n    params  module                                  arguments
  0                -1  1      3520  models.common.Conv                      [3, 32, 6, 2, 2]
  1                -1  1     18560  models.common.Conv                      [32, 64, 3, 2]
  2                -1  1     18816  models.common.C3                        [64, 64, 1]
  3                -1  1     73984  models.common.Conv                      [64, 128, 3, 2]
  4                -1  2    115712  models.common.C3                        [128, 128, 2]
  5                -1  1    295424  models.common.Conv                      [128, 256, 3, 2]
  6                -1  3    625152  models.common.C3                        [256, 256, 3]
  7                -1  1   1180672  models.common.Conv                      [256, 512, 3, 2]
  8                -1  1   1182720  models.common.C3                        [512, 512, 1]
  9                -1  1    656896  models.common.SPPF                      [512, 512, 5]
 10                -1  1    131584  models.common.Conv                      [512, 256, 1, 1]
 11                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
 12           [-1, 6]  1         0  models.common.Concat                    [1]
 13                -1  1    361984  models.common.C3                        [512, 256, 1, False]
 14                -1  1     33024  models.common.Conv                      [256, 128, 1, 1]
 15                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
 16           [-1, 4]  1         0  models.common.Concat                    [1]
 17                -1  1     90880  models.common.C3                        [256, 128, 1, False]
 18                -1  1    147712  models.common.Conv                      [128, 128, 3, 2]
 19          [-1, 14]  1         0  models.common.Concat                    [1]
 20                -1  1    296448  models.common.C3                        [256, 256, 1, False]
 21                -1  1    590336  models.common.Conv                      [256, 256, 3, 2]
 22          [-1, 10]  1         0  models.common.Concat                    [1]
 23                -1  1   1182720  models.common.C3                        [512, 512, 1, False]
 24      [17, 20, 23]  1    229245  models.yolo.Detect                      [80, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [128, 256, 512]]
YOLOv5s summary: 223 layers, 7235389 parameters, 7235389 gradients, 16.6 GFLOPs

Fusing layers...
YOLOv5s summary: 166 layers, 7225885 parameters, 229245 gradients, 16.4 GFLOPs
Scanning datasets/coco/train2017.cache... 117266 images, 1021 backgrounds, 0 corrupt: 100%|██████████| 118287/118287 00:00
WARNING ⚠️ datasets/coco/images/train2017/000000099844.jpg: 2 duplicate labels removed
WARNING ⚠️ datasets/coco/images/train2017/000000201706.jpg: 1 duplicate labels removed
WARNING ⚠️ datasets/coco/images/train2017/000000214087.jpg: 1 duplicate labels removed
WARNING ⚠️ datasets/coco/images/train2017/000000522365.jpg: 1 duplicate labels removed
Scanning datasets/coco/val2017.cache... 4952 images, 48 backgrounds, 0 corrupt: 100%|██████████| 5000/5000 00:00
Add QuantAdd to model.2.m.0
Add QuantAdd to model.4.m.0
Add QuantAdd to model.4.m.1
Add QuantAdd to model.6.m.0
Add QuantAdd to model.6.m.1
Add QuantAdd to model.6.m.2
Add QuantAdd to model.8.m.0
Collect stats for calibrating: 100%|████████████████████████████████████████████████████████████████| 25/25 [01:29<00:00, 3.58s/it]
Evaluate Origin...
Class Images Instances P R mAP50 mAP50-95: 100%|██████████| 500/500 06:09
all 5000 36335 0.664 0.516 0.562 0.372

Evaluating pycocotools mAP... saving _predictions.json...
loading annotations into memory...
Done (t=1.19s)
creating index...
index created!
Loading and preparing results...
Killed

This is all I got.

NVIDIA-AI-IOT / cuDLA-samples

QAT program killed every time #38