Tianxiaomo / pytorch-YOLOv4

PyTorch, ONNX and TensorRT implementation of YOLOv4
Apache License 2.0

Does anyone encounter the situation where the CPU can run but the GPU gets stuck in the first epoch? #311

Open EternalEvan opened 4 years ago

EternalEvan commented 4 years ago

Does anyone encounter the situation where the CPU can run but the GPU gets stuck in the first epoch? Results can be obtained when training with the CPU, but when I train my own data on the GPU, it gets stuck here. Can someone help me?

CPU:

    2020-10-29 23:42:12,062 train.py[line:611] INFO: Using device cpu
    2020-10-29 23:42:13,583 train.py[line:327] INFO: Starting training:
        Epochs:           5
        Batch size:       4
        Subdivisions:     1
        Learning rate:    0.001
        Training size:    21
        Validation size:  4
        Checkpoints:      True
        Device:           cpu
        Images size:      608
        Optimizer:        adam
        Dataset classes:  3
        Train label path: train.txt
        Pretrained:

    Epoch 1/5:  95%|▉| 20/21 [10:25<00:31, 31.65s/img]
    in function convert_to_coco_api...
    You could also create your own 'get_image_id' function.
    You could also create your own 'get_image_id' function.
    You could also create your own 'get_image_id' function.
    You could also create your own 'get_image_id' function.
    creating index...
    index created!
    You could also create your own 'get_image_id' function.
    You could also create your own 'get_image_id' function.
    You could also create your own 'get_image_id' function.
    You could also create your own 'get_image_id' function.
    Accumulating evaluation results...
    DONE (t=0.13s).
    IoU metric: bbox
     Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] =  0.000
     Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] =  0.000
     Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] =  0.000
     Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
     Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = -1.000
     Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] =  0.000
     Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] =  0.000
     Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] =  0.000
     Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] =  0.000
     Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
     Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = -1.000
     Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] =  0.000

GPU:

    2020-10-30 13:54:17,456 train.py[line:611] INFO: Using device cuda
    2020-10-30 13:54:20,094 train.py[line:327] INFO: Starting training:
        Epochs:           5
        Batch size:       4
        Subdivisions:     1
        Learning rate:    0.001
        Training size:    21
        Validation size:  4
        Checkpoints:      True
        Device:           cuda
        Images size:      608
        Optimizer:        adam
        Dataset classes:  3
        Train label path: train.txt
        Pretrained:

    Epoch 1/5:  95%|▉| 20/21 [00:17<00:01, 1.01s/img]
    in function convert_to_coco_api...
    You could also create your own 'get_image_id' function.
    You could also create your own 'get_image_id' function.
    You could also create your own 'get_image_id' function.
    You could also create your own 'get_image_id' function.
    creating index...
    index created!
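Aside on the repeated "You could also create your own 'get_image_id' function." lines in both logs: they come from the COCO-evaluation step, which needs a unique integer id for every image. A minimal sketch of a custom replacement, assuming filenames with a numeric stem such as `img_000123.jpg` (the filename pattern and parsing here are illustrative, not the repo's actual implementation):

```python
import os

def get_image_id(filename: str) -> int:
    # Derive a unique integer id from the digits in the file name.
    # Assumes names like 'img_000123.jpg'; adapt the parsing to your data,
    # and note that a name with no digits would raise ValueError here.
    stem = os.path.splitext(os.path.basename(filename))[0]
    return int("".join(ch for ch in stem if ch.isdigit()))
```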

uniyushu commented 3 years ago

I met the same problem. Did you find anything?

gytdau commented 3 years ago

This might be because your evaluation dataset is large. It appears that evaluation runs on the CPU, which would be one explanation for the slowness, though I'm not certain.
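If that is the cause, one way to check is to make sure the model and each batch are explicitly moved to the GPU during evaluation. A minimal sketch, assuming a standard `torch.nn.Module` and `DataLoader` (the names `model` and `val_loader` are placeholders, not this repo's exact code):

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

@torch.no_grad()
def evaluate(model, val_loader):
    # Run inference entirely on `device`; if batches stay on the CPU,
    # evaluation will be slow even though training uses the GPU.
    model.eval().to(device)
    results = []
    for images, _targets in val_loader:
        results.append(model(images.to(device)))
    return results
```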

swxu commented 3 years ago

Try setting the dataloader's num_workers=0:

    val_loader = DataLoader(val_dataset,
                            batch_size=config.batch // config.subdivisions,
                            shuffle=True,
                            num_workers=0,
                            pin_memory=True,
                            drop_last=True,
                            collate_fn=val_collate)

It seems like a bug in the PyTorch dataloader.
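For reference, a self-contained sketch of the same workaround; the dummy `TensorDataset` stands in for the repo's validation dataset, and only the `num_workers=0` argument is the point:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy stand-in for the repo's validation dataset (608x608 images).
val_dataset = TensorDataset(torch.randn(8, 3, 608, 608), torch.zeros(8, 5))

# num_workers=0 loads data in the main process, avoiding the
# worker-process hang some PyTorch versions exhibit.
val_loader = DataLoader(val_dataset, batch_size=4, shuffle=True,
                        num_workers=0, pin_memory=True, drop_last=True)

for images, labels in val_loader:
    print(images.shape)  # torch.Size([4, 3, 608, 608])
```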

asebaq commented 3 years ago

You need to change the PyTorch version. I changed it to 1.5.0, and train.py ran successfully on the GPU.
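After downgrading (e.g. `pip install torch==1.5.0`), a quick sanity check that the intended build is active and still sees the GPU:

```python
import torch

print(torch.__version__)          # expect '1.5.0' after the downgrade
print(torch.cuda.is_available())  # True if this build can use the GPU
```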

jcmayoral commented 2 years ago

@swxu @asebaq I was going crazy debugging. Thanks.