Tianxiaomo / pytorch-YOLOv4

PyTorch, ONNX and TensorRT implementation of YOLOv4
Apache License 2.0

Does anyone encounter the situation where the CPU can run but the GPU gets stuck in the first epoch? #311

Open EternalEvan opened 4 years ago

EternalEvan commented 4 years ago

Does anyone encounter the situation where the CPU can run but the GPU gets stuck in the first epoch? Results can be obtained when training with the CPU, but when I train my own data on the GPU, it gets stuck here. Can someone help me?

CPU:

    2020-10-29 23:42:12,062 train.py[line:611] INFO: Using device cpu
    2020-10-29 23:42:13,583 train.py[line:327] INFO: Starting training:
        Epochs:           5
        Batch size:       4
        Subdivisions:     1
        Learning rate:    0.001
        Training size:    21
        Validation size:  4
        Checkpoints:      True
        Device:           cpu
        Images size:      608
        Optimizer:        adam
        Dataset classes:  3
        Train label path: train.txt
        Pretrained:

    Epoch 1/5:  95%|▉| 20/21 [10:25<00:31, 31.65s/img]
    in function convert_to_coco_api...
    You could also create your own 'get_image_id' function.
    You could also create your own 'get_image_id' function.
    You could also create your own 'get_image_id' function.
    You could also create your own 'get_image_id' function.
    creating index...
    index created!
    You could also create your own 'get_image_id' function.
    You could also create your own 'get_image_id' function.
    You could also create your own 'get_image_id' function.
    You could also create your own 'get_image_id' function.
    Accumulating evaluation results...
    DONE (t=0.13s).
    IoU metric: bbox
     Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] =  0.000
     Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] =  0.000
     Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] =  0.000
     Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
     Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = -1.000
     Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] =  0.000
     Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] =  0.000
     Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] =  0.000
     Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] =  0.000
     Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
     Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = -1.000
     Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] =  0.000

GPU:

    2020-10-30 13:54:17,456 train.py[line:611] INFO: Using device cuda
    2020-10-30 13:54:20,094 train.py[line:327] INFO: Starting training:
        Epochs:           5
        Batch size:       4
        Subdivisions:     1
        Learning rate:    0.001
        Training size:    21
        Validation size:  4
        Checkpoints:      True
        Device:           cuda
        Images size:      608
        Optimizer:        adam
        Dataset classes:  3
        Train label path: train.txt
        Pretrained:

    Epoch 1/5:  95%|▉| 20/21 [00:17<00:01, 1.01s/img]
    in function convert_to_coco_api...
    You could also create your own 'get_image_id' function.
    You could also create your own 'get_image_id' function.
    You could also create your own 'get_image_id' function.
    You could also create your own 'get_image_id' function.
    creating index...
    index created!
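Aside on the repeated "You could also create your own 'get_image_id' function." lines in both logs: they come from the COCO-evaluation step, which needs a unique integer id for every image. A minimal sketch of a custom replacement, assuming filenames with a numeric stem such as `img_000123.jpg` (the filename pattern and parsing here are illustrative, not the repo's actual implementation):

```python
import os

def get_image_id(filename: str) -> int:
    # Derive a unique integer id from the digits in the file name.
    # Assumes names like 'img_000123.jpg'; adapt the parsing to your data,
    # and note that a name with no digits would raise ValueError here.
    stem = os.path.splitext(os.path.basename(filename))[0]
    return int("".join(ch for ch in stem if ch.isdigit()))
```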

uniyushu commented 3 years ago

I met the same problem. Did you find anything?

gytdau commented 3 years ago

This might be because your evaluation dataset is large. It appears that evaluation runs on the CPU, which would be one explanation for the slowness, though I'm not certain.
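If that is the cause, one way to check is to make sure the model and each batch are explicitly moved to the GPU during evaluation. A minimal sketch, assuming a standard `torch.nn.Module` and `DataLoader` (the names `model` and `val_loader` are placeholders, not this repo's exact code):

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

@torch.no_grad()
def evaluate(model, val_loader):
    # Run inference entirely on `device`; if batches stay on the CPU,
    # evaluation will be slow even though training uses the GPU.
    model.eval().to(device)
    results = []
    for images, _targets in val_loader:
        results.append(model(images.to(device)))
    return results
```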

swxu commented 3 years ago

Try setting the dataloader's num_workers=0:

    val_loader = DataLoader(val_dataset,
                            batch_size=config.batch // config.subdivisions,
                            shuffle=True,
                            num_workers=0,
                            pin_memory=True,
                            drop_last=True,
                            collate_fn=val_collate)

It seems like a bug in the PyTorch dataloader.
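For reference, a self-contained sketch of the same workaround; the dummy `TensorDataset` stands in for the repo's validation dataset, and only the `num_workers=0` argument is the point:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy stand-in for the repo's validation dataset (608x608 images).
val_dataset = TensorDataset(torch.randn(8, 3, 608, 608), torch.zeros(8, 5))

# num_workers=0 loads data in the main process, avoiding the
# worker-process hang some PyTorch versions exhibit.
val_loader = DataLoader(val_dataset, batch_size=4, shuffle=True,
                        num_workers=0, pin_memory=True, drop_last=True)

for images, labels in val_loader:
    print(images.shape)  # torch.Size([4, 3, 608, 608])
```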

asebaq commented 3 years ago

You need to change the PyTorch version. I changed it to 1.5.0, and train.py ran successfully on the GPU.
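After downgrading (e.g. `pip install torch==1.5.0`), a quick sanity check that the intended build is active and still sees the GPU:

```python
import torch

print(torch.__version__)          # expect '1.5.0' after the downgrade
print(torch.cuda.is_available())  # True if this build can use the GPU
```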

jcmayoral commented 2 years ago

@swxu @asebaq I was going crazy debugging. Thanks.