SpursLipu / YOLOv3v4-ModelCompression-MultidatasetTraining-Multibackbone

YOLO ModelCompression MultidatasetTraining
GNU General Public License v3.0

test.py RuntimeError: CUDA error: device-side assert triggered #62

Open yangzhaojason opened 4 years ago

walzimmer commented 3 years ago

I am getting this error when training on the visdrone dataset:

python train.py --data data/visdrone.data --cfg cfg/yolov4/yolov4-visdrone.cfg --batch-size 1 --img-size 512 --epochs 300 --device 0 --weights ""

here is the full log:

Namespace(BN_Fold=False, FPGA=False, KDstr=-1, a_bit=8, adam=False, batch_size=1, bucket='', cache_images=False, cfg='./cfg/yolov4/yolov4-visdrone.cfg', data='data/visdrone.data', device='0', ema=False, epochs=300, evolve=False, fencemask=False, img_size=[512], multi_scale=False, name='', nosave=False, notest=False, prune=-1, pt=False, quantized=0, rect=False, resume=False, s=0.001, single_cls=False, sr=False, t_cfg='', t_weights='', w_bit=8, weights='')
Using CUDA device0 _CudaDeviceProperties(name='GeForce RTX 2070', total_memory=7982MB)

Start Tensorboard with "tensorboard --logdir=runs", view at http://localhost:6006/
Model Summary: 327 layers, 6.39862e+07 parameters, 6.39862e+07 gradients
Optimizer groups: 110 .bias, 110 Conv2d.weight, 107 other
Caching labels (6471 found, 0 missing, 0 empty, 5 duplicate, for 6471 images): 100%|██████████| 6471/6471 [00:01<00:00, 5131.84it/s]
Caching labels: 0%| | 0/548 [00:00<?, ?it/s]
single-gpu sparse
Caching labels (548 found, 0 missing, 0 empty, 0 duplicate, for 548 images): 100%|██████████| 548/548 [00:00<00:00, 4209.30it/s]
0%| | 0/6471 [00:00<?, ?it/s]
Image sizes 512 - 512 train, 512 test
Using 0 dataloader workers
Starting training for 300 epochs...

 Epoch   gpu_mem      GIoU       obj       cls     total   targets  img_size

0%| | 0/6471 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "train.py", line 1008, in <module>
    train(hyp) # train normally
  File "train.py", line 344, in train
    loss, loss_items = compute_loss(pred, targets, model)
  File "utils/utils.py", line 419, in compute_loss
    lobj += BCEobj(pi[..., 4], tobj) # obj loss
  File "lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "lib/python3.8/site-packages/torch/nn/modules/loss.py", line 629, in forward
    return F.binary_cross_entropy_with_logits(input, target,
  File "lib/python3.8/site-packages/torch/nn/functional.py", line 2582, in binary_cross_entropy_with_logits
    return torch.binary_cross_entropy_with_logits(input, target, weight, pos_weight, reduction_enum)
RuntimeError: CUDA error: device-side assert triggered
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:84: operator(): block: [0,0,0], thread: [3,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:84: operator(): block: [0,0,0], thread: [25,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.

Process finished with exit code 1
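A note on reading this log: CUDA kernels run asynchronously, so the Python frame shown in the traceback (the BCE objectness loss) is not necessarily the operation that failed. The "index out of bounds" assertion comes from an indexing kernel, and one frequent cause in detection training is a label whose class id is not smaller than the number of classes. Rerunning with the environment variable CUDA_LAUNCH_BLOCKING=1 usually makes the traceback point at the real call. The sketch below is one way to scan the labels for such ids. It assumes YOLO-style label files (one object per line, class id first) stored as .txt files in a single directory; the function name and the directory layout are illustrative, not part of this repository.

```python
from pathlib import Path


def find_bad_class_ids(label_dir, num_classes):
    """Return (file name, line number, class id) for every out-of-range label.

    Assumes YOLO-style text labels: "class x_center y_center width height".
    """
    bad = []
    for path in sorted(Path(label_dir).glob("*.txt")):
        lines = path.read_text().splitlines()
        for lineno, line in enumerate(lines, start=1):
            fields = line.split()
            if not fields:
                continue  # skip blank lines
            class_id = int(float(fields[0]))
            if not 0 <= class_id < num_classes:
                bad.append((path.name, lineno, class_id))
    return bad
```

An empty result for the training and validation label directories would suggest looking elsewhere, for example at the class and filter counts in the .cfg file.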

walzimmer commented 3 years ago

I have solved this error by removing annotations (from the visdrone dataset) that contain the object class "ignored-region" or "other" to match the number of classes in the visdrone.data file:

classes= 10
train=data/visdrone/train.txt
valid=data/visdrone/test.txt
names=data/visdrone.names
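
The cleanup described above could be scripted roughly as follows. This is a minimal sketch, not the script actually used: it assumes the labels are already in YOLO text format with the ten real classes numbered 0 to 9, so that any other class id (such as ones standing for "ignored-region" or "other") is out of range. The function names are hypothetical.

```python
from pathlib import Path


def keep_valid_labels(lines, num_classes=10):
    """Keep only label lines whose class id lies in [0, num_classes)."""
    kept = []
    for line in lines:
        fields = line.split()
        if fields and 0 <= int(float(fields[0])) < num_classes:
            kept.append(line)
    return kept


def clean_label_file(path, num_classes=10):
    """Rewrite one label file in place and return how many lines were dropped."""
    path = Path(path)
    lines = path.read_text().splitlines()
    kept = keep_valid_labels(lines, num_classes)
    # Write the surviving lines back, one object per line.
    path.write_text("".join(line + "\n" for line in kept))
    return len(lines) - len(kept)
```

Since `clean_label_file` overwrites the file, it is worth running it on a copy of the label directory first. Blank lines are dropped and counted along with out-of-range labels.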