Index Out of Bounds error when training on COCO

nemtiax commented 4 years ago

I'm trying to train from scratch, starting with training on COCO, but I'm getting an index out of bounds error somewhere in PredLayer.forward(). For what it's worth, I don't think the reported line 285 is necessarily the true source of the error; it might just be where CPU execution got to before the GPU complained. The only changes I made to set up training were to modify the paths to the training and validation sets in train.py:

    if args.dataset == 'COCO':
        train_img_dir = '../train2017'
        train_json = '../annotations/instances_train2017.json'
        val_img_dir = '../CEPDOF/Lunch1'
        val_json = '../CEPDOF/annotations/Lunch1.json'

I assume the root cause of this issue is some mistake I made in setting things up, or maybe forgetting to specify some key parameter, but I haven't been able to track it down yet. I figured it might be worth checking if anyone has run into this same problem and already knows how to fix it, or what I might've done wrong?

Only train on person images and object: True
effective batch size = 1 * 128
initialing dataloader...
Only train on person images and objects
Loading annotations ../annotations/instances_train2017.json into memory...
Training on perspective images; adding angle to BBs
Using backbone Darknet-53. Loading ImageNet weights....
Number of parameters in backbone: 40584928
/opt/conda/conda-bld/pytorch_1579022034529/work/aten/src/ATen/native/cuda/IndexKernel.cu:60: lambda [](int)->auto::operator()(int)->auto: block: [0,0,0], thread: [0,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
Traceback (most recent call last):
  File "train.py", line 253, in <module>
    loss = model(imgs, targets, labels_cats=cats)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/RAPiD_clean/RAPiD/models/rapid.py", line 78, in forward
    boxes_S, loss_S = self.pred_S(detect_S, self.img_size, labels)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/RAPiD_clean/RAPiD/models/rapid.py", line 285, in forward
    target[b,best_n,truth_j,truth_i,0] = tx_all[b,:n][valid_mask] - tx_all[b,:n][valid_mask].floor()
RuntimeError: copy_if failed to synchronize: cudaErrorAssert: device-side assert triggered
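
A side note for anyone hitting a similar device-side assert: CUDA kernels launch asynchronously, so the Python line shown in the traceback may not be the one that actually failed. Setting the `CUDA_LAUNCH_BLOCKING=1` environment variable makes kernel launches synchronous, which usually lets the traceback point at the offending operation. A minimal sketch, assuming the variable is set at the very top of the training script before any CUDA work starts (placing it there is an assumption for illustration, not something the repository does):

```python
import os

# Assumption: these lines run before PyTorch initializes CUDA,
# e.g. as the first statements of the training script.
# CUDA_LAUNCH_BLOCKING=1 forces synchronous kernel launches, so a
# device-side assert is reported at the call that triggered it
# (at the cost of slower execution; use it for debugging only).
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

print("CUDA_LAUNCH_BLOCKING =", os.environ["CUDA_LAUNCH_BLOCKING"])
```

Running the failing step on the CPU is another common way to turn the opaque assert into an ordinary Python IndexError.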

nemtiax commented 4 years ago

I think I have narrowed down my issue a bit, to these lines in PredLayer:

            penalty_mask[b,best_n,truth_j,truth_i] = 1
            obj_mask[b,best_n,truth_j,truth_i] = 1

Adding some prints just before shows these sizes and values:

B: 0
BEST_N: tensor([1, 1, 1, 1, 2, 2, 0, 0], device='cuda:0')
TRUTH_J: tensor([47, 49, 51, 56, 43, 51, 37, 48], device='cuda:0')
TRUTH_I: tensor([37, 42, 58, 85, 30, 56, 13, 41], device='cuda:0')
PENALTY MASK SHAPE: torch.Size([1, 3, 84, 84])
OBJ MASK SHAPE: torch.Size([1, 3, 84, 84])

Note that truth_i contains the value 85, but the relevant dimension of penalty_mask and obj_mask is only size 84. I'm not sure I understand what truth_i represents in this code, so I'm having trouble tracking down where this mismatch might be coming from.
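
The mismatch can be checked mechanically before the indexing happens. The sketch below is plain Python using the values printed above; `find_out_of_range` is a hypothetical helper for illustration, not a function from the RAPiD code.

```python
def find_out_of_range(indices, size):
    """Return (position, value) pairs for indices outside [0, size).

    Negative values are treated as invalid here because a grid cell
    index should never be negative, even though tensor indexing
    itself would accept them.
    """
    return [(pos, idx) for pos, idx in enumerate(indices) if not 0 <= idx < size]


# Values copied from the debug prints above.
truth_j = [47, 49, 51, 56, 43, 51, 37, 48]
truth_i = [37, 42, 58, 85, 30, 56, 13, 41]
grid_size = 84  # last two dimensions of penalty_mask / obj_mask

print(find_out_of_range(truth_j, grid_size))  # []
print(find_out_of_range(truth_i, grid_size))  # [(3, 85)]
```

A check like this, run on the CPU just before the mask assignment, would flag the fourth ground-truth box as the one that falls outside the 84-cell grid.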

nemtiax commented 4 years ago

It looks like the problem is with the labels tensor. This is from a different run (so different problem sample than above):

LABELS SHAPE: torch.Size([1, 50, 5])
Labels: tensor([[[  0.5132,   1.0385,   0.2189,   0.4048, -75.2826],
...
...

Note that the y coordinate of the label box is >1, which I think then ends up in t_j as a value that's above the grid size when it does this:

tx_all, ty_all = labels[:,:,0] * nG, labels[:,:,1] * nG # 0-nG

Is this maybe occurring when it tries to rotate COCO boxes and moves something outside the image bounds? Is there some preprocessing I need to do on COCO?
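
To make the arithmetic concrete, here is a small sketch of how a normalized center coordinate above 1 turns into a grid index past the end of the grid. The grid size of 84 is an assumption borrowed from the mask shape in the earlier run (this run's value was not printed), and `grid_cell` is a hypothetical helper, not the repository's code.

```python
def grid_cell(normalized_coord, n_grid):
    """Scale a normalized coordinate by the grid size and truncate to a cell index."""
    return int(normalized_coord * n_grid)


n_grid = 84  # assumption: taken from the 84x84 masks in the earlier run
x_center, y_center = 0.5132, 1.0385  # first label printed above

col = grid_cell(x_center, n_grid)
row = grid_cell(y_center, n_grid)

# A valid cell index must satisfy 0 <= index < n_grid.
print(col, 0 <= col < n_grid)
print(row, 0 <= row < n_grid)
```

Under that assumption the x coordinate lands in column 43, which is valid, while the y coordinate lands in row 87, which is past the last row (83).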

nemtiax commented 4 years ago

One more update (apologies if anyone watching the repo is getting spammed with notifications). If I turn off augmentation by setting enable_aug = False in train.py, this problem seems to go away. My reading of the paper was that I should have augmentation enabled for both COCO training and overhead finetuning ("Rotation, flipping, resizing, and color augmentation are used in both training stages"). Maybe I need to preprocess the COCO images to make them amenable to rotation (for example, pad them to a square)?
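
The geometry behind the padding idea can be sketched in a few lines. This is only an illustration: the function names, the 640x480 image size, and rotation about the image center are all assumptions, and none of it is the RAPiD augmentation code.

```python
import math


def rotate_about_center(x, y, width, height, degrees):
    """Rotate the point (x, y) about the center of a width x height image."""
    cx, cy = width / 2, height / 2
    theta = math.radians(degrees)
    dx, dy = x - cx, y - cy
    new_x = cx + dx * math.cos(theta) - dy * math.sin(theta)
    new_y = cy + dx * math.sin(theta) + dy * math.cos(theta)
    return new_x, new_y


def inside(x, y, width, height):
    """True if the point lies within the image bounds."""
    return 0 <= x < width and 0 <= y < height


# A box center near the right edge of a hypothetical 640x480 image.
x, y = 600, 240

# Without padding, a 90 degree rotation pushes the center out of the frame.
print(inside(*rotate_about_center(x, y, 640, 480, 90), 640, 480))  # False

# Pad to a 640x640 square first (80 px added above and below), then rotate.
pad_top = (640 - 480) // 2
print(inside(*rotate_about_center(x, y + pad_top, 640, 640, 90), 640, 640))  # True
```

In a non-square image, points near the ends of the long side can rotate past the short side's bounds, whereas every point of the original image stays in frame after padding to a square with side equal to the long side when the rotation is a multiple of 90 degrees. Arbitrary angles can still push corner regions out of the frame.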

nemtiax commented 4 years ago

I think I cracked the mystery.

https://github.com/duanzhiihao/RAPiD/blob/55b9a62739f6480bdc84fbff31f2c92d776030ba/datasets.py#L131

self.coco = True if 'COCO' in img_path else False

I didn't happen to put my COCO images in a path named COCO, and so the dataset would decide I'm not using COCO whenever it went to load an image. Then the augmentation code would figure that since I'm not using COCO, it doesn't have to pad. This would lead to bounding boxes getting rotated out of the image frame, which means they get assigned to an anchor ID that doesn't exist, leading to the index out of bounds exception.

It might be nice to add an assertion in train.py that verifies your image path contains the string COCO if you set dataset=COCO, so other people don't run into this problem.
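
A minimal sketch of what such a check could look like. `check_coco_path` is a hypothetical helper name and the error message is made up for illustration; only the `'COCO' in img_path` substring test mirrors the datasets.py line quoted above.

```python
def check_coco_path(dataset, img_dir):
    """Fail fast if dataset is 'COCO' but the image path lacks the 'COCO' substring.

    The dataset class decides whether it is handling COCO with a
    substring test on the image path, so a mismatch silently disables
    the COCO-specific handling during augmentation.
    """
    if dataset == 'COCO' and 'COCO' not in img_dir:
        raise ValueError(
            f"dataset is 'COCO' but image path {img_dir!r} does not contain 'COCO'; "
            "the dataset loader would not treat these images as COCO"
        )


# Passes: the path contains the expected substring (hypothetical path).
check_coco_path('COCO', '../COCO/train2017')

# Fails: this is the path layout used in this issue.
try:
    check_coco_path('COCO', '../train2017')
except ValueError as err:
    print("caught:", err)
```

Calling this once in train.py right after the paths are set would surface the problem at startup instead of as a device-side assert in the middle of training.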

duanzhiihao commented 4 years ago

I didn't happen to put my COCO images in a path named COCO, and so the dataset would decide I'm not using COCO whenever it went to load an image. Then the augmentation code would figure that since I'm not using COCO, it doesn't have to pad. This would lead to bounding boxes getting rotated out of the image frame, which means they get assigned to an anchor ID that doesn't exist, leading to the index out of bounds exception.

That is exactly correct!

It might be nice to add an assertion in train.py that verifies your image path contains the string COCO if you set dataset=COCO, so other people don't run into this problem.

Thank you for your advice. I'll do it in an upcoming commit.

nagisawarm commented 11 months ago

Thanks bro, I made the same mistake as you. It would have taken me hours without your help.

duanzhiihao / RAPiD

Index Out of Bounds error when training on COCO #11