Closed: nemtiax closed this issue 4 years ago
I think I have narrowed down my issue a bit, to these lines in PredLayer:
penalty_mask[b,best_n,truth_j,truth_i] = 1
obj_mask[b,best_n,truth_j,truth_i] = 1
Adding some prints just before shows these sizes and values:
B: 0
BEST_N: tensor([1, 1, 1, 1, 2, 2, 0, 0], device='cuda:0')
TRUTH_J: tensor([47, 49, 51, 56, 43, 51, 37, 48], device='cuda:0')
TRUTH_I: tensor([37, 42, 58, 85, 30, 56, 13, 41], device='cuda:0')
PENALTY MASK SHAPE: torch.Size([1, 3, 84, 84])
OBJ MASK SHAPE: torch.Size([1, 3, 84, 84])
Note that truth_i contains the value 85, but the relevant dimension of penalty_mask and obj_mask is only size 84. I'm not sure I understand what truth_i represents in this code, so I'm having trouble tracking down where this mismatch might be coming from.
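For anyone hitting the same crash: my best guess is that truth_j and truth_i are the integer grid row and column of each ground-truth box center, so any value of 84 or more overflows the 84-cell masks. Here is a minimal reproduction using the printed values above (the bounds check is my own addition, not something in RAPiD):

import torch

nB, nA, nG = 1, 3, 84                                        # batch, anchors, grid size from the prints above
penalty_mask = torch.zeros(nB, nA, nG, nG)

b = 0
best_n  = torch.tensor([1, 1, 1, 1, 2, 2, 0, 0])
truth_j = torch.tensor([47, 49, 51, 56, 43, 51, 37, 48])
truth_i = torch.tensor([37, 42, 58, 85, 30, 56, 13, 41])     # 85 is past the last valid index (83)

# my own sanity check, not part of RAPiD: flag any ground truth whose grid index overflows
bad = (truth_i >= nG) | (truth_j >= nG)
print('out-of-range grid indices:', truth_i[bad].tolist(), truth_j[bad].tolist())

penalty_mask[b, best_n, truth_j, truth_i] = 1                # IndexError on CPU; on GPU it surfaces as the asynchronous error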
It looks like the problem is with the labels tensor. This is from a different run (so a different problematic sample than the one above):
LABELS SHAPE: torch.Size([1, 50, 5])
Labels: tensor([[[ 0.5132, 1.0385, 0.2189, 0.4048, -75.2826],
...
...
Note that the y coordinate of the label box is >1, which I think then ends up in t_j as a value that's above the grid size when it does this:
tx_all, ty_all = labels[:,:,0] * nG, labels[:,:,1] * nG # 0-nG
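To put numbers on that (my own arithmetic, assuming the grid index is just the scaled value truncated to an integer):

nG = 84                                # grid size from the mask shapes above
cy = 1.0385                            # normalized y center from the label above
ty = cy * nG                           # 87.234
print(ty, int(ty))                     # grid row 87, but valid rows are only 0..83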
Is this maybe occurring when it rotates COCO boxes and moves one outside the image bounds, or something like that? Is there some preprocessing I need to do on COCO?
One more update (apologies if anyone watching the repo is getting spammed with notifications). If I turn off augmentation by setting enable_aug = False in train.py, this problem seems to go away. My reading of the paper was that I should have augmentation enabled for both COCO training and overhead fine-tuning ("Rotation, flipping, resizing, and color augmentation are used in both training stages"). Maybe I need to preprocess the COCO images to make them amenable to rotation (pad them to be square, perhaps)?
I think I cracked the mystery.
https://github.com/duanzhiihao/RAPiD/blob/55b9a62739f6480bdc84fbff31f2c92d776030ba/datasets.py#L131
self.coco = True if 'COCO' in img_path else False
I didn't happen to put my COCO images in a path named COCO, so the dataset decided I wasn't using COCO whenever it went to load an image. The augmentation code then figured that, since I wasn't using COCO, it didn't have to pad. That lets bounding boxes get rotated out of the image frame, so they get assigned to a grid index that doesn't exist, which causes the index-out-of-bounds exception.
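To illustrate the failure mode with my own toy numbers (this is not the repo's augmentation code, just the geometry): a box center near the edge of an unpadded, non-square image can rotate out of the frame, which becomes a normalized coordinate above 1 and then a grid index past the last cell.

import math

def rotate_point(x, y, cx, cy, deg):
    # rotate (x, y) about (cx, cy) by deg degrees
    r = math.radians(deg)
    dx, dy = x - cx, y - cy
    return (cx + dx * math.cos(r) - dy * math.sin(r),
            cy + dx * math.sin(r) + dy * math.cos(r))

w, h, nG = 640, 480, 84                               # non-square image, 84-cell output grid
xr, yr = rotate_point(620, 460, w / 2, h / 2, 45)     # box center near the bottom-right corner
print(yr / h)                                         # ~1.27: normalized y is now > 1
print(int(yr / h * nG))                               # ~106: grid row far past the last valid index (83)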
It might be nice to add an assertion in train.py that verifies your image path contains the string COCO if you set dataset=COCO, so other people don't run into this problem.
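Something along these lines is what I had in mind (hypothetical variable names; I haven't checked what the dataset flag and image-path variables are actually called in train.py):

# hypothetical guard for train.py; the real variable names may differ
if args.dataset == 'COCO':
    assert 'COCO' in train_img_dir, (
        "datasets.py only enables its COCO-specific padding when the image path "
        "contains the string 'COCO'; rename the directory or adjust that check."
    )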
I didn't happen to put my COCO images in a path named COCO, so the dataset decided I wasn't using COCO whenever it went to load an image. The augmentation code then figured that, since I wasn't using COCO, it didn't have to pad. That lets bounding boxes get rotated out of the image frame, so they get assigned to a grid index that doesn't exist, which causes the index-out-of-bounds exception.
That is exactly correct!
It might be nice to add an assertion in train.py that verifies your image path contains the string COCO if you set dataset=COCO, so other people don't run into this problem.
Thank you for your advice. I'll do it in an upcoming commit.
Thanks! I made the same mistake as you; it would have taken me hours to find without your help.
I'm trying to train from scratch, starting with training on COCO, but I'm getting an index-out-of-bounds error somewhere in PredLayer.forward() (for what it's worth, I don't think the reported line 285 is necessarily the true source of the error; it might just be where CPU execution got to before the GPU complained). The only changes I made to set up training were to modify the paths to the training and validation sets in train.py:
I assume the root cause is some mistake I made in setting things up, or maybe a key parameter I forgot to specify, but I haven't been able to track it down yet. I figured it might be worth checking whether anyone has run into this same problem and already knows how to fix it, or can spot what I might have done wrong.