duanzhiihao / RAPiD

RAPiD: Rotation-Aware People Detection in Overhead Fisheye Images (CVPR 2020 Workshops)
http://vip.bu.edu/rapid/

Trouble with fine-tuning #13

Open nemtiax opened 3 years ago

nemtiax commented 3 years ago

I've trained a model on COCO for 50k iterations (I know the paper says 100k, but that takes a long time, and I wanted to verify that I'm on the right track), and now I'm trying to fine-tune it on HABBOF+CEPDOF. COCO training seemed to progress well; I used one split of CEPDOF (Lunch1) as my validation data:

[Screenshot: validation AP50 curve on the CEPDOF Lunch1 split during COCO training]

0.9 AP50 on a CEPDOF split seems great, given that it's only trained on COCO images so far. (ignore the little zigzag at the start - I forgot to reset the logs after a misconfigured run).

I had to do the following steps to get it to run:

- Convert HABBOF annotation text files into a CEPDOF-style JSON annotation file (see the sketch after this list). This was pretty straightforward, although it did require enforcing the CEPDOF restriction that r<90 (HABBOF appears to allow r<=90). I just swapped all r=90 instances to r=-90.
- Rename a few CEPDOF folders to match the names in train.py. In particular, CEPDOF has "High_activity" whereas train.py has "Activity", and CEPDOF has "Edge_Cases" whereas train.py has "Edge_cases".
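
A minimal sketch of such a conversion, assuming HABBOF lines of the form "person cx cy w h angle" and a CEPDOF-style JSON with "images"/"annotations" lists (the field names and the hard-coded image size are assumptions, not verified against the official loaders):

```python
import json
from pathlib import Path

# Sketch of a HABBOF -> CEPDOF-style conversion. Assumed formats: one
# "person cx cy w h angle" line per object in each .txt file, and a
# COCO-like JSON with 'images' and 'annotations' on the CEPDOF side.
def convert_habbof_split(split_dir, out_json, img_size=2048):
    split_dir = Path(split_dir)
    images, annotations = [], []
    ann_id = 0
    for txt in sorted(split_dir.glob('*.txt')):
        img_id = txt.stem                      # e.g. '000001'
        images.append({'id': img_id, 'file_name': img_id + '.jpg',
                       'width': img_size, 'height': img_size})
        for line in txt.read_text().splitlines():
            parts = line.split()
            if not parts:
                continue
            cx, cy, w, h, r = map(float, parts[1:6])
            if r == 90:                        # CEPDOF requires r < 90
                r = -90
            annotations.append({'id': ann_id, 'image_id': img_id,
                                'bbox': [cx, cy, w, h, r]})
            ann_id += 1
    with open(out_json, 'w') as f:
        json.dump({'images': images, 'annotations': annotations}, f)
```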

I then ran train.py again, pointing it to my previous COCO checkpoint.

python train.py --dataset=H1H2 --batch_size=8 --checkpoint=rapid_pL1_dark53_COCO608_Oct15_52000.ckpt

However, I'm getting a crash at the end of the first validation section:

Total time: 0:00:09.982568, iter: 0:00:09.982568, epoch: 2:33:45.084843
[Iteration -1] [learning rate 4e-10] [Total loss 373.04] [img size 672]
level_21 total 7 objects: xy/gt 1.204, wh/gt 0.024, angle/gt 0.337, conf 44.211
level_42 total 24 objects: xy/gt 1.111, wh/gt 0.026, angle/gt 0.268, conf 78.706
level_84 total 35 objects: xy/gt 1.319, wh/gt 0.028, angle/gt 0.464, conf 142.954
Max GPU memory usage: 10.145342826843262 GigaBytes
Using PIL.Image format
100%|███████████████████████████████████████████████| 1792/1792 [06:36<00:00,  4.52it/s]
accumulating results
Traceback (most recent call last):
  File "train.py", line 228, in <module>
    str_0 = val_set.evaluate_dtList(dts, metric='AP')
  File "/home/ubuntu/RAPiD_clean/RAPiD/utils/MWtools.py", line 77, in evaluate_dtList
    self._accumulate(**kwargs)
  File "/home/ubuntu/RAPiD_clean/RAPiD/utils/MWtools.py", line 204, in _accumulate
    assert ((tp_sum[:,-1] + fp_sum[:,-1]) == num_dt).all()
IndexError: index -1 is out of bounds for dimension 1 with size 0

After adding a few print statements in MWtools/_accumulate, I think the problem is that I'm getting no detections:


Total time: 0:00:09.720600, iter: 0:00:09.720600, epoch: 2:29:42.951354
[Iteration -1] [learning rate 4e-10] [Total loss 188.62] [img size 672]
level_21 total 10 objects: xy/gt 1.033, wh/gt 0.013, angle/gt 0.350, conf 17.059
level_42 total 18 objects: xy/gt 1.259, wh/gt 0.015, angle/gt 0.209, conf 45.737
level_84 total 15 objects: xy/gt 1.318, wh/gt 0.016, angle/gt 0.229, conf 62.061
Max GPU memory usage: 10.145342826843262 GigaBytes
Using PIL.Image format
100%|███████████████████████████████████████████████| 1792/1792 [06:07<00:00,  4.88it/s]
accumulating results
NUM_GT: 6917
TPS tensor([], size=(10, 0), dtype=torch.bool)
FPS tensor([], size=(10, 0), dtype=torch.bool)
NUM_DT 0
TP_SUM tensor([], size=(10, 0))
FP_SUM tensor([], size=(10, 0))

Any advice on what might be going wrong here? Have I missed a step in setting up fine-tuning?

Also, would it be possible to make your pre-finetuning COCO checkpoint available? It'd save me a lot of training time.

duanzhiihao commented 3 years ago

After adding a few print statements in MWtools/_accumulate, I think the problem is that I'm getting no detections:

I agree with you. Could you try to fine-tune on 'H1H2' (btw, I renamed it to 'HBCP' in the latest commit) using the weights that I posted in the README.md? For example,

python train.py --dataset=H1H2 --batch_size=8 --checkpoint=pL1_HBCP608_Apr14_6000.ckpt

If that solves the problem, it indicates that your weights pre-trained on COCO somehow contain a bug.

Regarding my COCO checkpoint, sorry that I forgot to upload it. I will upload it in the following days. (The server that contains my COCO checkpoint is under maintenance, so I don't have access to the COCO ckpt for now.)

nemtiax commented 3 years ago

Thanks, I'll give it a try and report back!

I appreciate all your fast replies!

nemtiax commented 3 years ago

I suspect that the issue has something to do with HABBOF (and so is probably related to my preprocessing of it). I tried re-starting fine-tuning but using CEPDOF Lunch1 as my val split, and it made plenty of detections. When using HABBOF, it reports the correct number of ground truth annotations, I think, so it seems to be reading my annotation file correctly. But maybe there's an issue with how I've named or stored my images.

nemtiax commented 3 years ago

I think the culprit likely lies here:

https://github.com/duanzhiihao/RAPiD/blob/958bc5b8fd0fe2d34e5e25f822e6eb2a9533c829/utils/dataloader.py#L103

When converting HABBOF to CEPDOF format, I didn't enforce that image file names should match image IDs. Each HABBOF split has the same file names (dddddd.jpg), so they can't be used as IDs on their own. I just prefixed the IDs with the folder name, but didn't rename the files to match. I will try fixing this and see if it resolves my issue.

nemtiax commented 3 years ago

Renaming my HABBOF files so that every file has a unique name and changing my converted HABBOF annotations so that the image ID field always matches the file name seems to have resolved this issue.
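
In case it is useful to anyone else, a minimal sketch of that fix-up, assuming the same CEPDOF/COCO-style JSON fields as the conversion sketch earlier in this thread:

```python
import json
from pathlib import Path

# Sketch of a fix-up: prefix every HABBOF image file with its split/folder
# name so the name is globally unique, then rewrite the annotation JSON so
# that each image id equals the new file-name stem (the dataloader derives
# image ids from file names). Field names assume a CEPDOF/COCO-style layout.
def make_ids_match(split_dir, ann_json):
    split_dir = Path(split_dir)
    prefix = split_dir.name

    # Rename the image files on disk, e.g. 000001.jpg -> <folder>_000001.jpg
    for jpg in sorted(split_dir.glob('*.jpg')):
        if not jpg.stem.startswith(prefix):
            jpg.rename(jpg.with_name(f'{prefix}_{jpg.name}'))

    # Rewrite ids and file names in the annotation file to match.
    coco = json.loads(Path(ann_json).read_text())
    id_map = {}
    for img in coco['images']:
        new_id = f"{prefix}_{Path(img['file_name']).stem}"
        id_map[img['id']] = new_id
        img['id'] = new_id
        img['file_name'] = new_id + '.jpg'
    for ann in coco['annotations']:
        ann['image_id'] = id_map.get(ann['image_id'], ann['image_id'])
    Path(ann_json).write_text(json.dumps(coco))
```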

I think what is happening is that there are two code paths involved in evaluation. One is responsible for reading the annotation file and mapping image_id -> ground truth. The other is responsible for reading the image files, running inference, and mapping image_id -> detections. But the inference branch doesn't know what the image IDs are - those only exist in the annotation file - so it assumes that image_id is just the file name minus the .jpg. This is fine for CEPDOF, because that's always true there. But if they don't match, then when the evaluator goes to fetch your detections, it loops over only the IDs that were in the annotations and checks for detections matching those IDs. Since the filenames are different from the image IDs, it doesn't find detections for any IDs.

The ideal fix would probably be to have the inference branch map fileName -> detections, and then have the evaluator look up the appropriate image ID for each file name. A quick fix might be to add a warning to the README saying that this correspondence must be enforced, and maybe add an assertion to the evaluation script checking that the map coming from the detection branch has a key for every annotated image ID.
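
For illustration, such a check might look like this (a sketch only; gt_image_ids and dt_by_image are hypothetical names standing in for the evaluator's annotation IDs and its image_id -> detections map, not the actual variables in MWtools.py):

```python
# Hypothetical sanity check to run before accumulating per-image matches:
# every image id present in the ground-truth annotations must also be a key
# in the detections map, otherwise file names and annotation ids are out of
# sync and the evaluator silently sees zero detections.
def check_ids_consistent(gt_image_ids, dt_by_image):
    missing = [img_id for img_id in gt_image_ids if img_id not in dt_by_image]
    assert not missing, (
        f'{len(missing)} annotated image ids have no detection entry, '
        f'e.g. {missing[:5]}. Make sure each image file name (minus .jpg) '
        f'matches the corresponding "id" in the annotation JSON.'
    )
```

Called right before the accumulation step, this would turn the IndexError above into a readable error message.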

twmht commented 2 years ago

@duanzhiihao

any update on this?

Are you going to release the fine-tuning instructions in the README?

duanzhiihao commented 2 years ago

Thank you for your attention. I forgot to release the docs for fine-tuning since I was moving to another project.

The fine-tuning should be straightforward: python train.py --dataset=H1H2 --batch_size=8 --checkpoint=pL1_HBCP608_Apr14_6000.ckpt

I will update the README once I rerun the experiment and verify the results.

twmht commented 2 years ago

@duanzhiihao

What new project are you working on? Is it still a fisheye-related project?

fabiorigano commented 2 years ago

Hi @duanzhiihao,

Regarding my COCO checkpoint, sorry that I forgot to upload it. I will upload it in the following days. (The server that contains my COCO checkpoint is under maintenance so I don't have access the COCO ckpt for now)

Where can I find your COCO checkpoint? Could you upload a link to this repository, please?

duanzhiihao commented 2 years ago

Hi, please see the README for a newly uploaded COCO pre-trained checkpoint. This checkpoint is, however, sub-optimal, as it is only trained for 20k iterations, while more than 100k iterations would be ideal. I didn't train it for longer since I don't have enough computational resources for now. I believe further tuning it on COCO and/or fisheye datasets will significantly improve performance.

fabiorigano commented 2 years ago

Thank you!