NVIDIA / retinanet-examples

Fast and accurate object detection with end-to-end GPU optimization
BSD 3-Clause "New" or "Revised" License
887 stars 271 forks

Problem with the validation step during training: No detections! #219

Closed ttamyurek closed 3 years ago

ttamyurek commented 4 years ago

Hi,

I'm getting the "No detections!" output message for the validation step while training a retinanet model. This problem started to occur after I updated to v0.2.5. It was working fine when I was using v0.2.3.

I'm training retinanet on my own custom dataset by fine-tuning from the COCO checkpoints provided with the 19.04 release. I was originally using v0.2.3. I later wanted to update to v0.2.5 and retrain the model to see if there would be any improvement, but with all of the parameters being the same, I get the "No detections!" message. I tried both the RN50FPN and MobileNet backbones and nothing changed.

Other things I've tried:

Command I used to run training (Only difference is the train/val datasets)

odtk train retinanet_ResNet50FPN_COCO_test.pth --backbone ResNet50FPN --jitter 360 640 \
    --images /datasets/COCO/val2017 --annotations /datasets/COCO/annotations/instances_val2017.json \
    --val-images /datasets/COCO/val2017 --val-annotations /datasets/COCO/annotations/instances_val2017.json \
    --lr 0.0005 --classes 2 --batch 1 --iters 200000 --resize 640  \
    --fine-tune checkpoints/retinanet_rn50fpn.pth

For inference: odtk infer retinanet_ResNet50FPN_COCO_test.pth --images=/datasets/COCO/val2017 --annotations=/datasets/COCO/annotations/instances_val2017.json

For the Docker setup I simply followed the instructions in the README file. HW: NVIDIA GeForce RTX 2080

ghost commented 4 years ago

Instead of trying individual versions, can you try the 20.03 branch and master? Also, how many iterations are you fine-tuning your model for?

ttamyurek commented 4 years ago

I just tried again with the master and 20.03 branches, running the same training command as above. It works fine on the 20.03 branch, but still fails on master. However, inference works for both of them, and the code outputs correct detection results for the trained model.

ghost commented 4 years ago

This shouldn't happen: the training code is exactly the same if you are not using DALI, which, from your command, you are not. Do you have two separate containers running, i.e. PyTorch 20.06 for master and 20.03 for the 20.03 branch?

ttamyurek commented 4 years ago

No, I'm not using DALI, and I'm using the same container for both branches. My command history for testing both branches looks like this:

cd /retinanet-examples
git checkout 20.03
docker build -t odtk:latest .
docker run --gpus all --rm --ipc=host -it odtk:latest #-v my training dirs

odtk train retinanet_ResNet50FPN_COCO_test.pth --backbone ResNet50FPN --jitter 360 640 \
    --images /datasets/COCO/val2017 --annotations /datasets/COCO/annotations/instances_val2017.json \
    --val-images /datasets/COCO/val2017 --val-annotations /datasets/COCO/annotations/instances_val2017.json \
    --lr 0.0005 --classes 2 --batch 1 --iters 200000 --resize 640  \
    --fine-tune checkpoints/retinanet_rn50fpn.pth
# Works
exit

git checkout master
docker build -t odtk:latest .
docker run --gpus all --rm --ipc=host -it odtk:latest #-v my training dirs

odtk train retinanet_ResNet50FPN_COCO_test.pth --backbone ResNet50FPN --jitter 360 640 \
    --images /datasets/COCO/val2017 --annotations /datasets/COCO/annotations/instances_val2017.json \
    --val-images /datasets/COCO/val2017 --val-annotations /datasets/COCO/annotations/instances_val2017.json \
    --lr 0.0005 --classes 2 --batch 1 --iters 200000 --resize 640  \
    --fine-tune checkpoints/retinanet_rn50fpn.pth

# Doesn't work
# Getting No Detections! error
ghost commented 4 years ago

The master branch uses the 20.06 container, and running it requires CUDA 11 and driver version 450. Can you install those and then docker pull nvcr.io/nvidia/pytorch:20.06-py3? Also, when building the Docker images, please don't reuse the same tag; it gets confusing. For master use docker build -t odtk:20.06 . and for 20.03 use docker build -t odtk:20.03 .

ttamyurek commented 4 years ago

Thank you for your suggestion. I do have CUDA 11 and driver version 450 installed on my computer. I tried the master branch again using the 20.06 container, but I'm still seeing the same behavior; it didn't resolve the issue.

ghost commented 4 years ago

Hello, I can confirm that this issue shows up with 20.06.

It occurs as soon as backpropagation runs. Perhaps the model used for inference during validation has inaccurate weights.

However, the model that gets saved after this call works fine with a separate infer run. I will look further into this issue; any ideas for a fix are welcome.

whria78 commented 4 years ago

I also experienced "No Detection". :(

Should I use 20.03 ?

ttamyurek commented 4 years ago

> I also experienced "No Detection". :(
>
> Should I use 20.03 ?

Yes, I'm using 20.03 and it works.

ghost commented 4 years ago

@ttamyurek, @whria78 as a workaround for this issue, you can comment out these lines during training/validation and use the 20.06/20.07 containers with the master branch, while we figure out why the custom layers are returning zero values and no detections: https://github.com/NVIDIA/retinanet-examples/blob/master/retinanet/box.py#L262-L264 and https://github.com/NVIDIA/retinanet-examples/blob/master/retinanet/box.py#L315-L317

NOTE: This will make validation slower, so uncomment these lines again before running inference.

I recommend installing ODTK with pip install --no-cache-dir -e <PATH/TO/THIS/REPO/> so you can comment/uncomment these lines without reinstalling the package each time.
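For context, the lines referenced above call the repo's custom CUDA decode/NMS layers; commenting them out makes ODTK fall back to its PyTorch path. The greedy NMS idea behind that step can be sketched in plain Python (hypothetical helpers for illustration, not ODTK's actual code):

```python
def iou(a, b):
    # a, b: boxes as (x1, y1, x2, y2)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    iw, ih = max(0.0, ix2 - ix1), max(0.0, iy2 - iy1)
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    # Greedy NMS: repeatedly keep the highest-scoring box and drop
    # any remaining box that overlaps it by more than the threshold.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep
```

If NMS like this returns an empty keep list for every image (e.g. because the custom layers decoded all-zero boxes), the evaluation loop reports "No detections!".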

ttamyurek commented 4 years ago

@nakul3112 Do you ever get any detection results at any time during training? Could the model be diverging?

nakul3112 commented 4 years ago

I got them sometimes, i.e. a table of mAP and mAR values. In any case, by training in the 20.03 container with --fine-tune, I was able to train the model. Now my concern is: how do I tell from the metrics alone that the model will perform accurately on test images?

ttamyurek commented 4 years ago

The mAP and AR tables shown in the console reflect the performance of your model on the validation set you provided via --val-images and --val-annotations. If your test images don't have annotations, you can run inference with your trained model on those images and visualize the results to see how well it performs.

nakul3112 commented 4 years ago

@ttamyurek Thanks for the help. I was able to visualize the results on the test images and got good results.

However, in some images a single bounding box is placed over a cluster of identical objects, whereas I would like the bounding boxes to cover the individual objects. I think I need to retrain with different parameters to improve the metrics, and then visualize again.

Please let me know if there is anything else I could try, other than retraining.

ttamyurek commented 4 years ago

Use a score threshold when visualizing the detections, for example 0.5.

xonobo commented 4 years ago

Managed to get detections by changing the docker's pytorch to 20.07-py3.

nakul3112 commented 4 years ago

Does anyone have an idea whether it is better to train the model with .jpg or .png images? Also, is it recommended to use the same image size at inference time as during training?

ghost commented 4 years ago

@xonobo I tried with 20.07 and couldn't get detections during training. Can anyone else confirm that using 20.07 resolves this issue?

james-nvidia commented 4 years ago

@nakul3112 the COCO evaluation metrics are described here: https://cocodataset.org/#detection-eval

As for what mAP you should be getting, it very much depends on your problem. Generally AP (the first number) should be above 30%. When I am working with remote sensing data I normally use AP@IoU=0.5 (the second number), which I hope to see above 75%.
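The AP@IoU=0.5 number counts a detection as a true positive when it overlaps an unmatched ground-truth box with IoU of at least 0.5; the matching rule can be sketched as follows (simplified hypothetical helpers, not the actual pycocotools evaluation, which also handles scores, categories, and multiple IoU thresholds):

```python
def iou(a, b):
    # a, b: boxes as (x1, y1, x2, y2)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    iw, ih = max(0.0, ix2 - ix1), max(0.0, iy2 - iy1)
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def precision_at_iou(pred_boxes, gt_boxes, thr=0.5):
    # Greedily match each prediction to the first unmatched ground-truth
    # box with IoU >= thr; precision = true positives / predictions.
    matched, tp = set(), 0
    for p in pred_boxes:
        for j, g in enumerate(gt_boxes):
            if j not in matched and iou(p, g) >= thr:
                matched.add(j)
                tp += 1
                break
    return tp / len(pred_boxes) if pred_boxes else 0.0
```

Full AP additionally averages precision over recall levels and, for the headline COCO AP, over IoU thresholds from 0.5 to 0.95.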

rbgreenway commented 4 years ago

Was the "No Detections!" issue ever resolved? I'm having this issue with odtk:latest.

ghost commented 4 years ago

@rbgreenway please see this comment: https://github.com/NVIDIA/retinanet-examples/issues/219#issuecomment-668883353

dketterer commented 4 years ago

I tested the master branch with pytorch:20.09-py3 and the issue remains.

I used this:

odtk train retinanet_rn18fpn.pth --backbone ResNet18FPN \
    --batch 1 \
    --jitter 256 256 \
    --resize 256 \
    --val-iters 2000 \
    --images /data/datasets/coco/images/train2017/ --annotations /data/datasets/coco/annotations/instances_train2017.json \
    --val-images /data/datasets/coco/images/val2017/ --val-annotations /data/datasets/coco/annotations/instances_val2017.json
dketterer commented 4 years ago

@yashnv is there any progress on this? Are you still investigating the issue?

ghost commented 4 years ago

This issue stems from CUB; it shows up as invalid device ordinals. I still can't find a minimal reproducible example.

james-nvidia commented 3 years ago

Closing due to inactivity

ghost commented 3 years ago

Fixed in master, issue was due to this: https://github.com/pytorch/pytorch/issues/52663