Closed — ttamyurek closed this issue 3 years ago
Instead of the release version, can you try branch 20.03 and master? Also, how many iterations are you fine-tuning your model for?
I just tried again with the master and 20.03 branches, running the same training command as above. It works fine for the 20.03 branch but still fails on master. However, inference works for both, and the code outputs the correct detection results for the trained model.
This shouldn't happen: the training code is exactly the same if you are not using DALI, and from your command it looks like you are not. Do you have two separate containers running, PyTorch 20.06 for master and 20.03 for branch 20.03?
No, I'm not using DALI, and I'm using the same container for both branches. My command history to test both branches is something like this:
cd /retinanet-examples
git checkout 20.03
docker build -t odtk:latest .
docker run --gpus all --rm --ipc=host -it odtk:latest #-v my training dirs
odtk train retinanet_ResNet50FPN_COCO_test.pth --backbone ResNet50FPN --jitter 360 640 \
--images /datasets/COCO/val2017 --annotations /datasets/COCO/annotations/instances_val2017.json \
--val-images /datasets/COCO/val2017 --val-annotations /datasets/COCO/annotations/instances_val2017.json \
--lr 0.0005 --classes 2 --batch 1 --iters 200000 --resize 640 \
--fine-tune checkpoints/retinanet_rn50fpn.pth
# Works
exit
git checkout master
docker build -t odtk:latest .
docker run --gpus all --rm --ipc=host -it odtk:latest #-v my training dirs
odtk train retinanet_ResNet50FPN_COCO_test.pth --backbone ResNet50FPN --jitter 360 640 \
--images /datasets/COCO/val2017 --annotations /datasets/COCO/annotations/instances_val2017.json \
--val-images /datasets/COCO/val2017 --val-annotations /datasets/COCO/annotations/instances_val2017.json \
--lr 0.0005 --classes 2 --batch 1 --iters 200000 --resize 640 \
--fine-tune checkpoints/retinanet_rn50fpn.pth
# Doesn't work
# Getting No Detections! error
The master branch uses the 20.06 container, and to run it you should have CUDA 11 and driver version 450 installed.
Can you do that and then docker pull nvcr.io/nvidia/pytorch:20.06-py3? Also, when building the Docker images, please don't use the same tag for both; it gets confusing.
For master use: docker build -t odtk:20.06 .
For 20.03 use: docker build -t odtk:20.03 .
Thank you for your suggestion. I do have CUDA 11 and driver version 450 installed on my computer. I tried again with the master branch using the 20.06 container, but I'm still getting the same behavior; it didn't resolve the issue.
Hello, I can confirm that this issue shows up with 20.06. It happens as soon as backpropagation runs, so maybe the model sent for inference during validation has inaccurate weights. But the model that gets saved after this call works fine in a separate infer run. I will look more into this issue; any ideas for a fix are welcome.
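For anyone digging into this: the usual PyTorch pattern for running validation mid-training is to switch to eval mode and disable gradient tracking, then switch back. This is a generic sketch of that pattern, not ODTK's actual loop (the model interface here is illustrative):

```python
import torch
import torch.nn as nn

def train_with_validation(model, train_batches, val_batches, optimizer, val_every=2):
    """Generic loop: train, then periodically run inference on validation data.

    Assumes the model returns a loss when called with targets and raw
    detections when called without them (a common detector convention).
    """
    model.train()
    for step, (images, targets) in enumerate(train_batches, start=1):
        loss = model(images, targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if step % val_every == 0:
            model.eval()                  # freeze batch-norm/dropout behavior
            with torch.no_grad():         # no autograd state during inference
                for images, _ in val_batches:
                    detections = model(images)  # would be decoded/scored here
            model.train()                 # resume training mode afterwards
    return model
```

If the eval/train switch (or the no_grad context) were missed around the validation pass, inference would run against training-mode statistics, which is one way validation results can look broken while a separately saved checkpoint infers fine.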
I also experienced "No Detection". :(
Should I use 20.03?
Yes, I'm using 20.03 and it works.
@ttamyurek, @whria78 As a workaround for this issue, you can comment out these lines during training/validation and use the 20.06/20.07 containers with the master branch while we figure out why the custom layers are returning 0 values and no detections: https://github.com/NVIDIA/retinanet-examples/blob/master/retinanet/box.py#L262-L264 and https://github.com/NVIDIA/retinanet-examples/blob/master/retinanet/box.py#L315-L317
NOTE: This will make validation slower, so uncomment these lines again before doing inference.
I recommend installing ODTK with pip install --no-cache-dir -e <PATH/TO/THIS/REPO/> to make it easier to comment/uncomment these changes without reinstalling the package again and again.
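For context on why commenting those lines helps: they sit on the branch that calls the compiled CUDA extension, and disabling that branch forces the pure-PyTorch fallback. As an illustration of the pattern only (the names and structure here are hypothetical, not ODTK's actual box.py), here is a dispatcher with a plain-Python NMS fallback:

```python
def iou(a, b):
    """Intersection-over-union of two boxes in [x1, y1, x2, y2] format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, threshold=0.5, extension=None):
    """Suppress boxes that overlap a higher-scoring box above the threshold.

    If a compiled extension is supplied, dispatch to it (this is the kind of
    branch the workaround comments out); otherwise run the Python fallback.
    """
    if extension is not None:
        return extension.nms(boxes, scores, threshold)  # fast compiled path
    # Pure-Python greedy NMS fallback: visit boxes in descending score order,
    # keep a box only if it does not overlap any already-kept box too much.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= threshold for j in keep):
            keep.append(i)
    return keep
```

With the extension branch disabled, correctness is preserved but every call pays the slower Python/PyTorch path, which matches the note above about validation getting slower.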
@nakul3112 Do you ever get any detection results at any time during training? Could it be diverging?
I got them sometimes, meaning a table of mAP and mAR values. However, training in the 20.03 container with --fine-tune worked; I was able to train the model. Now my concern is: how do I tell from the metrics themselves whether the model will perform accurately on test images?
The mAP and AR tables shown in the console reflect the performance of your model on the validation set you provided via --val-images and --val-annotations. If your test images don't have annotations, you can run inference with your trained model on the images and visualize the results to see how well it performs.
@ttamyurek Thanks for the help. I was able to visualize the results on test images. Got good results.
However, in some images a single bounding box is placed over a cluster of identical objects, whereas I would want bounding boxes over the individual objects. I think I need to retrain with other parameters to improve the metrics and then visualize again.
Please let me know if there is something else I could do, other than retraining.
Use a score threshold when visualizing the detections, for example 0.5.
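In code terms that just means dropping low-confidence boxes before drawing them. A minimal sketch, assuming each detection carries a score field (the dict keys here are illustrative, not ODTK's output format):

```python
def filter_detections(detections, score_threshold=0.5):
    """Keep only detections at or above the confidence threshold."""
    return [d for d in detections if d["score"] >= score_threshold]

# Example: two overlapping boxes for the same object, one low-confidence.
detections = [
    {"bbox": [10, 10, 50, 50], "score": 0.92, "class": 1},
    {"bbox": [12, 11, 49, 48], "score": 0.31, "class": 1},
]
confident = filter_detections(detections, score_threshold=0.5)
```

Raising the threshold trades recall for precision, so it is worth sweeping a few values against your validation set before settling on one for visualization.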
Managed to get detections by changing the Docker image's PyTorch to 20.07-py3.
Does anyone have an idea whether it is better to train the model with .jpg or .png images? Also, is it recommended to use the same image size at inference time as during training?
@xonobo I tried with 20.07 and couldn't get detections during training. Can anyone else confirm that using 20.07 resolves this issue?
@nakul3112 the COCO evaluation metrics are described here: https://cocodataset.org/#detection-eval
As for what mAP you should be getting, it very much depends on your problem. Generally AP (the first number) should be above 30%. When I'm working with remote sensing data I normally use AP at IoU 0.5 (the second number), which I hope to see above 75%.
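For reference, AP at a fixed IoU threshold is derived from the precision/recall curve over score-ranked detections. A bare-bones sketch of that bookkeeping, assuming detections have already been matched against ground truth at the chosen IoU threshold (this is a simplification of what pycocotools does, not its actual code):

```python
def precision_recall(matches, num_gt):
    """Compute (precision, recall) points from score-ranked detections.

    matches: list of (score, is_true_positive) pairs for every detection.
    num_gt: total number of ground-truth boxes.
    AP is (roughly) the area under the curve these points trace out.
    """
    matches = sorted(matches, key=lambda m: m[0], reverse=True)
    tp = fp = 0
    points = []
    for _, is_tp in matches:
        if is_tp:
            tp += 1
        else:
            fp += 1
        points.append((tp / (tp + fp), tp / num_gt))  # (precision, recall)
    return points
```

COCO's headline AP additionally averages over IoU thresholds from 0.5 to 0.95, which is why the first number is usually much lower than AP at IoU 0.5 alone.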
Was the "No Detections" issue ever resolved? I'm having this issue with odtk:latest.
@rbgreenway please see this comment: https://github.com/NVIDIA/retinanet-examples/issues/219#issuecomment-668883353
I tested the master branch with pytorch:20.09-py3 and the issue remains.
I used this:
odtk train retinanet_rn18fpn.pth --backbone ResNet18FPN \
--batch 1 \
--jitter 256 256 \
--resize 256 \
--val-iters 2000 \
--images /data/datasets/coco/images/train2017/ --annotations /data/datasets/coco/annotations/instances_train2017.json \
--val-images /data/datasets/coco/images/val2017/ --val-annotations /data/datasets/coco/annotations/instances_val2017.json
@yashnv is there any progress on this? Are you still investigating the issue?
This issue results from CUB; it shows up as invalid device ordinals. I still can't find a minimal reproducible example.
Closing due to inactivity
Fixed in master, issue was due to this: https://github.com/pytorch/pytorch/issues/52663
Hi,
I'm getting the "No detections!" output message for the validation step while training a retinanet model. This problem started to occur after I updated to v0.2.5. It was working fine when I was using v0.2.3.
I'm training retinanet with my own custom dataset by fine-tuning on the COCO checkpoints provided with the 19.04 release. I was using v0.2.3 and later wanted to update to v0.2.5 and retrain the model to see if there would be any improvement. But with all of the parameters the same, I'm getting the "No detections!" message. I tried both the RN50FPN and MobileNet backbones and nothing changed.
Other things I've tried:
Command I used to run training (the only difference is the train/val datasets):
For inference:
odtk infer retinanet_ResNet50FPN_COCO_test.pth --images=/datasets/COCO/val2017 --annotations=/datasets/COCO/annotations/instances_val2017.json
For the Docker setup I simply followed the instructions in the README file. HW: NVIDIA GeForce RTX 2080