facebookresearch / detr

End-to-End Object Detection with Transformers
Apache License 2.0

mAP decreases with training loss on a custom dataset #148

Closed: zlyin closed this issue 4 years ago

zlyin commented 4 years ago

Instructions To Reproduce the Issue:

Hi there, thank you for making this architecture available! I'm trying to use it on my custom dataset but ran into a weird situation, as described in the title. If you happen to have come across the same issue, could you let me know the likely reason? Many thanks in advance!

  1. what changes you made (git diff) or what code you wrote

    • I didn't modify the code in the repository; instead I implemented a training script to fine-tune the pretrained weights on my own dataset. My training set has 2700 images and my validation set has 675 images, and the dataset has only 1 object class. (A simplified sketch of this kind of setup is included after this list.)
    • I attached my script here: train_wheat.txt
  2. what exact command you run:

    python3.6 train_wheat.py --epochs 100 --batch_size 2 --image_size 1024 --name detr_test1 --device 0
  3. what you observed (including full logs):

    • The training loss and validation loss decrease normally; however, the mAP drops along with them, which I can't explain. (Screenshot attached: loss and mAP curves, 2020-07-12.)
  4. please simplify the steps as much as possible so they do not require additional resources to run, such as a private dataset.
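To give a sense of the setup without the private dataset, here is a simplified sketch of this kind of fine-tuning (following the checkpoint approach discussed in #9; the actual details are in the attached script, and the num_classes value is only an illustration):

    import torch

    # Build DETR-R50 with a class head sized for the custom dataset
    # (see #9 for how to pick num_classes; 2 assumes one object class with id 1).
    model = torch.hub.load("facebookresearch/detr", "detr_resnet50",
                           pretrained=False, num_classes=2)

    # Load the COCO-pretrained weights, dropping the classification head
    # so its shape mismatch does not prevent loading the rest.
    checkpoint = torch.hub.load_state_dict_from_url(
        "https://dl.fbaipublicfiles.com/detr/detr-r50-e632da11.pth",
        map_location="cpu", check_hash=True)
    del checkpoint["model"]["class_embed.weight"]
    del checkpoint["model"]["class_embed.bias"]
    model.load_state_dict(checkpoint["model"], strict=False)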

Expected behavior:

Environment:

Provide your environment information using the following command:

Collecting environment information...
PyTorch version: 1.5.1+cu101
Is debug build: No
CUDA used to build PyTorch: 10.1

OS: Ubuntu 16.04.6 LTS
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609
CMake version: version 3.17.3

Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: 10.0.130
GPU models and configuration: 
GPU 0: GeForce GTX 1080 Ti
GPU 1: GeForce RTX 2080

Nvidia driver version: 418.88
cuDNN version: /usr/local/cuda-10.0/lib64/libcudnn.so.7.6.5

Versions of relevant libraries:
[pip3] numpy==1.17.0
[pip3] pytorch-ignite==0.3.0
[pip3] torch==1.5.1+cu101
[pip3] torchvision==0.6.1+cu101
fmassa commented 4 years ago

Hi,

First thing I would check is the learning rate -- maybe it is too high for your fine-tuning.
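In case it helps, this is roughly how the reference script (main.py) splits the learning rates, with the backbone trained at a lower rate than the transformer; `model` is assumed to be the DETR model, and the concrete numbers below are only an illustration of smaller values one might try for fine-tuning, not official recommendations:

    import torch

    # Parameter split as in DETR's main.py: backbone vs. everything else.
    # main.py defaults are lr=1e-4 and lr_backbone=1e-5; for fine-tuning you
    # may want to try something smaller, e.g. the illustrative values below.
    param_dicts = [
        {"params": [p for n, p in model.named_parameters()
                    if "backbone" not in n and p.requires_grad]},
        {"params": [p for n, p in model.named_parameters()
                    if "backbone" in n and p.requires_grad],
         "lr": 1e-6},
    ]
    optimizer = torch.optim.AdamW(param_dicts, lr=1e-5, weight_decay=1e-4)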

Apart from that, I would encourage checking the discussion in https://github.com/facebookresearch/detr/issues/9 and https://github.com/facebookresearch/detr/issues/125 for some issues that can bring more insights into where the problem might be.

alcinos commented 4 years ago

I would double check if one of your validation losses is going up - that would hint at overfitting
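If the run writes a DETR-style log.txt (one JSON dict per epoch with train_*/test_* keys, as main.py does), something like the following sketch will plot the individual train/validation loss curves side by side; a custom training script may of course log differently:

    import json
    from pathlib import Path
    import matplotlib.pyplot as plt

    # One JSON dict per epoch, with keys like train_loss_ce / test_loss_ce, etc.
    logs = [json.loads(line)
            for line in Path("output/log.txt").read_text().splitlines()
            if line.strip()]

    for key in ("loss_ce", "loss_bbox", "loss_giou"):
        plt.figure()
        plt.plot([epoch[f"train_{key}"] for epoch in logs], label=f"train_{key}")
        plt.plot([epoch[f"test_{key}"] for epoch in logs], label=f"test_{key}")
        plt.xlabel("epoch")
        plt.legend()
    plt.show()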

zlyin commented 4 years ago

Hi @fmassa @alcinos, thank you for your reply! I have actually already gone through issues #9 and #125 to reach my current progress. I'll go through them again to see if I can extract more insights.

As for your suggestions,

  1. I've tried LR=2e-5, which I would normally consider small enough, but I'll try a lower value and see the result.
  2. I haven't plotted the losses out separately yet. I'll do that and come back with an updated plot for your information. By the way, if I only use the model for the object detection task, what should the weight of each loss be? Currently I have {loss_ce: 0.5, loss_bbox: 1, loss_giou: 1}; is this OK to use?

Thank you very much!

fmassa commented 4 years ago

I haven't plotted the losses out separately yet. I'll do that and come back with an updated plot for your information. By the way, if I only use the model for the object detection task, what should the weight of each loss be? Currently I have {loss_ce: 0.5, loss_bbox: 1, loss_giou: 1}; is this OK to use?

I believe it would be preferable to use the default values for the loss coefficients, as changing them while using a pre-trained model might be suboptimal.
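For reference, the defaults at the time this was written come from main.py (--bbox_loss_coef 5, --giou_loss_coef 2) together with models/detr.py, which fixes the classification weight at 1; a sketch of the resulting weight dict, treating these numbers as the values to restore rather than new suggestions:

    # Repository defaults: loss_ce weight is fixed at 1 in models/detr.py,
    # bbox_loss_coef=5 and giou_loss_coef=2 are the main.py argument defaults.
    # (eos_coef=0.1 additionally down-weights the no-object class inside the criterion.)
    weight_dict = {
        "loss_ce": 1,
        "loss_bbox": 5,
        "loss_giou": 2,
    }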

zlyin commented 4 years ago

Hi @fmassa @alcinos, I posted an update in the #9 thread as you suggested. You can find my results there. Thank you very much for your help!