
Long-Term Feature Banks for Detailed Video Understanding

Low training loss yields worse result on validation set #43

Closed taosean closed 4 years ago

taosean commented 4 years ago

Hi, I'm confused about an experiment result I got.

The situation is this: I trained a new detector starting from the detection model you provided, changing only the batch size (larger). We compared our detector with the one this repo provides and found that ours achieves higher mAP and mAR on the validation set. For validation keyframes that have no labeled boxes (referred to as BG images), our detector also produces fewer boxes than the provided model. So judging by the detector's own metrics, I believe we got a better detector.

Using the detector we trained ourselves, we followed the steps described in the paper:

  1. Run detection on the training set to obtain boxes.
  2. Compute IoU between the detected boxes and the GT boxes (v2.2), and assign the labels of a GT box to detected boxes whose IoU with it is no less than 0.6 (the paper uses 0.9, but with 0.9 we got far fewer records than the provided ava_train_predicted_boxes.csv, while 0.6 gave a similar number); a sketch of this step follows the list.
  3. Use the detected boxes (with assigned action labels) together with the GT boxes (v2.2) to train a new baseline model (ava_r101_baseline). Training is configured with batch size 32 (16 in the provided model), initial learning rate 0.08 (0.04 in the provided model), and STEP_SIZES: [110000, 20000, 10000]; all other parameters are kept unchanged.
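In case it helps others reproduce step 2, here is a minimal NumPy sketch of the IoU-based label assignment (the function names are mine, not the repo's actual preprocessing code):

```python
import numpy as np

def iou_matrix(det, gt):
    """Pairwise IoU between det (N, 4) and gt (M, 4) boxes in (x1, y1, x2, y2)."""
    x1 = np.maximum(det[:, None, 0], gt[None, :, 0])
    y1 = np.maximum(det[:, None, 1], gt[None, :, 1])
    x2 = np.minimum(det[:, None, 2], gt[None, :, 2])
    y2 = np.minimum(det[:, None, 3], gt[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_det = (det[:, 2] - det[:, 0]) * (det[:, 3] - det[:, 1])
    area_gt = (gt[:, 2] - gt[:, 0]) * (gt[:, 3] - gt[:, 1])
    union = area_det[:, None] + area_gt[None, :] - inter
    return inter / np.maximum(union, 1e-10)

def assign_labels(det_boxes, gt_boxes, gt_labels, thresh=0.6):
    """Copy the labels of the best-matching GT box to each detected box
    whose IoU clears the threshold; unmatched detections get no label."""
    iou = iou_matrix(det_boxes, gt_boxes)
    best_gt = iou.argmax(axis=1)
    keep = iou.max(axis=1) >= thresh
    return [(gt_labels[g] if k else None) for g, k in zip(best_gt, keep)]
```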

We compared our training loss with the one in the provided 102760714.log; the curves are shown below. [figure: baseline_loss]

As you can see, our training loss is lower than the provided model.

However, we evaluated our model on validation set, the mAP we got is not as good as the provided model.

| mAP @ iteration | 60k | 80k | 100k | 120k | 140k |
|---|---|---|---|---|---|
| ours | 0.2258 | 0.2238 | 0.2178 | 0.2270 | 0.2288 |

102760714.log (provided model): 0.2318

I cannot explain this result well. Is it due to overfitting, or to a difference in distribution between the training and validation sets? Do you have any tips on training the detector, the baseline model, and the LFB model?

Could you share your insights if you have any?

Best regards.

chaoyuaw commented 4 years ago

Hi @taosean, one thing I noticed is that your training schedule is effectively about 2x longer than ours. According to the linear scaling rule, since you're using 2x the batch size and 2x the learning rate, the corresponding training schedule should be 0.5x: namely STEP_SIZES: [50000, 10000, 10000] and MAX_ITER: 70000. It might be worth trying that to see if it leads to a different result.
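For concreteness, a tiny sketch of the arithmetic behind the linear scaling rule (illustrative names only; the base schedule [100000, 20000, 20000] is inferred from the suggested halved values, not read from the repo's config):

```python
def scale_schedule(base_lr, base_bs, base_steps, base_max_iter, new_bs):
    """Scale the LR up and the iteration schedule down when the batch size grows."""
    k = float(new_bs) / base_bs                  # e.g. 32 / 16 = 2
    return (base_lr * k,                         # LR scales up:   0.04 -> 0.08
            [int(s / k) for s in base_steps],    # steps scale down by 1/k
            int(base_max_iter / k))              # total iters scale down by 1/k

lr, steps, max_iter = scale_schedule(0.04, 16, [100000, 20000, 20000], 140000, 32)
# -> lr = 0.08, steps = [50000, 10000, 10000], max_iter = 70000
```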

chaoyuaw commented 4 years ago

Looking at the training curves you provided, the blue curve at iteration 4000 roughly reaches the loss that the orange curve reaches at iteration 8000, which seems reasonable. (I suggest scaling the x-axis of the blue curve by 2 for a clearer comparison, since that run effectively trains 2x faster due to the doubled batch size and learning rate.)
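For example, a quick matplotlib sketch of this rescaling, with toy placeholder curves standing in for losses parsed from the two logs:

```python
import numpy as np
import matplotlib.pyplot as plt

# Toy placeholder curves (not real data) standing in for the two training logs.
iters16 = np.arange(0, 140001, 1000)
iters32 = np.arange(0, 70001, 1000)
loss16 = 2.0 * np.exp(-iters16 / 40000.0) + 0.5
loss32 = 2.0 * np.exp(-iters32 / 20000.0) + 0.5   # converges ~2x faster per iter

plt.plot(iters16, loss16, label="BS=16 (orange curve)")
# Multiply iterations by 2: each BS=32 iteration consumes 2x the samples.
plt.plot(iters32 * 2, loss32, "--", label="BS=32, x-axis scaled by 2 (blue curve)")
plt.xlabel("equivalent BS=16 iterations")
plt.ylabel("training loss")
plt.legend()
plt.show()
```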

taosean commented 4 years ago

Thank you @chaoyuaw, I tried it and it does help! It looks like the problem was indeed overfitting. Thank you very much.