davisking / dlib

A toolkit for making real world machine learning and data analysis applications in C++
http://dlib.net
Boost Software License 1.0

My experiences working with the MMOD loss #1928

Closed · arrufat closed this issue 4 years ago

arrufat commented 4 years ago

Lately, I've been training some object detectors using the MMOD loss layer and I have seen that many people have trouble getting it to work. I've struggled in some cases, so I am going to share my train of thought on how I build an object detector using this layer in dlib.

The first thing that I do, of course, is to define what kind of object I want to detect and under what circumstances. Then, I gather some meaningful example images that I want my detector to work on.

After that, I usually start with a very simple network, like the one defined in the car detector examples. Of course, the architecture sometimes needs to be adapted to my particular case. In my last project, I was in charge of creating a person detector. The main difference from the car detector is that people's aspect ratio is very different from that of cars. In my case, I can actually see people clearly in images as long as they are bigger than 140 x 40 pixels, which means I need to define a network that can cope with this target size, so I defined this network:

namespace person_detector
{
    using dlib::loss_mmod, dlib::con, dlib::relu, dlib::bn_con, dlib::affine,
          dlib::input_rgb_image_pyramid, dlib::pyramid_down;
    // a relu + BN + 5x5 convolutional block with downsampling
    template<long num_filters, template<typename> class BN, typename SUBNET>
    using rcon5d = relu<BN<con<num_filters, 5, 5, 2, 2, SUBNET>>>;
    // a relu + BN + 5x5 convolutional block without downsampling
    template<long num_filters, template<typename> class BN, typename SUBNET>
    using rcon5 = relu<BN<con<num_filters, 5, 5, 1, 1, SUBNET>>>;
    // a block that downsamples the image 4 times (factor of 16)
    template<template<typename> class BN, typename SUBNET>
    using downsampler = rcon5d<32, BN, rcon5d<32, BN, rcon5d<32, BN, rcon5d<16, BN, SUBNET>>>>;
    // the actual detector
    template<template<typename> class BN>
    using net_type = loss_mmod<
        con<1, 9, 9, 1, 1,
        rcon5<55, BN, rcon5<55, BN, rcon5<55, BN,
        downsampler<BN,
        input_rgb_image_pyramid<pyramid_down<6>>
    >>>>>>;
    using train = net_type<bn_con>;
    using infer = net_type<affine>;
}

This network downsamples the image by a factor of 16. This means that when the filter in the last layer is scanning the image, each pixel it moves in that feature space corresponds to 16 pixels in input image space. Since the minimum dimension of the object is around 40 pixels, the network will scan the whole image and "see" my objects, so that's fine. Also, the size of that final window is 9 x 9, which means it covers an area of (9x16) x (9x16) = 144 x 144 pixels, so my objects fit inside the sliding window.
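For context, here is a minimal sketch of how the loss options and network could be set up for this 140 x 40 target: dlib's mmod_options constructor picks the detector windows from the labeled boxes, and passing 140 and 40 produces windows like the ones in the trainer printout further down. The XML file name is just a placeholder, not my actual file, and this is not my exact training code.

// Sketch: build the MMOD options from the labeled boxes so dlib picks detector
// windows around 140 pixels on the long side and treats anything smaller than
// about 40 pixels as too small. Assumes the person_detector namespace above is
// in scope; "persons_train.xml" is a placeholder file name.
#include <dlib/dnn.h>
#include <dlib/data_io.h>
#include <iostream>

int main()
{
    std::vector<dlib::matrix<dlib::rgb_pixel>> images_train;
    std::vector<std::vector<dlib::mmod_rect>> boxes_train;
    dlib::load_image_dataset(images_train, boxes_train, "persons_train.xml");

    const dlib::mmod_options options(boxes_train, 140, 40);
    person_detector::train net(options);   // the training version defined above
    std::cout << net << std::endl;         // print the layers as a sanity check
}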

So far so good. Then I annotate some images (in my case, 1000 images) and train the network using the cropper with the same settings as in the cars examples. I even print the dnn_trainer to check that the detector windows make sense, and the random_cropper to make sure it's working as I expect (a sketch of the cropper setup follows the printout):

dnn_trainer details: 
  net_type::num_layers:  24
  net size: 0.00222778MB
  net architecture hash: 7c4a1a7ef502c84d406d863b6fef505a
  loss: loss_mmod    (detector_windows:(person:68x140,121x140,39x140,140x91,140x60,140x42), loss per FA:1, loss per miss:1, truth match IOU thresh:0.5, use_bounding_box_regression:false, overlaps_nms:(0.512446,0.874268), overlaps_ignore:(0.5,0.95))
  synchronization file:                       fashion_mmod_sync
  trainer.get_solvers()[0]:                   sgd: weight_decay=0.0001, momentum=0.9
  learning rate:                              1e-06
  learning rate shrink factor:                0.1
  min learning rate:                          1e-05
  iterations without progress threshold:      50000
  test iterations without progress threshold: 1000
random_cropper details: 
  chip_dims.rows:              350
  chip_dims.cols:              350
  randomly_flip:               true
  max_rotation_degrees:        5
  min_object_length_long_dim:  138
  min_object_length_short_dim: 37
  max_object_size:             0.7
  background_crops_fraction:   0.5
  translate_amount:            0.1
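For those curious, the cropper settings in that printout map onto dlib's random_cropper setters roughly like this; the values are taken straight from the details block, but treat the snippet as a sketch rather than my exact training code.

// Sketch of a random_cropper configured to reproduce the details block above.
#include <dlib/image_transforms/random_cropper.h>
#include <iostream>

int main()
{
    dlib::random_cropper cropper;
    cropper.set_chip_dims(350, 350);           // the crop size discussed below
    cropper.set_min_object_size(138, 37);      // long and short dimension of the smallest kept objects
    cropper.set_max_object_size(0.7);
    cropper.set_background_crops_fraction(0.5);
    cropper.set_max_rotation_degrees(5);
    cropper.set_randomly_flip(true);
    cropper.set_translate_amount(0.1);
    std::cout << cropper << std::endl;         // prints the same details block as above
}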

To my surprise, it didn't work. I couldn't get the loss to go below 1.20... I thought that maybe my network wasn't powerful enough for this kind of task, so I tried deeper networks and residual blocks, and still no luck.

Then, at some point, I wanted to increase the batch size, so I changed the size of the crops from 350 x 350 to 224 x 224. But I forgot to increase the batch size, so the only change was the crop size. Now the simple network defined above converged, and I got an object detector that achieved a loss of around 0.07 on the training set (800 images) and 0.77 on the test set (200 images):

  set |     prec.   recall      mAP
------+----------------------------
train |   0.94968 0.677546 0.673228 
 test |  0.751381 0.346939 0.324793 
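Numbers like these can be obtained with dlib's test_object_detection_function, which returns precision, recall, and average precision for a detector over a labeled set. A sketch, with placeholder file names, assuming the person_detector namespace from the first snippet is in scope:

// Evaluate a trained MMOD detector on a labeled dataset.
#include <dlib/dnn.h>
#include <dlib/data_io.h>
#include <iostream>

int main()
{
    person_detector::infer net;
    dlib::deserialize("person_detector.dnn") >> net;    // hypothetical model file

    std::vector<dlib::matrix<dlib::rgb_pixel>> images;
    std::vector<std::vector<dlib::mmod_rect>> boxes;
    dlib::load_image_dataset(images, boxes, "persons_test.xml");  // hypothetical file

    // Prints a 1x3 matrix: precision, recall, and average precision.
    std::cout << "test: " << dlib::test_object_detection_function(net, images, boxes) << std::endl;
}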

Of course, I need some more images so that the network generalizes better, but at least it's working.

Anyway, sorry for this long post, but I felt that explaining why I changed the network the way I did might help people who just blindly take the code from the examples and expect it to work on their data set.

Also, I still can't understand why the crop size has such an impact on the network's performance, to the point that it can make or break the training. I am posting this because it's the second time I've seen this kind of behavior when using the MMOD loss in a neural network.

So, if someone has some ideas on why it wasn't working with crops of size 350 x 350 but worked with crops of 224 x 224, I would really be interested :)

As always, thanks for releasing such an awesome library! It is really a pleasure to use!

arrufat commented 4 years ago

Update: I've run another test and tried to be more patient during training. It turns out that using 350 x 350 crops kind of works, just not nearly as well as 224 x 224. Usually I stop after the training loss hasn't improved for 10000 steps, so I increased that limit (and got several "loss not decreasing, reloading from previous state" messages along the way, though).
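For reference, that patience limit lives on the dnn_trainer; a sketch of the relevant calls, using the values from the trainer printout in my first post (the net is default-constructed just to keep the snippet short; in practice it is built with mmod_options as shown earlier):

// Raise the trainer's "iterations without progress" thresholds.
#include <dlib/dnn.h>
#include <iostream>

int main()
{
    person_detector::train net;   // assumes the namespace from the first post

    dlib::dnn_trainer<person_detector::train> trainer(net);
    trainer.set_learning_rate(0.1);
    trainer.set_min_learning_rate(1e-5);
    trainer.set_iterations_without_progress_threshold(50000);       // the limit I increased
    trainer.set_test_iterations_without_progress_threshold(1000);
    trainer.be_verbose();
    std::cout << trainer << std::endl;   // prints a "dnn_trainer details" block like the one above
}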

For those who are curious, below I show the loss plot of both training experiments, where the only change is the size of the random cropper:

What's interesting about the last plot is that the training loss dramatically decreases after 40000 steps, but this was not due to a decrease in the learning rate, which still had its initial value of 0.1.

As you can see, the behavior of the network is completely different, so again, if somebody has any insights on this, I will be glad to discuss them :)

Bonus: here's the gnuplot script to plot the curves directly from the training log:

#!/usr/bin/env gnuplot

# set term wxt 1 noraise title 'MMOD training'
set term png size 800,600
set output 'loss.png'
set grid
set style line 1 lc rgb '#C00000' lt 1 lw 2 pt 7 pi -1 ps 0.5
set style line 2 lc rgb '#0000C0' lt 1 lw 2 pt 7 pi -1 ps 0.5
set yrange [0:2]
# set pointintervalbox 1 # change lines to linespoints below to draw overlapping points
# columns in the trainer log: 2 = step number, 8 = train loss, 11 = test loss
plot 'training.log' every ::1 using 2:8 title 'train loss' w lines ls 1, \
     'training.log' every ::1 using 2:11 title 'valid loss' w lines ls 2
pause 10; refresh; reread;  # keep regenerating the plot while training is still running

Then just run the trainer as:

./train_ex | tee -a training.log

and run

./plot.gp
davisking commented 4 years ago

Yeah, this is a thing that happens. The thing to remember is that the loss is given an image crop and told "this image has 1 or 2 target objects, find them; it also has a huge number of negative positions, don't detect those!". A 350x350 image has about 478 possible positions for objects (after 16x downsampling). The image pyramid blows this up to something like 2800 positions. So you have a situation where every training chip has 1 or 2 positive objects and on the order of 2800 negative positions. The bigger the crops, the bigger the imbalance, and the more tempted the loss will be to learn "nothing is ever a positive object".
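To make that imbalance concrete, here is a rough back-of-the-envelope comparison of the two crop sizes; the pyramid multiplier is just the ratio implied by the 478 and 2800 figures above, not an exact dlib value, so only the orders of magnitude matter.

// Rough count of candidate positions per crop size, assuming 16x downsampling.
#include <cstdio>

int main()
{
    const double downsampling = 16.0;
    const double pyramid_factor = 2800.0 / 478.0;   // roughly 5.9x, inferred from the numbers above

    const int crops[] = {224, 350};
    for (const int crop : crops)
    {
        const double base = (crop / downsampling) * (crop / downsampling);
        std::printf("%dx%d crop: ~%.0f positions at the original scale, ~%.0f across the pyramid\n",
                    crop, crop, base, base * pyramid_factor);
    }
    // With only 1 or 2 positives per crop, 350x350 crops give the loss roughly
    // 2.4x more negative positions to push down than 224x224 crops do.
}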

This is harder in some problems than others. If the things you want to detect are visually distinct from background clutter, it's much less of an issue. But if there are lots of little objects in the background that sort of look like people, then the loss will be sorely tempted to improve on all these hard negatives by never detecting anything. And that is true here: pedestrian detection in images is hard, since there are lots of things that, when blurry, look confusingly like people; or that actually are pictures of people but don't count for what you want to detect; or you have a crowd and there are unlabeled people in the background, i.e. labeling errors.

So what can happen is the trainer spends a long time keeping the false alarms down until it eventually finds some parameter settings that let it grab onto the positives without also detecting a whole bunch of negatives. Once it gets into that part of the parameter space, you see a sudden drop in loss as it's able to make progress.

So not using crops that are too big can help a lot. Also, running the detector on your training data, looking at the false positives, and trying really hard to improve the labeling is always a huge deal. It's usually especially useful to put ignore boxes on ambiguous cases.
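A sketch of that review loop might look like the following: draw the detections on each training image so false alarms and candidate ignore boxes are easy to spot. The model and dataset file names are placeholders, and it assumes the person_detector namespace from the first post is in scope.

// Visually inspect detections on the training images.
#include <dlib/dnn.h>
#include <dlib/data_io.h>
#include <dlib/gui_widgets.h>
#include <iostream>

int main()
{
    person_detector::infer net;
    dlib::deserialize("person_detector.dnn") >> net;    // hypothetical model file

    std::vector<dlib::matrix<dlib::rgb_pixel>> images;
    std::vector<std::vector<dlib::mmod_rect>> boxes;
    dlib::load_image_dataset(images, boxes, "persons_train.xml");  // hypothetical file

    dlib::image_window win;
    for (const auto& img : images)
    {
        const auto dets = net(img);   // std::vector<mmod_rect> of detections
        win.clear_overlay();
        win.set_image(img);
        for (const auto& d : dets)
            win.add_overlay(d.rect, dlib::rgb_pixel(255, 0, 0), d.label);
        std::cin.get();               // press enter to advance to the next image
    }
}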

arrufat commented 4 years ago

Oh, that makes a lot of sense. Thank you very much for the detailed explanation.

And you guessed right: there are many blurry people in the background and in crowds, but I am labeling the data myself and I am already a bit acquainted with the MMOD loss, so I make good use of the ignore boxes :)

davisking commented 4 years ago

Yeah, I expect you know about the ignore boxes :)

But still, it's hard to label datasets consistently and well. I've labeled a huge amount of data in my career, and I still never get it right on the first or even second run through a dataset. I find I always need to train a model, and then look at the outputs of that model on the dataset. I always find there are persistent errors I make in labeling. Something that is particularly useful is to split your dataset in half, train on the first half, then run the model on the second half and look at the things it misses, and the false alarms. There is generally some pattern. Either things you missed, or should put ignore boxes on, or things you labeled that are just unreasonable (e.g. super blurry people that are indistinguishable from background clutter). Really really good dataset curation is usually the secret of a high quality model.

arrufat commented 4 years ago

Thank you for your advice, I really appreciate you spending the time to explain this kind of know-how. I have the feeling that there are a lot of hours of experience behind every word you say.

That is more or less what I am doing: I label a few images, train on those, and while it's training I keep labeling and checking the results of the model on the newly labeled images. But I guess I still need a few more iterations of looking thoroughly at the data, as you say, to find unreasonable labels and error patterns.