davisking / dlib

A toolkit for making real world machine learning and data analysis applications in C++
http://dlib.net
Boost Software License 1.0

Oriented bounding boxes for training on tilted objects? #232

Closed grisevg closed 8 years ago

grisevg commented 8 years ago

Do you think it's a good idea to add support for oriented bounding boxes?

It shouldn't be hard to add the ability to rotate boxes in imglab; I can probably do a simple PR.

I know that HOG/DPM algorithms don't work on oriented boxes, but those boxes could be extracted into chips/an atlas with imglab and then used for training.

davisking commented 8 years ago

Not really. What are you going to do with oriented bounding boxes? If you want to do annotations that are more complex than a box I would add landmarks to the objects.

grisevg commented 8 years ago

OBBs could be extracted into chips/an atlas (rotate and crop); that would get rid of the tilting, and the chips could be used for training.
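
Something like this, using dlib's extract_image_chip, which already takes a rotation angle through chip_details (the file names, box coordinates, and angle below are made up):

#include <dlib/image_io.h>
#include <dlib/image_transforms.h>

using namespace dlib;

int main()
{
    array2d<rgb_pixel> img;
    load_image(img, "input.jpg");        // hypothetical input image

    // A tilted object: an axis-aligned rectangle plus a rotation in radians.
    drectangle box(100, 100, 220, 260);  // made-up object location
    double angle = 0.3;                  // made-up tilt; check the sign convention in chip_details' docs

    // chip_details bundles the rectangle, the output chip size, and the angle,
    // so extract_image_chip() does the rotate-and-crop in one step.
    chip_details details(box, chip_dims(80, 80), angle);
    array2d<rgb_pixel> chip;
    extract_image_chip(img, details, chip);

    save_jpeg(chip, "chip.jpg");
}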

alireza7991 commented 8 years ago

I think OBBs, landmarks (and maybe lines?) would be good additions to imglab. There are lots of complex models that cannot be described well by simple rectangles.

davisking commented 8 years ago

imglab already supports landmarks. The GUI for it is admittedly not the best it could be; it's alright, but could be improved. I don't think you need oriented bounding boxes if you have landmarks, though.
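
For example, the landmark ("part") annotations imglab writes into its XML can be read back like this ("mydataset.xml" is a hypothetical file):

#include <dlib/data_io.h>
#include <iostream>

int main()
{
    dlib::image_dataset_metadata::dataset data;
    dlib::image_dataset_metadata::load_image_dataset_metadata(data, "mydataset.xml");

    // Each box carries a std::map<std::string, point> of named landmarks.
    for (const auto& img : data.images)
        for (const auto& box : img.boxes)
            for (const auto& part : box.parts)
                std::cout << part.first << " " << part.second << std::endl;
}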

davisking commented 8 years ago

Also, what kind of thing are you training where you think it's a good idea to crop out and normalize all the objects? I'm not sure that's very advisable.

grisevg commented 8 years ago

I thought it could be good for tilted objects in training sets. For example, even though human faces are nearly always vertical, dog faces are very often tilted. I could be completely wrong here though - is making an atlas of lots of faces and using it as positive examples for an object detection algorithm (with lots of other images as negative examples) a bad idea?

Is there a better approach for dealing with tilted faces? Keep only one face per image and rotate the image, or just cluster the training set by tilt angle?

davisking commented 8 years ago

Are you going to rotate the images at test time at multiple angles and run the detector on each?

grisevg commented 8 years ago

Yeah, I imagine that's the common approach, right?

grisevg commented 8 years ago

Or is that too slow, and is it much better to have more detectors? EDIT: I do also have landmarks, so I could extract the tilt angle from the two eye landmarks, for example.
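
For example, a small helper like this (hypothetical; the eye points would come from the landmark annotations):

#include <dlib/geometry.h>
#include <cmath>

// In-plane tilt: the angle of the eye-to-eye vector, in radians.
double tilt_angle(const dlib::point& left_eye, const dlib::point& right_eye)
{
    return std::atan2(double(right_eye.y() - left_eye.y()),
                      double(right_eye.x() - left_eye.x()));
}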

davisking commented 8 years ago

I would make multiple detectors.

davisking commented 8 years ago

I would also make a bigger dataset full of image flips and rotations. Then I would run imglab --cluster on it to get groups of coherent poses to train detectors on.
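
Roughly like this, with dlib's own helpers ("faces.xml" is a made-up dataset file):

#include <dlib/array.h>
#include <dlib/array2d.h>
#include <dlib/data_io.h>
#include <dlib/image_transforms.h>

using namespace dlib;

int main()
{
    array<array2d<unsigned char>> images;
    std::vector<std::vector<rectangle>> boxes;
    load_image_dataset(images, boxes, "faces.xml");  // hypothetical imglab dataset

    // Double the dataset with mirrored copies (the boxes are adjusted too).
    add_image_left_right_flips(images, boxes);

    // Add rotated copies; the angles are in radians.
    const std::vector<double> angles = {-0.3, 0.3};
    add_image_rotations(angles, images, boxes);

    // The augmented set can then be written back out and grouped into
    // coherent poses with `imglab --cluster`, one detector per group.
}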

You could also use a CNN (see https://github.com/davisking/dlib/blob/master/examples/dnn_mmod_ex.cpp). That doesn't require any pose clustering since CNNs are able to deal with all this stuff internally.

grisevg commented 8 years ago

What are the pros and cons of the HOG/DPM and CNN approaches for object detection? I imagine DPM is way faster at runtime?

alireza7991 commented 8 years ago

Isn't it better to run many detectors with different angles in parallel? That would benefit from multi-core processors without reducing performance, and it could also find a rotated box that fits the tilted object better. A CNN can be rotation invariant, but the result would not be a good rotated box that fits the true object correctly; it would predict a bigger rectangle that ignores rotation.

davisking commented 8 years ago

HOG is faster when running on the CPU, but the CNN is way more powerful.

If you run HOG on rotated images you need to recompute the entire HOG pyramid for each image. If you run multiple detectors on one image you don't. So it's way faster to not rotate.
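
That's what dlib's evaluate_detectors() is for; a rough sketch (the .svm model files are hypothetical):

#include <dlib/image_io.h>
#include <dlib/image_processing.h>
#include <dlib/serialize.h>

using namespace dlib;

int main()
{
    typedef scan_fhog_pyramid<pyramid_down<6>> image_scanner_type;

    // Two hypothetical detectors trained on different pose clusters.
    object_detector<image_scanner_type> det_upright, det_tilted;
    deserialize("face_upright.svm") >> det_upright;
    deserialize("face_tilted.svm") >> det_tilted;

    array2d<unsigned char> img;
    load_image(img, "test.jpg");

    // evaluate_detectors() computes the HOG pyramid once and runs every
    // detector over it, so extra detectors are much cheaper than
    // re-rotating the image and rebuilding the pyramid each time.
    std::vector<object_detector<image_scanner_type>> detectors = {det_upright, det_tilted};
    std::vector<rectangle> dets = evaluate_detectors(detectors, img);
}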

grisevg commented 8 years ago

I actually only need bounding box detection as a first step of face alignment (facial points localisation).

alireza7991 commented 8 years ago

How much RAM does the MMOD CNN require for the example cpp? I have 8GB, but it ends with a bad_alloc error, which I think is related to running out of RAM.

davisking commented 8 years ago

The CNN example program uses a little over 5GB of GPU RAM. Only slightly more host RAM.

alireza7991 commented 8 years ago

So how can I find out why I get the bad_alloc error?

grisevg commented 8 years ago

If my GPU has only 4GB, I won't be able to train it? Does that mean I need a GPU with 6GB+ of VRAM?

davisking commented 8 years ago

I don't know why you run out of RAM. I would use a debugger or some other system tool to figure it out.

You can control the amount of RAM used by setting the mini-batch size to something smaller.
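
Concretely, in the dnn_mmod_ex.cpp setup the mini-batch is just however many crops you hand to train_one_step(), so asking the cropper for fewer (and/or smaller) crops cuts memory use. A sketch along those lines (the network definition is copied from that example; "faces.xml" and the crop count of 30 are only illustrative):

#include <dlib/data_io.h>
#include <dlib/dnn.h>
#include <dlib/image_transforms.h>

using namespace dlib;

// Network from dnn_mmod_ex.cpp.
template <long num_filters, typename SUBNET> using con5d = con<num_filters,5,5,2,2,SUBNET>;
template <long num_filters, typename SUBNET> using con5  = con<num_filters,5,5,1,1,SUBNET>;
template <typename SUBNET> using downsampler = relu<bn_con<con5d<32, relu<bn_con<con5d<32, relu<bn_con<con5d<16,SUBNET>>>>>>>>>;
template <typename SUBNET> using rcon5 = relu<bn_con<con5<45,SUBNET>>>;
using net_type = loss_mmod<con<1,9,9,1,1,rcon5<rcon5<rcon5<downsampler<input_rgb_image_pyramid<pyramid_down<6>>>>>>>>;

int main()
{
    std::vector<matrix<rgb_pixel>> images_train;
    std::vector<std::vector<mmod_rect>> boxes_train;
    load_image_dataset(images_train, boxes_train, "faces.xml");  // hypothetical dataset

    mmod_options options(boxes_train, 40, 40);
    net_type net(options);
    dnn_trainer<net_type> trainer(net);
    trainer.set_learning_rate(0.1);

    random_cropper cropper;
    cropper.set_chip_dims(200, 200);  // smaller chips also reduce RAM

    std::vector<matrix<rgb_pixel>> mini_batch_samples;
    std::vector<std::vector<mmod_rect>> mini_batch_labels;
    while (trainer.get_learning_rate() >= 1e-4)
    {
        // The example asks the cropper for more crops per step; a smaller
        // count here trades training speed for a smaller memory footprint.
        cropper(30, images_train, boxes_train, mini_batch_samples, mini_batch_labels);
        trainer.train_one_step(mini_batch_samples, mini_batch_labels);
    }
}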

davisking commented 8 years ago

It should also be pointed out that it takes a very long time to train. I have a Titan X card and it takes about a day to train a good face detector on a reasonably sized dataset.

grisevg commented 8 years ago

Can a CNN also be used for face alignment, and is it comparable with the "One Millisecond Face Alignment with an Ensemble of Regression Trees" implementation in dlib?

davisking commented 8 years ago

No, that's a separate thing.
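
For reference, that paper is implemented in dlib as shape_predictor; a minimal sketch, assuming the publicly available shape_predictor_68_face_landmarks.dat model and a hypothetical input image:

#include <dlib/image_io.h>
#include <dlib/image_processing.h>
#include <dlib/image_processing/frontal_face_detector.h>

using namespace dlib;

int main()
{
    frontal_face_detector detector = get_frontal_face_detector();
    shape_predictor sp;
    deserialize("shape_predictor_68_face_landmarks.dat") >> sp;

    array2d<rgb_pixel> img;
    load_image(img, "face.jpg");

    // Detection first, then alignment: the predictor maps each face box
    // to a set of facial landmarks.
    for (const rectangle& box : detector(img))
    {
        full_object_detection shape = sp(img, box);
        // shape.part(i) is the i-th landmark point.
    }
}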

alireza7991 commented 8 years ago

I was trying to use the CPU. How can I make it train in under 30 minutes with less RAM? (Good precision is not important.)

grisevg commented 8 years ago

Got it. Thank you, I'll try both clustering the dataset by tilt for DPM and using a CNN.

davisking commented 8 years ago

@grisevg No problem.

@alireza7991 Editing the training loop to do fewer iterations is one way. Or change the random cropping parameters to produce less jittered data; then it will converge faster. Make the batch size smaller. Do some mix of those things.
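
For the cropper part, something along these lines (the values are just illustrative):

#include <dlib/image_transforms.h>

using namespace dlib;

int main()
{
    random_cropper cropper;
    cropper.set_chip_dims(150, 150);      // smaller crops -> smaller mini-batches
    cropper.set_max_rotation_degrees(5);  // less rotation jitter
    cropper.set_translate_amount(0.05);   // less translation jitter
    cropper.set_randomly_flip(false);     // no random mirroring
}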

grisevg commented 8 years ago

@davisking Oh, one more question. Clustering dogs by breed made a huge improvement to the DPM approach, but would it also make sense for the CNN? Does a similar approach with multiple detectors apply to CNNs?

davisking commented 8 years ago

The CNN should be able to handle it all in one big model. That's its claim to fame.

mrgloom commented 8 years ago

Does dlib use any data augmentation while training a DPM-like detector (for example, horizontal/vertical flips, rotations, scale variations, color jittering)?

grisevg commented 8 years ago

@mrgloom it has upsampling. From http://dlib.net/fhog_object_detector_ex.cpp.html:

// Now we do a little bit of pre-processing.  This is optional but for
// this training data it improves the results.  The first thing we do is
// increase the size of the images by a factor of two.  We do this
// because it will allow us to detect smaller faces than otherwise would
// be practical (since the faces are all now twice as big).  Note that,
// in addition to resizing the images, these functions also make the
// appropriate adjustments to the face boxes so that they still fall on
// top of the faces after the images are resized.
upsample_image_dataset<pyramid_down<2> >(images_train, face_boxes_train);

For all other augmentations, as far as I understand, you just need to generate them yourself. dlib already has some implemented though, like mirroring with add_image_left_right_flips. It's quite easy to quickly generate most variations with OpenCV.

davisking commented 8 years ago

Right. Augmentation isn't going to happen automatically.

mrgloom commented 8 years ago

I have one more question regarding bounding boxes.

  1. Is it OK when bounding boxes overlap?
  2. Is it OK when a bounding box overlaps with an ignore bounding box?

davisking commented 8 years ago

Yeah, both are fine.

mrgloom commented 8 years ago

@grisevg I found these types of augmentation [1,2,3]

@davisking OK, but how are these cases handled?

  1. During training, is a positive sample just a crop of the bbox? Is it OK if a positive sample contains part of another positive object (because the bboxes overlap), or should such cases be avoided where possible?
  2. How is overlap between a bounding box and an ignore bounding box handled? Is the overlapping positive bbox ignored too, or is it reduced by subtracting the ignore bbox?
  3. Also, if I have additional images without objects, can I use them for hard-negative mining or something?

davisking commented 8 years ago

That's not at all how the trainer works. There is no cropping or subsampling. It trains on all possible windows in each image, so there is no need for hard negative mining either. This is explained in the example programs and in great detail in the MMOD paper.

grisevg commented 8 years ago

@davisking I was going through the imglab source code and stumbled upon an angle property in dlib::image_dataset_metadata::box:

// The angle of the object in radians.  Positive values indicate that the
// object at the center of the box is rotated clockwise by angle radians.  A
// value of 0 would indicate that the object is in its "standard" upright pose.
// Therefore, to make the object appear upright we would have to rotate the
// image counter-clockwise by angle radians.
double angle; 

That's what I was talking about when I opened this issue. Is it used anywhere? I searched through the code but didn't find any usages.

davisking commented 8 years ago

It's not used anywhere.