grisevg closed this issue 8 years ago
Not really. What are you going to do with oriented bounding boxes? If you want to do annotations that are more complex than a box I would add landmarks to the objects.
OBBs could be extracted into chips/an atlas (rotate and crop); that would get rid of the tilt, and the chips could be used for training.
I think OBBs, landmarks (and maybe lines?) would be good additions to imglab. There are lots of complex objects that cannot be described well by simple rectangles.
imglab already supports landmarks. The GUI for it admittedly isn't the best it could be, but it's alright. I don't think you need oriented bounding boxes if you have landmarks though.
Also, what kind of thing are you training where you think it's a good idea to crop out and normalize all the objects? I'm not sure that's very advisable.
I thought it could be good for tilted objects in training sets. For example, even though human faces are nearly always vertical, dog faces are very often tilted. I could be completely wrong here, though: is making an atlas of lots of faces and using it as positive examples for an object detection algorithm (with lots of images used as negative examples) a bad idea?
Is there a better approach to dealing with tilted faces? Keep only one face per image and rotate the image, or cluster the training set by tilt angle?
Are you going to rotate the images at test time at multiple angles and run the detector on each?
Yeah, I imagine that's the common approach, right?
Or is it too slow, and it's much better to have more detectors? EDIT: I do also have landmarks, so I could extract the tilt angle from the two eye landmarks, for example.
I would make multiple detectors.
I would also make a bigger dataset full of image flips and rotations. Then I would run imglab --cluster on it to get a group of coherent poses to train detectors on.
You could also use a CNN (see https://github.com/davisking/dlib/blob/master/examples/dnn_mmod_ex.cpp). That doesn't require any pose clustering since CNNs are able to deal with all this stuff internally.
What are pros and cons between HOG/DPM and CNN approaches for object detection? I imagine DPM is way faster at runtime?
Isn't it better to run many detectors at different angles in parallel? That would take advantage of multi-core processors without reducing performance, and it could also find a rotated box that fits the tilted object better. A CNN can be rotation invariant, but the result would not be a tight rotated box that fits the true object; it would predict a bigger rectangle that ignores rotation.
HOG is faster when running on the CPU, but the CNN is way more powerful.
If you run HOG on rotated images you need to recompute the entire HOG pyramid for each image. If you run multiple detectors on one image you don't. So it's way faster to not rotate.
I actually only need bounding box detection as a first step of face alignment (facial points localisation).
How much RAM does the MMOD CNN require for the example .cpp? I have 8GB, but it ends with a bad_alloc error, which I think means it's running out of RAM.
The CNN example program uses a little over 5GB of GPU RAM. Only slightly more host RAM.
So how can I find out why I get the bad_alloc error?
If my GPU has only 4GB, will I be unable to train it? Does that mean I need a GPU with 6GB+ of VRAM?
I don't know why you run out of RAM. I would use a debugger or some other system tool to figure it out.
You can control the amount of RAM used by setting the mini-batch size to something smaller.
It should also be pointed out that it takes a very long time to train. I have a titan X card and it takes about a day to train a good face detector on a reasonably sized dataset.
Can CNN also be used for face alignment and is it comparable with "One Millisecond Face Alignment with an Ensemble of Regression Trees" implementation in dlib?
No, that's a separate thing.
I was trying to use the CPU. How can I make it train in under 30 minutes with less RAM? (Good precision is not important.)
Got it. Thank you, I'll try both clustering dataset per tilt for DPM and using CNN.
@grisevg No problem.
@alireza7991 Editing the training loop to do fewer iterations is one way. Or change the random cropping parameters to produce less jittered data; then it will converge faster. Make the batch size smaller. Or do some mix of those things.
@davisking Oh, one more question. Clustering dogs per breed made a huge improvement to the DPM approach, but would it also make sense for the CNN? Does the same multiple-detector approach apply to CNNs?
The CNN should be able to handle it all in one big model. That's its claim to fame.
Does dlib use any data augmentation while training a DPM-like detector (for example horizontal/vertical flips, rotations, scale variations, color jittering)?
@mrgloom It has upsampling. From http://dlib.net/fhog_object_detector_ex.cpp.html:
// Now we do a little bit of pre-processing. This is optional but for
// this training data it improves the results. The first thing we do is
// increase the size of the images by a factor of two. We do this
// because it will allow us to detect smaller faces than otherwise would
// be practical (since the faces are all now twice as big). Note that,
// in addition to resizing the images, these functions also make the
// appropriate adjustments to the face boxes so that they still fall on
// top of the faces after the images are resized.
upsample_image_dataset<pyramid_down<2> >(images_train, face_boxes_train);
For all other augmentations, as far as I understand, you just need to generate them yourself.
dlib already has some implemented though, like mirroring with add_image_left_right_flips.
It's quite easy to generate most variations quickly with OpenCV.
Right. Augmentation isn't going to happen automatically.
I have one more question regarding bounding boxes.
Yeah, both are fine.
@grisevg I found these types of augmentation [1,2,3]
@davisking Ok, but how are these cases handled?
That's not at all how the trainer works. There is no cropping or subsampling. It trains on all possible windows in each image, so there is no need for hard negative mining either. This is explained in the example programs and in great detail in the MMOD paper.
@davisking I was going through the imglab source code and stumbled upon an angle property in dlib::image_dataset_metadata::box:
// The angle of the object in radians. Positive values indicate that the
// object at the center of the box is rotated clockwise by angle radians. A
// value of 0 would indicate that the object is in its "standard" upright pose.
// Therefore, to make the object appear upright we would have to rotate the
// image counter-clockwise by angle radians.
double angle;
That's what I was talking about when I made this issue. Is it used anywhere? I searched through the code but didn't find any usages.
It's not used anywhere.
Do you think it's a good idea to add support for oriented bounding boxes?
It shouldn't be hard to add the ability to rotate boxes to imglab; I can probably do a simple PR.
I know that HOG/DPM algorithms don't work on oriented boxes, but those boxes could be extracted into chips/an atlas with imglab and then used for training.