davisking / dlib

A toolkit for making real world machine learning and data analysis applications in C++
http://dlib.net
Boost Software License 1.0

Custom detection misaligned? #800

Closed lorenzob closed 6 years ago

lorenzob commented 7 years ago

Hi, I trained a detector and predictor to detect ID cards on an extremely small data set (7 images).

It works, but most of the results, even on the training set, are misaligned, in different directions and by different amounts, sometimes smaller, sometimes larger. Here is one example:

[screenshot: selection_109]

[screenshot: selection_108]

It seems like a simple case, similar to the stop sign example. Any idea what the reason could be? Too few samples? Differences in resolution between the samples? Too little training? Wrong training parameters? Bad placement of the shape markers during training (too little information/contrast at those locations)? Too few markers? A mistake on my part in drawing the overlays on the final image(!?)?

The example above is from train_shape_predictor.py; I only changed the way the shape marks are drawn, because win.add_overlay(shapes) only accepts 68 points.
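
A sketch of the kind of display change involved (the file names here are placeholders, and drawing a tiny rectangle per part is just one way to show an arbitrary number of landmarks):

```python
import dlib

# Placeholder file names; the detector/predictor come from the training runs.
detector = dlib.simple_object_detector("detector.svm")
predictor = dlib.shape_predictor("predictor.dat")

img = dlib.load_rgb_image("card.jpg")
win = dlib.image_window()
win.set_image(img)

for box in detector(img):
    win.add_overlay(box)
    shape = predictor(img, box)
    # win.add_overlay(shape) assumes the 68-point face layout, so draw a tiny
    # rectangle around each predicted part instead.
    for i in range(shape.num_parts):
        p = shape.part(i)
        win.add_overlay(dlib.rectangle(p.x - 2, p.y - 2, p.x + 2, p.y + 2))

win.wait_until_closed()
```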

Thanks in advance.

davisking commented 7 years ago

It's impossible to tell from what you posted. How many training images? What do they look like?

lorenzob commented 7 years ago

Hi Davis, I trained on 7 images and they all look like this one: a single ID card almost filling the whole image.

I've uploaded here the whole set: https://www.dropbox.com/s/umnz4x31llfr0ec/sample-IDs.zip?dl=1

The four markers are all in the same places. Maybe not enough "border" outside the marked region? This is the resulting detector:

[screenshot: selection_110]

I used only four markers to see if that was enough, to speed up the manual labeling of the training data later. I could add more, but I got the same result with 12 markers.

davisking commented 7 years ago

What's the point of running a detector on images if they are all images of perfectly cropped objects already? Aren't they all in the same position?

lorenzob commented 7 years ago

Yes, but they won't be like these in the real data I'll need to match. I assumed using clear flat data for training was the best option.

davisking commented 7 years ago

No, it's the worst option. The detector is going to learn something pathological like id cards are always surrounded by a black border since there is implicit black padding around the edges of an image. There are probably other important details in real images that are also missing like slight rotations or illumination changes.

You need to give training images that look like images you want to actually use.

lorenzob commented 7 years ago

OK, I agree with that. But here I was just trying to train on 7 simple images and then match the result against those same images, expecting optimal results on them, just to see if I'm doing things right. Later I'll train (and label...) on a much bigger data set, but first I wanted to test the pipeline (even overfitting is fine at this stage), and I'm getting strange results, very different from what you get with the faces and dogs examples.

davisking commented 7 years ago

There is a certain stride size used to step the detector over an image, so it's not going to put the box perfectly at the pixel you selected. The shape predictor, however, should be able to get pixel-perfect results on the training data. You have probably set the training parameters to some very simple model that is incapable of fitting the data. Read the documentation and try different parameters.
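
For example, something along these lines gives a more expressive model than the defaults; the numbers are only illustrative starting points and the file names are placeholders:

```python
import dlib

options = dlib.shape_predictor_training_options()
options.oversampling_amount = 300   # random perturbations of each training box
options.nu = 0.1                    # in (0, 1]; larger fits the training data harder
options.tree_depth = 4              # deeper trees make a more expressive model
options.cascade_depth = 15
options.feature_pool_size = 800
options.num_test_splits = 50
options.be_verbose = True

# Placeholder file names for the imglab XML and the output model.
dlib.train_shape_predictor("training.xml", "predictor.dat", options)
```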

e-fominov commented 7 years ago

@lorenzob, it looks like you are detecting with a detection window of about 96x64 pixels (one FHOG cell is about 8x8 pixels and you have a 12x8-cell detector). Your original image is 591x434 pixels, so detection only happens after downscaling the image to ~99x72 pixels (~6x downscale), with an accuracy of ±8 pixels at that scale. After upscaling the detection result, that ±8 pixels becomes ~±48 pixels, which is the error you are really seeing.

The real accuracy is slightly better because of the multiscale image pyramid, but this is the main reason why your detector does not produce pixel-accurate boxes: your detection window is small and you are detecting a large object.
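
Spelling out the arithmetic from the numbers above:

```python
# Numbers taken from the comment above.
downscale = 591 / 99.0         # original width / width after the ~6x shrink
cell_error = 8                 # ~one FHOG cell of slop at detection scale
print(cell_error * downscale)  # ~48 pixels of slop back at the original resolution
```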

Another reason is the absence of any margin around the object in your training images, which makes it hard for the detector to fit well. And if you really don't have any margins and you know the right object is there, why do you need to detect at all? Just use the full image.

So I recommend that you:

  1. Mark rectangles and landmarks on the training images with the imglab tool (you have already done this).
  2. Extend your training dataset by distorting the source images with small rotations, translations (adding margins by replicating the border), scaling and maybe some color changes. There is no ready-to-use code for this procedure; it's up to you to imitate real images (see the sketch after this list).
  3. Train the object_detector.
  4. Re-detect the training images with the trained detector and replace the rectangles drawn by hand with the detected rectangles. This can improve accuracy if your dataset has fewer than 10k images.
  5. Train the shape_predictor with at least 1000 images; you can extend your dataset by distorting the training images.
  6. Repeat steps 3-5 with different training parameters until you get good results.
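
A rough sketch of the distortion in step 2, using OpenCV for the warping; the jitter ranges here are only guesses, and the same affine matrix is applied to the landmark coordinates so the labels stay aligned:

```python
import random
import cv2
import numpy as np

def jitter(img, points, max_angle=8.0, max_shift=20, scale_range=0.1):
    """Return a randomly rotated/shifted/scaled copy of img and its landmarks."""
    h, w = img.shape[:2]
    angle = random.uniform(-max_angle, max_angle)
    scale = 1.0 + random.uniform(-scale_range, scale_range)
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, scale)
    M[0, 2] += random.uniform(-max_shift, max_shift)
    M[1, 2] += random.uniform(-max_shift, max_shift)
    # Replicate the border so the card does not end up surrounded by black padding.
    warped = cv2.warpAffine(img, M, (w, h), borderMode=cv2.BORDER_REPLICATE)
    pts = np.hstack([points, np.ones((len(points), 1))])  # homogeneous coordinates
    return warped, pts @ M.T

# e.g. new_img, new_pts = jitter(img, np.array(landmarks, dtype=float))
```
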
lorenzob commented 7 years ago

Hi Evgeniy, thanks for the detailed answer. I'll try all your suggestions as soon as I better understand what is happening with my toy example (except for wider margins and boxes).

What I noticed is that the predicted landmarks blindly follow the detection box. If the box is bigger or shifted, everything else shifts accordingly. It seems like it's not even trying to fit the landmarks. If I pass the predictor a randomly placed box, it happily draws the landmarks inside it. Or maybe it tries to place them but the box size/location cancels all the effort. Or it fails to find a match and falls back to default locations?

What I expect is that a small misalignment of the box should not influence the location of the landmarks (and I think Davis's comment confirms this).

I took a step back and I'm now working on just one single cat picture, trying to get perfect results on it, with a wide border around the subject and a larger box during training. Maybe this is the wrong approach here, but it's what I'd do in other training tasks: get near 100% accuracy on a small training set first. I can create 100 artificial variations of this picture, but I'm not sure that will help in this case (where I want to match only this one sample, not generalize).

These are the predictor training parameters (I tried a few more but with no real difference except in training time):

```python
options.oversampling_amount = 10000
options.nu = 0.2
options.tree_depth = 5
options.feature_pool_size = 400
```

And this is what I get:

Original image: [screenshot: 13261923_f520]

Training (red) and results (yellow): [screenshot: cats]

solo-predictor.xml.tar.gz

The yellow box is wider and shifted to the left and the landmarks follow.

What am I missing?

lorenzob commented 7 years ago

One more data point. I tried to train the "faces" example and this is what I get on the bald_guys example image:

[screenshot: selection_120]

Is this correct? It looks similar to my problem.

davisking commented 7 years ago

There is no problem. You just need more training data. With just one (or a few) training images this is exactly what you would expect. If you want detailed insight into the method you can read the paper by Kazemi that describes the algorithm. If you read it, it should be obvious what's happening and why.

lorenzob commented 7 years ago

Honestly, I've read the whole paper, but reading and understanding are not the same thing. It's something you have to study, not simply read.

I am confused because the faces demo works much better than my experiments with only a few images and I expected the same.

What I'm asking about is not algorithm details but only whether I am doing something clearly wrong with the training/labeling or afterwards. If the problem is only the amount of data, I'll label more images and repeat the training. I expected better feedback from the first experiments and I'm trying not to waste days labeling images the wrong way.

davisking commented 7 years ago

Nothing is obviously wrong, aside from the small amount of training data.

davisking commented 7 years ago

I should also point out that the face example dataset has 18 faces, which, while still way too small to make a usable model, is still much larger than 1. 18 is big enough to begin to see that it's working.
e-fominov commented 7 years ago

> Training (red) and results (yellow)
> What am I missing?

This is the same problem you have with the ID cards: you simply need more data and you need to prepare it better. If you never train with landmarks outside the rectangle, the shape_predictor will never place them outside it. The same holds for rectangle position and size: if you train with data where the landmarks are in the top half of the rectangle, it will never detect them at the bottom. The rectangle defines the initial position and scale for running the shape predictor, and the prediction process is based on the trained statistics. I also recommend you re-detect the rectangles after drawing them by hand. It is well known that a shape_predictor will not find landmarks correctly if you detect the face with some other detector (for example one from OpenCV).

PS: shape_predictor training is very sensitive to the amount of data. Instead of labeling images manually, generate them by writing software :)

lorenzob commented 7 years ago

Davis, Evgeniy thanks for your help.

I wrongly assumed that all the landmarks had to be inside the rectangle.

With "re-detect" you mean running the detector on a set of images, exporting new a training xml with the detected rectangles data and run again the detector training on this new file, correct?

About auto-labeling: right now I do not have a working detector. My plan was to label a few images manually until I get a working one, then use it to automatically label a few dozen more, manually fix the results, retrain, and proceed incrementally like this.

e-fominov commented 7 years ago

With "re-detect" you mean running the detector on a set of images, exporting new a training xml with the detected rectangles data and run again the detector training on this new file, correct?

No need to re-run the detector training. You should only retrain the shape_predictor on the improved dataset.
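
For reference, a minimal sketch of that re-detect step, assuming the standard imglab XML layout and placeholder file names:

```python
import xml.etree.ElementTree as ET
import dlib

# Placeholder file names; run this from the directory the XML lives in, since
# imglab stores image paths relative to the XML file.
detector = dlib.simple_object_detector("detector.svm")
tree = ET.parse("training.xml")

for image in tree.getroot().iter("image"):
    img = dlib.load_rgb_image(image.get("file"))
    dets = detector(img)
    if len(dets) != 1:
        continue  # keep the hand-drawn box when detection is missing or ambiguous
    box, d = image.find("box"), dets[0]
    # Replace the hand-drawn rectangle with the detected one; the landmark
    # <part> children are left untouched.
    box.set("top", str(d.top()))
    box.set("left", str(d.left()))
    box.set("width", str(d.width()))
    box.set("height", str(d.height()))

tree.write("training_redetected.xml")
```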

davisking commented 7 years ago

We just mean that the rectangles you use for training a shape predictor should come from whatever detector you will actually have available. However you generate rectangles when you really run the shape predictor is how you should generate the rectangles for the training data. In general, in machine learning, you want the training and testing data to be as similar as possible.

davisking commented 7 years ago

As for labeling, just label a bunch manually. You can easily label a few thousand boxes in an hour with a mouse and imglab. As for shapes though, yes, it's generally useful to make some crude auto-labeler just to get the landmarks into the image and then you can drag them into their proper places using imglab.

lorenzob commented 7 years ago

I made a small modification to imglab to be able to add landmarks in sequence just by shift+clicking while a rectangle is selected. My C++ is so ugly and the change so small that I really don't think it's worth contributing, but I could describe the idea in detail in a separate issue if this sounds interesting.

davisking commented 7 years ago

Yeah, the current way imglab makes you add points is kinda terrible :)

The way I do it is by programmatically generating the XML file with the points already in it and then using imglab to drag them to the correct locations. I still think that's the best way to do it. However, it would still be nice if imglab had a better way to add points on the fly. So if you want to submit a PR for this that would be cool. But you would need to make the code reasonably clean.
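
For reference, a sketch of that approach, assuming the standard imglab XML layout, a placeholder input file, and four made-up parts dropped at the box corners:

```python
import xml.etree.ElementTree as ET

# Placeholder input: an imglab XML that already contains the boxes.
tree = ET.parse("boxes_only.xml")

for box in tree.getroot().iter("box"):
    left, top = int(box.get("left")), int(box.get("top"))
    w, h = int(box.get("width")), int(box.get("height"))
    # Drop four placeholder parts at the box corners; imglab is then only used
    # to drag them onto the real landmark positions.
    corners = [(left, top), (left + w, top), (left + w, top + h), (left, top + h)]
    for i, (x, y) in enumerate(corners):
        ET.SubElement(box, "part", name="%02d" % i, x=str(x), y=str(y))

tree.write("boxes_with_parts.xml")
```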