Celebrandil / CudaSift

A CUDA implementation of SIFT for NVidia GPUs (1.2 ms on a GTX 1060)
MIT License

Orientation and descriptors are always extracted in the same Gaussian image inside one octave #52

Open lowsfer opened 5 years ago

lowsfer commented 5 years ago

Hi, I tried to read your code and found that you do not store the Gaussian-smoothed images. In the original SIFT paper, orientations and descriptors are extracted from the Gaussian-smoothed image whose scale is closest to the keypoint's. In your implementation, I see they are extracted from the first image of the octave (i.e. the only one that is saved). I understand this gives speed benefits, but did you thoroughly test scale/rotation invariance with this simplification?

Regards, Yao

Celebrandil commented 5 years ago

Yes, that is a correct observation. In the earliest version of the code, I used the layer corresponding to the keypoint's scale for orientations and descriptors, but in 2010 I changed that and collapsed the scales in order to gain considerably more speed. It would be interesting to see how that affected the accuracy and how it compares to today's version. I would like to check that later next week, if I have enough energy left after giving six lectures in seven days. It could definitely have affected the accuracy. What it should have affected most, though, are the magnitudes of the gradients, not their directions, which should limit the effect the change had on orientations and descriptors. Most important is to get the size of the window right.
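For later readers, here is a rough sketch of the two strategies being compared (hypothetical names, not the actual CudaSift code): Lowe's paper picks the Gaussian layer closest to the keypoint's scale, while the post-2010 simplification always uses the one image stored per octave.

```cuda
#include <algorithm>
#include <cmath>

struct Keypoint { float x, y, sigma; int octave; };

// Lowe-style selection: index of the Gaussian layer inside the octave whose blur
// is closest to the keypoint's scale. sigma0 is the base blur of the octave,
// nScalesPerOctave the number of intervals per octave.
int closestLayer(const Keypoint& kp, float sigma0, int nScalesPerOctave)
{
    // Layer i has blur sigma0 * 2^(i / nScalesPerOctave); invert that relation.
    float i = nScalesPerOctave * std::log2(kp.sigma / sigma0);
    int layer = (int)std::lround(i);
    return std::max(0, std::min(nScalesPerOctave + 2, layer));
}

// Simplification discussed in this thread: gradients for orientation and
// descriptor always come from layer 0, the only image kept per octave.
int simplifiedLayer(const Keypoint&) { return 0; }
```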

lowsfer commented 5 years ago

Ah, busy week ;-)

I also implemented a CUDA-based SIFT months ago, but recently found that yours is 3x as fast as mine. I was amazed and tried to find out why.

I just did a test with the following two images on my own SIFT implementation. The original image is 4000x3000.

- img1.jpg: resized to 1416x1062 and then cropped to 1000x750
- img2.jpg: directly resized to 1000x750

So the scale ratio is sqrt(2):1.

I used a high-resolution image as the original to avoid any difference in blurriness between the two. (img1 and img2 are attached as images.)

I tested with four settings:

  1. Following Lowe's paper: 1128 correct matches out of 4626/4411 feature points
  2. Only the descriptor is simplified to use the same image per octave: 1122 correct matches
  3. Only orientation assignment is simplified to use the same image per octave: 1051 correct matches
  4. Both orientation and descriptor use the same image per octave: 1050 correct matches

It seems this simplification has a higher impact on orientation than on the descriptor. The number of correct matches drops by 6.4%, and the accuracy drop relative to the number of feature points is 1.6%. That is probably acceptable for most applications.

Clever approach of yours!

Do you know of a public image dataset that is suitable for this test? When I get time, I want to write a script to do the same test on more images of different types. I have been out of academia for a while and do not know the up-to-date databases. Actually, I never knew one, because I was doing solar cells for my PhD :-(

Celebrandil commented 5 years ago

While (not) preparing for a lecture this morning, I had to do some testing and went back to the pre-2010 version of the code. I made the same observations you did: the orientation is quite sensitive to using the correct scale, while the descriptor is not.

Maybe it's possible to blur on the fly when computing the gradients for orientation estimation, without needing more memory. As long as you don't increase memory usage and pollute the caches, the speed should hardly be affected, since that part of the code is not that time-critical. Worth noting, though, is that the best scale for computing the orientation and descriptor is not necessarily the same as the one at which the DoG was detected. After all, they try to capture different things in the image, blobs vs edges.

The most important difference with respect to a traditional implementation is the observation that the semi-group property of Gaussian filters allows you to compute all scales in parallel, without sequentially filtering from scale to scale. It's just that on a CPU it used to be faster to do it sequentially, since you could use smaller filter kernels.

When I find more time, I will try to dig up some databases.
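In case it helps, the arithmetic behind that semi-group remark looks roughly like this (a sketch with made-up names, not code from this repository): to reach the blur of layer i you filter the octave's base image once with a wider kernel, so the layers no longer depend on each other.

```cuda
#include <cmath>
#include <vector>

// Blur to apply to the octave's base image (blur sigma0) to reach layer i directly,
// where layer i has absolute blur sigma0 * 2^(i / nScales).
std::vector<float> directBlurSigmas(float sigma0, int nScales, int nLayers)
{
    std::vector<float> sigmas(nLayers);
    for (int i = 0; i < nLayers; ++i) {
        float target = sigma0 * std::pow(2.0f, (float)i / nScales);
        // Semi-group property: G(a) convolved with G(b) equals G(sqrt(a^2 + b^2)).
        sigmas[i] = std::sqrt(target * target - sigma0 * sigma0);  // 0 for i == 0
    }
    return sigmas;
}
// Each layer depends only on the base image, so on a GPU all of them can be
// filtered in parallel (or in one fused kernel) instead of cascading layer by layer.
```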

lowsfer commented 5 years ago

Yes, it is mostly memory-bound.

For DoG building, I used similar fusion to what you used. I see you are using one kernel per octave for the DoG-building step. I was also fusing the generation of multiple Gaussian layers and DoG layers into one kernel, but not as aggressively as you did. The reason was that I was using a pretty large Gaussian window (15x15 or 17x17, with separable optimization for consecutive Gaussian layers). With my window size configuration, the halo would be too big for fusing the whole octave (possibly 35x35). Maybe my window size setting was too conservative and can be reduced. I see you are using 9x9 for a whole octave. I'm wondering if that also has an impact on accuracy.
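To make the halo cost concrete, here is an illustrative CUDA sketch (not code from either of our implementations) of a horizontal Gaussian pass that stages a tile plus halo in shared memory. The halo on each side equals the filter radius, so wider windows and deeper fusion mean proportionally more redundant reads per tile.

```cuda
#define TILE_W 128
#define RADIUS 4   // 9-tap filter, i.e. the 9x9 case mentioned above

// Filter weights, uploaded from the host with cudaMemcpyToSymbol.
__constant__ float d_kernel[2 * RADIUS + 1];

// Launch with grid = ((width + TILE_W - 1) / TILE_W, height) and block = TILE_W.
__global__ void gaussRowPass(const float* src, float* dst, int width, int height)
{
    __shared__ float tile[TILE_W + 2 * RADIUS];

    int row = blockIdx.y;
    int x   = blockIdx.x * TILE_W + threadIdx.x;   // pixel this thread writes

    // Load the tile plus RADIUS halo pixels on each side (clamped at the borders).
    for (int i = threadIdx.x; i < TILE_W + 2 * RADIUS; i += blockDim.x) {
        int gx = blockIdx.x * TILE_W + i - RADIUS;
        gx = min(max(gx, 0), width - 1);
        tile[i] = src[row * width + gx];
    }
    __syncthreads();

    if (x < width) {
        float sum = 0.0f;
        for (int k = -RADIUS; k <= RADIUS; ++k)
            sum += d_kernel[k + RADIUS] * tile[threadIdx.x + RADIUS + k];
        dst[row * width + x] = sum;
    }
}
```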

Right, with the on-the-fly approach we can calculate the orientation at a more accurate scale by setting the Gaussian filter based on the keypoint size. It will cause more memory reads per keypoint, though, due to the halo. But maybe that's not a big issue.
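Something like this is what I have in mind for the per-keypoint window (a hypothetical helper, just to spell out the cost): the extra blur follows from the semi-group property, and the gather radius, hence the halo each keypoint reads, grows with the keypoint's scale.

```cuda
#include <algorithm>
#include <cmath>

struct OriWindow { float incSigma; int radius; };

// Given a keypoint detected at blur kpSigma while only the octave's base image
// (blur baseSigma) is stored, compute the extra blur to apply on the fly and the
// gather radius around the keypoint. Lowe weights the orientation histogram with
// a Gaussian of about 1.5 * scale; 3 sigma covers essentially all of its support.
OriWindow orientationWindow(float kpSigma, float baseSigma, float windowFactor = 1.5f)
{
    OriWindow w;
    // Semi-group property again: baseSigma combined with incSigma reproduces kpSigma.
    w.incSigma = std::sqrt(std::max(kpSigma * kpSigma - baseSigma * baseSigma, 0.0f));
    w.radius   = (int)std::ceil(3.0f * windowFactor * kpSigma);
    return w;
}
```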

colinlin1982 commented 5 years ago

You may love it: http://demo.ipol.im/demo/my_affine_sift/archive