Assignment-3: Two-view reconstruction

karnikram commented 5 years ago

Task:

To reconstruct the sparse structure of a scene from two given images of it.

[Image: assign-3-sample]

Steps:

Feature extraction and matching: Detect interest points in both images, extract a descriptor (using SIFT/SURF) for each of them, and match them to form two sets of corresponding points. Existing library implementations such as this one can be used for this step.
Motion estimation: From the two sets of corresponding points, estimate the fundamental matrix between the two views using the normalized eight-point algorithm. Implement the algorithm within a RANSAC scheme to reject outlier matches from the previous step. Remember to normalize the image points before estimation and then to 'de-normalize' the estimated fundamental matrix. For each image, use T = [sqrt(2)/d, 0, -sqrt(2)/d * mu(0); 0, sqrt(2)/d, -sqrt(2)/d * mu(1); 0, 0, 1], where mu is the mean (centroid) of the image points and d is the mean distance of the points from mu, so that the normalized points are centered on the origin at a mean distance of sqrt(2). Then convert the fundamental matrix into an essential matrix using the provided calibration matrix, and decompose it into the relative rotation R and translation t using any existing implementation of Hartley & Zisserman's method. A Matlab implementation of the decomposition is also provided.
Triangulation: Once the relative motion (R and t) has been estimated, implement linear triangulation to estimate the 3D position of each pair of corresponding points.
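
As a rough illustration of the triangulation step, here is a minimal NumPy sketch of linear (DLT) triangulation for a single correspondence. It is written in Python rather than the recommended Matlab, and it assumes the first camera is P1 = K [I | 0] and the second is P2 = K [R | t]. The values of K, R, t and the test point are made up for the example and are not the assignment data.

```python
import numpy as np

def triangulate_point(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one correspondence.

    P1, P2: 3x4 projection matrices. x1, x2: (u, v) pixel coordinates.
    Each view gives two linear equations in the homogeneous 3D point X.
    """
    A = np.array([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # The solution is the right singular vector with the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]

def project(P, X):
    """Project a 3D point with a 3x4 camera matrix and dehomogenize."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

# Hypothetical calibration and relative motion, for illustration only.
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)
t = np.array([[-1.0], [0.0], [0.0]])

P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([R, t])

# Synthetic check: project a known point, then recover it.
X_true = np.array([0.5, -0.2, 4.0])
x1 = project(P1, X_true)
x2 = project(P2, X_true)
X_est = triangulate_point(P1, P2, x1, x2)
```

With noise-free projections the recovered point matches the original. With real matches, the result is only a least-squares estimate in the algebraic sense, and the overall scale of the scene is fixed by the (arbitrary) length of t.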

Files:

The accompanying files can be found here. Matlab (with the computer vision toolbox) is recommended for this assignment since it is easier to get started with.

Deliverables:

The estimated F-matrix, R, and t.
A 3D plot of the reconstructed scene with the two views represented by pyramids (which you used in the previous assignments).
Code files, written in a modular way and with meaningful comments.

Email your results to karnikram@gmail.com and ansariahmedjunaid@gmail.com.

Deadline:

Wednesday, the 29th.

Use this thread for any doubts you might have.

aadilmehdis commented 5 years ago

Can you please explain the construction of the Normalization Transform Matrix T?

Currently, with the definition of T, it seems like we are shrinking the image coordinates into the range [-sqrt(2), sqrt(2)] instead of [-1, 1]. Could you explain why we are doing this?

Moreover, the scaling factor is the same along the x and y directions. However, if we are normalizing the points, it should be different for the x and y directions, possibly based on the average width and height of the image. Could you explain why we are taking the Euclidean distance of the image points as well?

TIA

karnikram commented 5 years ago

Good questions.

Currently, with the definition of T, it seems like we are shrinking the image coordinates into the range [-sqrt(2), sqrt(2)] instead of [-1, 1]. Could you explain why we are doing this?

This normalization matrix first translates all the image coordinates so that they're centered on the origin, and then applies a scaling so that the average distance of a point from the origin is sqrt(2). This means that an average point, in homogeneous coordinates, is roughly (1, 1, 1). This is desirable because each entry of the A matrix is a product of these coordinates, so the entries will also have similar magnitudes. With raw pixel coordinates the entries range from 1 up to the order of 10^4 or more, which makes A badly conditioned. Since DLT minimizes ||Af|| subject to ||f|| = 1, where f is the vector of entries of F, similar magnitudes make sure that every entry of f has a comparable influence on the residual, and the solution is not dominated by a few large columns. This way the algorithm becomes more stable.
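
To make this concrete, here is a small NumPy sketch (Python rather than Matlab) that builds T as described and compares the column magnitudes of A before and after normalization. The function names and the random points are made up for illustration. The points are not a real two-view geometry; they only serve to show the scale of the entries.

```python
import numpy as np

def normalization_transform(pts):
    """3x3 transform for an (N, 2) array: centroid to origin, mean distance sqrt(2)."""
    mu = pts.mean(axis=0)
    d = np.linalg.norm(pts - mu, axis=1).mean()  # mean distance from the centroid
    s = np.sqrt(2) / d
    return np.array([[s, 0.0, -s * mu[0]],
                     [0.0, s, -s * mu[1]],
                     [0.0, 0.0, 1.0]])

def apply_transform(T, pts):
    """Apply a 3x3 transform to (N, 2) points via homogeneous coordinates."""
    homog = np.hstack([pts, np.ones((len(pts), 1))])
    return (homog @ T.T)[:, :2]

def design_matrix(x1, x2):
    """Rows of A for x2^T F x1 = 0, with F flattened row by row."""
    u1, v1 = x1[:, 0], x1[:, 1]
    u2, v2 = x2[:, 0], x2[:, 1]
    return np.column_stack([u2 * u1, u2 * v1, u2,
                            v2 * u1, v2 * v1, v2,
                            u1, v1, np.ones(len(x1))])

# Made-up pixel coordinates in a 640x480 image (illustration only).
rng = np.random.default_rng(0)
pts1 = rng.uniform([0, 0], [640, 480], size=(50, 2))
pts2 = pts1 + rng.normal(0, 5, size=(50, 2))

T1 = normalization_transform(pts1)
T2 = normalization_transform(pts2)
n1 = apply_transform(T1, pts1)
n2 = apply_transform(T2, pts2)

mean_dist = np.linalg.norm(n1, axis=1).mean()  # sqrt(2) by construction

# Ratio between the largest and smallest average column magnitude of A.
col_raw = np.abs(design_matrix(pts1, pts2)).mean(axis=0)
col_norm = np.abs(design_matrix(n1, n2)).mean(axis=0)
spread_raw = col_raw.max() / col_raw.min()
spread_norm = col_norm.max() / col_norm.min()
print(mean_dist, spread_raw, spread_norm)
```

The raw columns differ by several orders of magnitude, while the normalized ones are all of order one. Under the x2^T F x1 = 0 convention used here, an F estimated from the normalized points is de-normalized as T2^T F T1.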

This point is explained in much more detail in this paper by Hartley.

Moreover, the scaling factor is the same along the x and y directions. However, if we are normalizing the points, it should be different for the x and y directions, possibly based on the average width and height of the image.

In the paper he also shows that applying a non-isotropic scaling (different factors for x and y directions) actually has little effect on the results.

Could you explain why we are taking the Euclidean distance of the image points as well?

I don't have an answer for why we use the L2 norm rather than some other norm.

RoboticsIIITH / summer-sessions-2019

Assignment-3: Two-view reconstruction #10