LouisFoucard / w-net

w-net: a convolutional neural network architecture for the self-supervised learning of depth maps from pairs of stereo images.

Different disparity levels? #3

Open aellaboudy opened 7 years ago

aellaboudy commented 7 years ago

In this work, you've hardcoded the expected disparity levels to ±16. Is there anything stopping us from increasing them to a larger number, like ±128? There are other datasets I'd like to test on that have these levels of disparity. What would need to change in the code to enable larger disparity searches?

LouisFoucard commented 6 years ago

The Selection layer takes a list of disparity levels as input at initialization. When constructing the network, I hardcoded ±16 in the build function, but that can be changed to any other list of disparity levels. The other thing you'll have to be careful about is making sure that the number of channels in the last convolution before the selection layer matches the total number of disparity levels you want. For example, if you want a total of 256 disparity levels, the input to the selection layer must look like (batch_size, H, W, 256). I will make the disparity levels an input to the network-building function in a future push.
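For concreteness, the mechanics look roughly like this (a simplified sketch in modern TensorFlow/Keras, not the repo's exact code):

```python
import tensorflow as tf

class SelectionLayer(tf.keras.layers.Layer):
    """Simplified sketch: blends horizontally shifted copies of an image
    using per-pixel softmax weights over the disparity levels."""
    def __init__(self, disparity_levels, **kwargs):
        super().__init__(**kwargs)
        self.disparity_levels = disparity_levels

    def call(self, inputs):
        probs, image = inputs  # probs: (B, H, W, D), image: (B, H, W, C)
        shifted = [tf.roll(image, shift=d, axis=2)       # shift along width
                   for d in self.disparity_levels]       # (wraps at borders)
        stack = tf.stack(shifted, axis=-1)               # (B, H, W, C, D)
        weights = probs[:, :, :, tf.newaxis, :]          # (B, H, W, 1, D)
        return tf.reduce_sum(stack * weights, axis=-1)   # expected image

# The conv feeding it must emit one channel per disparity level, e.g.:
# probs = tf.keras.layers.Conv2D(len(levels), 3, padding="same",
#                                activation="softmax")(features)
```

The softmax over the channel axis is what ties the channel count of that last convolution to the number of disparity levels.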

aellaboudy commented 6 years ago

Yes, I realized that changing the hardcoded ±16 value was not enough. I was able to get the model to compile after changing the dimensions in your model. You can see my work at https://github.com/aellaboudy/w-net.

However, the memory requirements quickly grow too large and I run out of memory during training. Even when memory is not a problem, the loss becomes unstable and goes to NaN after a few epochs.

aellaboudy commented 6 years ago

It seems that with larger disparity values you must reduce the learning rate. I'm using a learning rate of 1e-7 now, and that produces stable loss metrics. I also changed the disparity ranges to [0, 64] for the left disparity and [-64, 0] for the right disparity. I'm assuming the images are rectified, so the left and right disparities can only take positive and negative values, respectively. This should hold for the dataset I'm using, KITTI.
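Concretely, the settings look like this (a sketch; names are illustrative, not the actual code):

```python
from tensorflow import keras

max_disp = 64
left_levels  = list(range(0, max_disp + 1))   # [0, ..., 64] for the left map
right_levels = list(range(-max_disp, 1))      # [-64, ..., 0] for the right map

# A much smaller learning rate keeps the loss from diverging to NaN:
optimizer = keras.optimizers.Adam(learning_rate=1e-7)
```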

I also added one more layer to your model, plus a consistency check in the loss to make sure the left and right disparity maps agree.
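The consistency check is along these lines (a simplified nearest-neighbor sketch of the idea, not the exact code in my fork):

```python
import tensorflow as tf

def lr_consistency_loss(disp_left, disp_right):
    """With left disparities in [0, 64] and right disparities in [-64, 0],
    consistent maps satisfy d_L(x) ~ -d_R(x - d_L(x))."""
    # disp_left, disp_right: (B, H, W) disparity maps in pixels.
    w = tf.shape(disp_left)[2]
    xs = tf.cast(tf.range(w), disp_left.dtype)[tf.newaxis, tf.newaxis, :]
    coords = xs - disp_left                            # matching right-image x
    idx = tf.clip_by_value(tf.cast(tf.round(coords), tf.int32), 0, w - 1)
    warped = tf.gather(disp_right, idx, batch_dims=2)  # nearest-neighbor lookup
    return tf.reduce_mean(tf.abs(disp_left + warped))  # d_L + d_R should be ~0
```

A real version would use bilinear sampling instead of rounding, but the idea is the same.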

The results I'm getting so far are not great on the KITTI dataset: the disparity output looks heavily quantized, and I see "squares" in the output that are equal in size to the max disparity. I'm trying to debug what is going wrong. Any help appreciated.

aellaboudy commented 6 years ago

After 100 epochs on the KITTI dataset, results are not stellar even after stretching the disparity levels to [0,120] and [-120,0] for the left and right images, respectively. Sample image below.

[image: sample disparity output on KITTI after 100 epochs]

It would be great to understand what is stopping it from working well with larger disparities. One theory I have is that the KITTI dataset is much more challenging because of illumination changes (shadows, etc.), which make for very non-uniform image gradients on smooth depth surfaces; see the image above. This paper has some pretty interesting ideas on how to get around illumination changes with slightly more complex loss functions and a more complicated architecture.
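For what it's worth, one common illumination-robust trick is mixing a locally normalized measure like SSIM into the photometric loss; a rough sketch (not necessarily the paper's exact formulation):

```python
import tensorflow as tf

def photometric_loss(reconstructed, target, alpha=0.85):
    """Blend SSIM (robust to local brightness shifts) with plain L1."""
    ssim = tf.image.ssim(reconstructed, target, max_val=1.0)  # per-image SSIM
    ssim_term = (1.0 - tf.reduce_mean(ssim)) / 2.0
    l1_term = tf.reduce_mean(tf.abs(reconstructed - target))
    return alpha * ssim_term + (1.0 - alpha) * l1_term
```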

Ockhius commented 5 years ago

It seems to me that the main problem keeping the network from predicting larger disparities is its receptive field. I made a rough calculation, and it looks like the receptive field for a pixel at the output is around 160 pixels ([-80, +80]). If we say the network looks for correlations in the vicinity of a pixel to compute shift/disparity, it can't get enough information for the disparities you want it to handle. In fact, precision should already decrease toward higher disparity values. A deeper architecture might widen the valid disparity interval, though it would be harder to train.
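The back-of-the-envelope calculation goes like this (the layer list is made up just to show the method; the real w-net stack differs):

```python
def receptive_field(layers):
    """layers: (kernel_size, stride) pairs, ordered input -> output."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump  # each layer widens the field by (k-1)*jump
        jump *= s             # strides compound the effective step size
    return rf

# e.g. alternating plain and strided 3x3 convs:
print(receptive_field([(3, 1), (3, 2)] * 4))  # -> 61 pixels for this toy stack
```

If the field an output pixel sees is narrower than the disparity you ask for, the matching pixel in the other view simply isn't visible to it.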