Sarthak-22 opened this issue 2 years ago
This is indeed a little strange. Did you try to warp the left image according to the disparity map generated by other methods, i.e., GANet? Maybe you can give me a visualization result of other methods so that we can better understand the problem.
Yes, the GANet disparity map gives pretty accurate results. The GANet disparity is with respect to the left image. The link below has the warping results with GANet: i) actual left image, ii) output warped image (warped right / estimated left), iii) disparity map, iv) input right image.
Though the image used here is different from the one used above, I don't think that should be an issue.
For getting this warped output, I used exactly the same code as given above, only changing the `img` and `disp` filepaths, nothing else.
Please look into this.
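For reference, warping the right image into the left view with a left-referenced disparity map is usually done by backward sampling: each left pixel looks up `right[y, x - d]`. Here is a minimal numpy sketch of that operation (nearest-neighbour sampling; my own illustration, not the exact code used above):

```python
import numpy as np

def backward_warp_right_to_left(right, disp_left):
    """Backward-warp the right image into the left view (nearest neighbour).

    For each left pixel (y, x) we sample right[y, x - d(y, x)], following the
    rectified-stereo convention x_right = x_left - d. Out-of-range samples are
    left as zeros and reported in `valid`.
    """
    H, W = disp_left.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    src_x = np.round(xs - disp_left).astype(int)
    valid = (src_x >= 0) & (src_x < W)
    warped = np.zeros_like(right)
    warped[valid] = right[ys[valid], src_x[valid]]
    return warped, valid
```

Note that nothing in this lookup knows about occlusion: an occluded left pixel still samples *some* right pixel (typically a foreground one), which is where the artifacts discussed below come from.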
I ran some warping experiments on the SceneFlow dataset with GT disparity, and the reason the warping results are weird is probably occlusion.
For example, take the shaded area circled by the red line in the left image.
When we warp the right image to the left view, we look at the corresponding area in the left disparity map. The red area in the left disparity tells us the displacement there is quite small, so we sample pixels from the corresponding red area in the right image. However, that red area in the right image is actually occluded by the purple area (the front wheel), which is why, in the warped right-to-left image, the red area ends up sampling pixels from the front wheel in the right image.
Vice versa, warping the left image to the right view using the right disparity also produces similar artifacts at the back of the car.
The example image pair is different from yours, but I think it's fair to conclude that occlusion is why warping produces such artifacts here (and probably in every multi-view system...).
As for your GANet warping giving a more accurate result, I believe that's because occlusion in that scene is relatively small, since the objects are farther away.
Your explanation makes sense, but how does it explain the doubling effect in the very first warped image posted above? Even objects slightly farther away show this doubling effect. I tried warping images with nearby objects using the disparity map generated by GANet and the same doubling problem persists. Maybe this is a problem only in the case of stereo matching (and not monocular depth estimation), but I am not sure.
I believe this doubling effect is also caused by occlusion.
For areas that have been occluded, there ought to be no correspondence in the other view, whereas the GT disparity, or the disparity predicted end-to-end by GANet, still has valid values there (we clearly can't see holes in the disparity map).
If there are no corresponding pixels in the other view because the background is occluded by the foreground, then chances are the pixels sampled from the other view belong to the foreground objects. Hence the doubling effect of the foreground objects (e.g. the traffic lights in your case).
Moreover, I find that for occluded areas this doubling effect is extremely severe when the depth difference between background and foreground is large.
For example, at the upper part of the ground and traffic light, the depth difference between them is larger, so a bigger area of the ground is incorrectly sampled from the traffic light, while at the bottom their depths are identical, hence no pixels are incorrectly sampled from the traffic light, resulting in a "Y"-shaped doubling artifact.
I also tried to compute the occluded areas in the left image. I define the occluded background pixels in the left image as follows:
if `[p1_left, p2_left, ..., pk_left]` all correspond to the same `p_right` (note that `p1_left < p2_left < ... < pk_left`),
then `[p1_left, p2_left, ..., pk-1_left]` are occluded background pixels, while `pk_left` is the foreground pixel.
They are occluded because several pixels in the left image point to one pixel in the right image.
The result is quite reasonable (the blue areas are the computed occluded areas in the left image).
Maybe this can help to separate out the weird doubling effect, but sadly I can't find a way to sample pixels in these areas from the right image...
Actually, in the video super-resolution field, people use deformable convolution networks to align images with optical flow instead of simply warping.
Maybe that can provide you with some enlightenment.
☞ https://github.com/YapengTian/TDAN-VSR-CVPR-2020
Your explanation does make sense. Thanks a lot. So, other than the GitHub repo you linked, is there any simpler and faster way to avoid this occlusion problem in stereo matching? Also, is this a general problem for stereo matching with any architecture (GANet, PSMNet, etc.)? Because I never faced this weird effect with monocular depth estimation.
Hi, I believe this problem is general in the stereo scenario, and it has nothing to do with the particular stereo matching algorithm.
Note that the reason we encounter this problem when warping a view to the reference view according to disparity is that, in multi-view imaging systems (including stereo cameras), occlusion means some areas of the scene are never observed by a subset of the cameras. Sampling unseen pixels to warp to another view is therefore impossible.
Meanwhile, occlusion is also a challenging problem for predicting depth via stereo matching, but with the exciting progress of deep learning, methods like GANet and PSMNet can try to predict the disparity in occluded areas by implicitly applying some high-level constraints (for example: disparity is smooth to some extent). This is exciting if we want to obtain the depth of a scene, but it still doesn't mean we have found correspondences in the other view for the occluded areas.
So if we cannot find corresponding pixels in the target image to warp to the reference image, especially when we are using backward warping, then for those occluded areas we are actually sampling foreground pixels instead of the unseen background pixels, which causes the doubling effect.
Also, you mentioned that this never occurred in monocular depth estimation. Here are some reasons I can think of:
1) In monocular depth estimation, there obviously exists no occlusion. (Though I'm not familiar with that, lol.)
2) Maybe in monocular depth estimation the warping operation is forward warping, which might not suffer from sampling wrong pixels. But I think forward warping might produce holes or something? I am not sure.
Anyway, that's my opinion.
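The hole problem with forward warping mentioned above can be seen in a toy 1-D sketch (my own illustration, not code from any of the methods discussed): each source pixel is scattered to its target column, and any target column nobody scatters to stays empty.

```python
import numpy as np

def forward_warp_row(src, disp):
    """Toy 1-D forward warp: scatter each source pixel x to column x - d.

    Target columns that receive no pixel remain NaN (holes); on collisions
    the last write wins, so foreground/background ordering is not resolved.
    """
    W = src.shape[0]
    out = np.full(W, np.nan)
    tgt = np.round(np.arange(W) - disp).astype(int)
    ok = (tgt >= 0) & (tgt < W)
    out[tgt[ok]] = src[ok]
    return out
```

So forward warping trades the wrong-pixel sampling of backward warping for holes and unresolved collisions.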
As for a fast way to avoid this occlusion problem in stereo matching, my solution is to warp only the unoccluded pixels; and if we want the warped image to look reasonable, maybe in those occluded areas we can sample from the reference image (which is cheating, because no information from the target image is used there :).
For example, first we warp the right image to the left and compute a mask excluding the occluded areas as well as the out-of-sight areas in the reference left image (depicted at the upper right of the plot below). The image pair is from KITTI 2015 and the disparity is derived by PSMNet.
Applying this mask to the initial warped right-to-left image, we obtain a valid warped right-to-left image.
So far, that's basically all we can do to warp the target right image to the left view based on the shared information observed by both views.
Furthermore, if we want a complete warped right-to-left image, we can cheat a little bit and sample those invalid areas from the reference left image. The result is great, except for some pixels in textureless areas!
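The "cheat" described above is just a masked blend: keep warped pixels where the warp is valid and fall back to the reference left image elsewhere. A one-line numpy sketch, assuming a binary validity mask of the same height and width:

```python
import numpy as np

def fill_invalid_from_reference(warped, mask, ref_left):
    """Keep warped pixels where mask == 1; elsewhere fall back to the
    reference left image (cheating, since no target-view information exists
    for those pixels)."""
    return np.where(mask.astype(bool), warped, ref_left)
```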
For your convenience, here's my code to compute the invalid occluded and out-of-sight areas. My code might not be efficient though; if you have a more efficient implementation, I'd appreciate it if you'd share it with me.
```python
import math

import numpy as np

# -----warping operation---CHR------
def occlude_det(xgrid, disp, if_flow_negative=True):
    # xgrid, disp: [batch, 1, H, W] tensors; flow_xgrid gives the target
    # x-coordinate that each reference pixel maps to
    batch, chn, H, W = xgrid.size()
    xgrid = xgrid.detach().cpu().numpy()
    disp = disp.detach().cpu().numpy()
    flow_xgrid = xgrid + disp
    occlude_map = np.ones_like(flow_xgrid)
    for iter_batch in range(batch):
        for iter_chn in range(chn):
            for iter_row in range(H):
                # table: floored target x-coordinate -> reference columns mapping to it
                idx_target = {}
                for iter_col in range(W):
                    tgt = math.floor(flow_xgrid[iter_batch, iter_chn, iter_row, iter_col])
                    idx_target.setdefault(tgt, []).append(iter_col)
                    # reference pixels mapping outside the target image are out of sight
                    if not (0 <= tgt < W):
                        occlude_map[iter_batch, iter_chn, iter_row, iter_col] = 0
                for iter_col in range(W):
                    tgt = math.floor(flow_xgrid[iter_batch, iter_chn, iter_row, iter_col])
                    if idx_target[tgt]:
                        if if_flow_negative:
                            # negative flow: the max reference column is the front-most
                            # object; whatever remains is occluded background
                            idx_target[tgt].pop()
                        else:
                            # positive flow: the min reference column is the front-most
                            idx_target[tgt].pop(0)
                    for coord in idx_target[tgt]:
                        # after popping the front pixel, what's left is occluded
                        occlude_map[iter_batch, iter_chn, iter_row, coord] = 0
    print("number of occluded pixels: ", batch * chn * H * W - np.sum(occlude_map))
    return occlude_map
```
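As a possibly more efficient alternative (my own sketch, assuming a non-negative left disparity map with the convention x_right = x_left − d, i.e. the negative-flow case above), the "keep the front-most reference pixel per target column" rule can be vectorized with `np.maximum.at`:

```python
import numpy as np

def occlusion_mask(disp_left):
    """Vectorized sketch of the occlusion test above (my own re-implementation).

    Assumes a non-negative left disparity map of shape [H, W] with
    x_right = x_left - d. A left pixel is valid (1) only if it is the
    front-most (right-most) pixel landing on its floored target column;
    out-of-sight pixels are marked 0 as well.
    """
    H, W = disp_left.shape
    target = np.floor(np.arange(W)[None, :] - disp_left).astype(int)
    valid = (target >= 0) & (target < W)
    rows, cols = np.nonzero(valid)
    # per target column, the largest source column wins (it is the foreground)
    winner = np.full((H, W), -1)
    np.maximum.at(winner, (rows, target[rows, cols]), cols)
    mask = np.zeros((H, W), dtype=np.uint8)
    mask[rows, cols] = (winner[rows, target[rows, cols]] == cols).astype(np.uint8)
    return mask
```

This replaces the per-pixel dictionary bookkeeping with two fancy-indexing passes per image, at the cost of handling only the single-channel, known-sign-of-flow case.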
Ohh OK, got the point. Thanks a lot for your help; it really helped my analysis.
So, after all these weird warping results, can we say that stereo matching in general cannot generalize well and is not a good algorithm? Why do the best methods like GANet give such weird warping results even though GANet has consistently topped the KITTI leaderboard? I understand that the weird results are due to occlusions, but isn't stereo matching a much superior algorithm that should give relatively better outputs?
I encountered the same problem.
Thanks for your analysis. I encountered the same problem when trying to warp an image with optical flow (or disparity), and I reached the same conclusion as you.
> Actually, in the video super-resolution field, people use deformable convolution networks to align images with optical flow instead of simply warping.
The above is another solution you mentioned; do you know how well it performs?
Hi author, may I ask what the two inputs to the code you provided are?
Hi, I would like to know what the inputs to the code provided by the author are. What is xgrid? Could you help me? My English is poor, sorry!
Solved for me, thanks.
Hi, I am getting weird output images when warping the right image with the disparity map obtained from the pre-trained model. I learnt from the code that the disparity map is with respect to the left image, so I tried warping the right image with it. Below is the warping code I used.
The disparity maps are obtained from both the sceneflow_checkpoint and the finetuned_model checkpoint. I warped the same image with these two disparity maps but I get the same irregular output from both. I have used the above warping code many times and I don't think there is any problem with the code; I believe the problem is with the disparity map itself. Can someone help me figure out what could possibly have gone wrong?
Below is the input right image - https://i.stack.imgur.com/aZia5.jpg
Below is the output warped right (also the estimated left) image I got - https://i.stack.imgur.com/tHCGo.jpg