gallenszl / CFNet

CFNet: Cascade and Fused Cost Volume for Robust Stereo Matching (CVPR 2021)
MIT License

Weird warping results from pre-trained model disparity map #17

Open Sarthak-22 opened 2 years ago

Sarthak-22 commented 2 years ago

Hi, I am getting weird output images when warping the right image with the disparity map obtained from the pre-trained model. I learned from the code that the disparity map is with respect to the left image, so I tried warping the right image with it. Below is the warping code I used:

import numpy as np
from PIL import Image

def depth_read(filename):
    # loads a KITTI-style 16-bit depth/disparity PNG and returns it as a numpy array
    depth_png = np.array(Image.open(filename), dtype=np.int64)
    # make sure we have a proper 16-bit map here, not 8-bit!
    # assert(np.max(depth_png) > 255)

    depth = depth_png.astype(np.float32) / 256.0
    # normalize by the image width so the shift is in grid_sample's [0, 1] coordinate range
    depth = depth / depth.shape[1]
    # depth[depth_png == 0] = -1.
    return depth

from skimage import io
import torch
import torch.nn.functional as F

img = io.imread(<rightimg_filepath>)     # right image
disp = depth_read(<disparity_filepath>)  # disparity map with respect to the left image

print(img.shape, disp.shape)  # (375, 1242, 3), (375, 1242)

img = torch.from_numpy(img.transpose(2, 0, 1)).float().unsqueeze(0) / 255.0  # image as NCHW in [0, 1]
disp = torch.from_numpy(disp).float().unsqueeze(0).unsqueeze(0)              # disparity as NCHW

print(img.shape, disp.shape)  # (1, 3, 375, 1242), (1, 1, 375, 1242)

def apply_disparity(img,disp): # gets a warped output
  batch_size, _, height, width = img.size()

  # Original coordinates of pixels
  x_base = torch.linspace(0, 1, width).repeat(batch_size, height, 1).type_as(img)
  y_base = torch.linspace(0, 1, height).repeat(batch_size, width, 1).transpose(1, 2).type_as(img)

  # Apply shift in X direction
  x_shifts = disp[:, 0, :, :]  # Disparity is passed in NCHW format with 1 channel
  flow_field = torch.stack((x_base + x_shifts, y_base), dim=3)
  # In grid_sample coordinates are assumed to be between -1 and 1
  output = F.grid_sample(img, 2 * flow_field - 1, mode='bilinear', padding_mode='zeros', align_corners=True)

  return output

# pass -disp: a left-image pixel at x matches the right image at x - d, so we sample the right image at x - d
output = (apply_disparity(img, -disp) * 255.0).detach()[0, :, :, :].cpu().numpy().transpose(1, 2, 0)
output.shape # (375, 1242, 3)

The disparity maps were obtained from both the sceneflow_checkpoint and the finetuned_model checkpoint. I warped the same image with these two disparity maps and get the same irregular output in both cases. I have used the above warping code many times and I don't think there is any problem with it; I believe the problem is with the disparity map itself. Can someone help me figure out what could have gone wrong?

Below is the input right image - https://i.stack.imgur.com/aZia5.jpg

Below is the output warped right (also the estimated left) image I got - https://i.stack.imgur.com/tHCGo.jpg

gallenszl commented 2 years ago

This is indeed a little strange. Did you try to warp the left image according to the disparity map generated by other methods, e.g., GANet? Maybe you can give me a visualization of the result from another method so that we can better understand the problem.

Sarthak-22 commented 2 years ago

Yes, the GANet disparity map gives fairly accurate results. The GANet disparity is also with respect to the left image. The link below has the warping results with GANet: i) actual left image, ii) output warped image (warped right / estimated left), iii) disparity map, iv) input right image.

GANet_warping

Though the image used here is different from the one used above, I don't think that should be an issue. To get this warped output I used exactly the same code as given above, changing only the img and disp file paths, nothing else. Please look into this.

HenryChen98 commented 2 years ago

I ran some warping experiments on the SceneFlow dataset with GT disparity, and the reason the warping results are weird is probably occlusion. [image]

For example, consider the shaded area circled by the red line in the left image. When we warp the right image to the left image, we look at the corresponding area in the left disparity. The red area in the left disparity tells us the displacement here is quite small, so we sample the pixels in the corresponding red area of the right image. However, the red area in the right image is actually occluded by the purple area (the front wheel), which is why, in the warped right-to-left image, this red area samples pixels from the front wheel in the right image. Vice versa, warping the left image to the right image using the right disparity produces similar artifacts at the back of the car. The example image pair is different from yours, but I think it is fair to conclude that occlusion is why warping has such artifacts here (or maybe in every multi-view system). As for your GANet warping giving a more accurate result, I believe that is because the occlusion in that scene is relatively small, since the objects are farther away.

Sarthak-22 commented 2 years ago

Your explanation makes sense, but how does it explain the doubling effect in the very first warped image posted above? Even objects slightly farther away show this doubling effect. I tried warping images containing nearby objects with the disparity map generated by GANet and the same doubling problem persists. Maybe this is a problem only in stereo matching (and not in monocular depth estimation), but I am not sure.

HenryChen98 commented 2 years ago

I believe this doubling effect is also caused by occlusion. For areas that have been occluded there ought to be no correspondence in the other view, whereas the GT disparity, or the disparity predicted by an end-to-end network such as GANet, still has valid values there (we clearly can't see holes in the disparity map).

If there are no corresponding pixels in the other view because the background is occluded by the foreground, then chances are the pixels sampled from the other view come from the foreground objects. Hence the doubling effect of the foreground objects (e.g., the traffic lights in your case).

Moreover, I find that for occluded areas this doubling effect is extremely severe when the depth difference between background and foreground is large. [image]

For example, at the upper part of the ground and traffic light, the depth difference between the ground and the traffic light is larger, so a bigger area of the ground is incorrectly sampled from the traffic light, while at the bottom the depths of the ground and the traffic light are identical, so no pixels are incorrectly sampled from the traffic light. The result is a 'Y'-shaped doubling artifact.

I also tried to compute the occluded areas in the left image. I define the occluded background pixels in the left image as follows:

if [p1_left, p2_left, ..., pk_left] -- correspond to --> p_right      # (note that p1_left < p2_left < ... < pk_left)
    then [p1_left, p2_left, ..., pk-1_left] are occluded background pixels, while pk_left is the foreground pixel.

They are occluded because several pixels in the left image point to one pixel in the right image, and the result is quite reasonable (the blue areas are the computed occluded areas in the left image). [image] Maybe this can help to separate out the weird doubling effect, but sadly I can't find a way to sample pixels in these areas from the right image... Actually, in the video super-resolution field people use deformable convolution networks to align images with optical flow instead of simply warping; maybe they can provide you with some enlightenment. ☞ https://github.com/YapengTian/TDAN-VSR-CVPR-2020
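
To make this rule concrete, here is a minimal toy sketch (the row width and the disparity values below are invented purely for illustration): every left-image column is mapped to its matching right-image column via x_right = x_left - d, and wherever several left columns land on the same right column, only the right-most one is kept as foreground while the rest are flagged as occluded background.

import numpy as np

# toy example: one image row of width 8 with a made-up left disparity
disp_row = np.array([0, 0, 0, 3, 2, 1, 0, 0])
x_left = np.arange(len(disp_row))
x_right = x_left - disp_row                    # left pixel x maps to right pixel x - d

occluded = np.zeros_like(x_left, dtype=bool)
for target in np.unique(x_right):
    sources = np.where(x_right == target)[0]   # all left pixels landing on this right column
    if len(sources) > 1:
        occluded[sources[:-1]] = True          # keep the right-most (foreground), flag the rest

print(x_right)   # [0 1 2 0 2 4 6 7]
print(occluded)  # left pixels 0 and 2 are flagged as occluded background

Here left pixels 0 and 3 both land on right column 0, so pixel 0 is occluded background and pixel 3 is the foreground pixel, exactly as the rule above states.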

Sarthak-22 commented 2 years ago

Your explanation does make sense, thanks a lot. So, other than the GitHub repo you linked, is there any simpler and faster way to avoid this occlusion problem in stereo matching? Also, is this a general problem in stereo matching with any architecture (GANet, PSMNet, etc.)? I ask because I never faced this weird effect with monocular depth estimation.

HenryChen98 commented 2 years ago

Hi, I believe this problem is general to the stereo scenario, but it has nothing to do with the stereo matching algorithms themselves.

The reason we encounter this problem when warping one view to the reference view according to the disparity is that, in multi-view imaging systems (including stereo cameras), occlusion means some areas of the scene are never observed by a subset of the cameras. Sampling unseen pixels to warp to another view is therefore impossible.

Meanwhile, occlusion is also a challenging problem for predicting depth in stereo matching, but with the exciting progress of deep learning, methods like GANet and PSMNet can try to predict the disparity in occluded areas by implicitly applying some high-level constraints (for example, that the disparity is smooth to some extent). This is exciting if we want to obtain the depth of a scene, but it still does not mean we have found correspondences in the other view for occluded areas. So if we cannot find corresponding pixels in the target image to warp to the reference image, especially when using backward warping, then for those occluded areas we are actually sampling foreground pixels instead of the unseen background pixels, which causes the doubling effect.

You also mentioned that this never occurred in monocular depth estimation; here are some reasons I can think of: 1) in monocular depth estimation there is obviously no cross-view occlusion (though I am not very familiar with that field, lol); 2) maybe in monocular depth estimation the warping operation is forward warping, which might not suffer from sampling the wrong pixels, though I think forward warping might produce holes or something. I am not sure.

Anyway, that's my opinion.

As for a fast way to avoid this occlusion problem in stereo matching, my solution is to warp only the unoccluded pixels, and if we want the warped image to look reasonable, we can sample the occluded areas from the reference image (which is cheating, because no information from the target image is obtained there :).

For example, first we warp the right image to the left view and compute a mask that excludes the occluded areas as well as the out-of-sight areas in the reference left image (depicted at the upper right of the plot below). The image pair is from KITTI 2015 and the disparity is derived by PSMNet. Applying this mask to the initial warped right-to-left image, we obtain a valid warped right-to-left image. So far, that is basically all we can do to warp the target right image to the left view based on the shared information observed by both views.

[image]

Furthermore, if we want a visually complete warped right-to-left image, we can cheat a little and fill these invalid areas from the reference left image. The result is great, except for some pixels in textureless areas!

[image]

For your convenience, here is my code to compute the invalid occluded and out-of-sight areas. My code might not be efficient, though; if you have a more efficient implementation, I would appreciate it if you shared it with me.


# ----- warping operation --- CHR ------
import math
import numpy as np

def occlude_det(xgrid, disp, if_flow_negative=True):
    # xgrid, disp: [batch, 1, H, W] tensors; xgrid holds pixel x-coordinates, disp the (signed) flow
    batch, chn, H, W = xgrid.size()
    print("width: ", W)

    xgrid = xgrid.detach().cpu().numpy()
    disp = disp.detach().cpu().numpy()

    flow_xgrid = xgrid + disp

    occlude_map = np.ones_like(flow_xgrid)

    for iter_batch in range(batch):
        for iter_chn in range(chn):
            for iter_row in range(H):
                if iter_row % 50 == 0:
                    print("-----------processing %d row..---------" % (iter_row))

                idx_target = {}
                # table to store the corresponding ref pixel coords for every idx of target coord
                for iter_col in range(W):
                    idx_target["%d" % (math.floor(flow_xgrid[iter_batch, iter_chn, iter_row, iter_col]))] = []
                    # initialize the table using flow_xgrid as keys

                for iter_col in range(W):
                    # flow_xgrid has floating-point and negative values...
                    idx_target["%d" % (math.floor(flow_xgrid[iter_batch, iter_chn, iter_row, iter_col]))].append(iter_col)
                    # fill the table

                    # areas in the ref image that map outside the target image are invalid too
                    # in short: set out-of-sight areas to zero
                    if not (0 < (math.floor(flow_xgrid[iter_batch, iter_chn, iter_row, iter_col])) < W):
                        occlude_map[iter_batch, iter_chn, iter_row, iter_col] = 0

                for iter_col in range(W):
                    if if_flow_negative:
                        idx_target["%d" % (math.floor(flow_xgrid[iter_batch, iter_chn, iter_row, iter_col]))].pop()
                        # pop the max ref pixel coord, which (when the flow is negative) is the foremost object;
                        # the pixel coords left in the list are background pixels occluded in the target image
                    else:
                        idx_target["%d" % (math.floor(flow_xgrid[iter_batch, iter_chn, iter_row, iter_col]))].pop(0)
                        # pop the min ref pixel coord, which (when the flow is positive) is the foremost object;
                        # the pixel coords left in the list are background pixels occluded in the target image

                    list_ref_idx = idx_target["%d" % (math.floor(flow_xgrid[iter_batch, iter_chn, iter_row, iter_col]))]
                    for iter_c, coord in enumerate(list_ref_idx):
                        # after popping out the foreground pixel, whatever is left here is an occluded pixel
                        # set occluded areas to zero
                        occlude_map[iter_batch, iter_chn, iter_row, coord] = 0

    print("number of occluded pixels: ", H * W - np.sum(occlude_map))

    return occlude_map
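
As a rough usage illustration of the two steps above (masking the warped image, then filling the invalid areas from the reference view), here is a minimal sketch. The tensors left_img, right_img and disp_px are placeholders I am assuming (NCHW images in [0, 1] and a left-view disparity in pixels), and apply_disparity is the warping function from the first post:

import torch

_, _, H, W = right_img.shape
xgrid = torch.arange(W).view(1, 1, 1, W).expand(1, 1, H, W).float()  # pixel x-coordinates

# the flow is negative because a left pixel's match lies to its left in the right image
occ_mask = torch.from_numpy(occlude_det(xgrid, -disp_px, if_flow_negative=True)).float()

# backward-warp the right image to the left view (disparity normalized by the width for grid_sample)
warped = apply_disparity(right_img, -disp_px / W)

valid_warped = warped * occ_mask                           # keep only pixels seen in both views
filled = valid_warped + left_img * (1 - occ_mask)          # "cheat": fill the rest from the left image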
Sarthak-22 commented 2 years ago

Ohh OK, got the point. Thanks a lot for your help; it really helped my analysis.

So, after observing all these weird warping results, can we say that stereo matching in general cannot generalize well and is not a good approach? Why do the best methods like GANet give such weird warping results even though GANet has consistently topped the KITTI leaderboard? I understand that the weird results are due to occlusions, but isn't stereo matching a much superior approach that should give relatively better outputs?

jucic commented 2 years ago

I encountered the same problem.

(quoting HenryChen98's full explanation and occlusion-detection code from the comment above)

Thanks for your analysis. I encountered the same problem when trying to warp an image with optical flow (or disparity), and I came to the same conclusion as you.

Actually, in the video super-resolution field people use deformable convolution networks to align images with optical flow instead of simply warping.

The above is another solution you mentioned; do you know how well it performs?

Liyunf123 commented 2 years ago

Dear author, may I ask what the two inputs to the code you provided are?

Liyunf123 commented 2 years ago

(quoting Sarthak-22's earlier question above)

Hi, I would like to know what the inputs to the code provided by the author are; what is xgrid? Could you help me? Sorry, my English is poor!

wgqt1zl commented 9 months ago

Solved for me, thanks.