HKUST-Aerial-Robotics / MVDepthNet

This repository provides the PyTorch implementation of the 3DV 2018 paper "MVDepthNet: real-time multiview depth estimation neural network"
GNU General Public License v3.0

Did I miss something to get a good depth? #8

Closed. lc82111 closed this issue 5 years ago

lc82111 commented 5 years ago

Thanks for your great work. I followed example2.py to run the network on my own data, but the result does not look good. The code and images are provided below. Please tell me if I missed anything.

import cv2
import numpy as np
import torch
import torch.backends.cudnn as cudnn
from torch import Tensor
from torch.autograd import Variable
from numpy.linalg import inv
from depthNet_model import depthNet  # model definition shipped with this repo

left_image = cv2.imread("./left.jpeg")
right_image = cv2.imread("./right.jpeg")

camera_k_left = np.asarray([
                            [1.7141879128232438e+003, 0.,                      1.2686456493940061e+003],
                            [0,                       1.7141879128232438e+003, 9.9575285430241513e+002],
                            [0,                       0,                       1]])

camera_k_right = np.asarray([ 
                            [1.7141879128232438e+003, 0.,                      1.2666075491361062e+003],
                            [0,                       1.7141879128232438e+003, 9.8047895362229440e+002],
                            [0,                       0,                       1]])

left2right = np.asarray([
                        [9.9969708004761548e-001, -1.7112957892382444e-002,  1.7688833100150528e-002, -8.3976622746264312e+001],
                        [1.6926228781311496e-002, 9.9979999147940424e-001,   1.0652690600304717e-002, 6.4193373297895686e+000],
                        [-1.7867594228494717e-002, -1.0350058451847681e-002, 9.9978678995400272e-001, -2.9538222186700258e+000],
                        [0,                       0,                         0,                        1]])

## process images
# scale to 320x256
original_width = left_image.shape[1]
original_height = left_image.shape[0]
factor_x = 320.0 / original_width
factor_y = 256.0 / original_height

left_image = cv2.resize(left_image, (320, 256))   # (256, 320, 3)
right_image = cv2.resize(right_image, (320, 256)) # (256, 320, 3)
camera_k_left[0, :] *= factor_x
camera_k_left[1, :] *= factor_y
camera_k_right[0, :] *= factor_x
camera_k_right[1, :] *= factor_y

# convert to pytorch format
torch_left_image = np.moveaxis(left_image, -1, 0) # (3, 256, 320)
torch_left_image = np.expand_dims(torch_left_image, 0) # (1, 3, 256, 320)
torch_left_image = (torch_left_image - 81.0)/ 35.0 # whiten
torch_right_image = np.moveaxis(right_image, -1, 0)
torch_right_image = np.expand_dims(torch_right_image, 0)
torch_right_image = (torch_right_image - 81.0) / 35.0

# Variable(..., volatile=True) is the old (pre-0.4) PyTorch inference API;
# on PyTorch >= 0.4 wrap the forward pass in torch.no_grad() instead.
left_image_cuda = Tensor(torch_left_image).cuda()
left_image_cuda = Variable(left_image_cuda, volatile=True)

right_image_cuda = Tensor(torch_right_image).cuda()
right_image_cuda = Variable(right_image_cuda, volatile=True)

## process camera params
# for warp the image to construct the cost volume
pixel_coordinate = np.indices([320, 256]).astype(np.float32)
pixel_coordinate = np.concatenate((pixel_coordinate, np.ones([1, 320, 256])), axis=0)
pixel_coordinate = np.reshape(pixel_coordinate, [3, -1]) # [0,:] in [0,319]; [1,:] in [0,255]; [2,:]==1;
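
# (Explanatory note) For a left-image pixel u at candidate depth d, its projection
# into the right image is, in homogeneous coordinates,
#   KR * (R * d * KL^-1 * u + t)  ~  KRKiUV * d + KT,
# so the depth-independent terms KRKiUV = KR*R*KL^-1*u and KT = KR*t are
# precomputed below and the cost volume is sampled along KRKiUV + KT / d.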

left_in_right_T = left2right[0:3, 3]  # translation vector
left_in_right_R = left2right[0:3, 0:3]  # rotation matrix
KL = camera_k_left
KR = camera_k_right
KL_inverse = inv(KL)
KRK_i = KR.dot(left_in_right_R.dot(KL_inverse))
KRKiUV = KRK_i.dot(pixel_coordinate)
KT = KR.dot(left_in_right_T)
KT = np.expand_dims(KT, -1)
KT = np.expand_dims(KT, 0)
KT = KT.astype(np.float32)
KRKiUV = KRKiUV.astype(np.float32)
KRKiUV = np.expand_dims(KRKiUV, 0)
KRKiUV_cuda_T = Tensor(KRKiUV).cuda()
KT_cuda_T = Tensor(KT).cuda()

# model
depthnet = depthNet()
model_data = torch.load('opensource_model.pth.tar')
depthnet.load_state_dict(model_data['state_dict'])
depthnet = depthnet.cuda()
cudnn.benchmark = True
depthnet.eval()

predict_depths = depthnet(left_image_cuda, right_image_cuda, KRKiUV_cuda_T, KT_cuda_T)
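
To inspect the prediction, something like the following should work. This is only a sketch: it assumes the first element of the output list is the full-resolution inverse-depth map (as suggested by the paper); adjust if the output layout differs.

inv_depth = np.squeeze(predict_depths[0].cpu().data.numpy())  # assumed shape (256, 320), inverse depth
depth_m = 1.0 / np.clip(inv_depth, 1e-3, None)                # metric depth in meters
vis = cv2.normalize(inv_depth, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
cv2.imwrite("idepth_vis.png", cv2.applyColorMap(vis, cv2.COLORMAP_JET))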

[attached images: left, right, epipolar, depth]

WANG-KX commented 5 years ago

Dear,

Thanks for your interest in the project! I haven't fully checked the data you provided, but I have a small question: is the translation in left2right in meters? Judging from the images, the camera motion does not look that large.
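
For reference, a quick way to sanity-check the scale from the matrix you posted (a small numpy sketch):

import numpy as np

t = np.array([-8.3976622746264312e+01, 6.4193373297895686e+00, -2.9538222186700258e+00])
baseline = np.linalg.norm(t)  # ~84.3
# A baseline of ~84 m would be implausible for a stereo rig, whereas 84.3 mm (~8.4 cm)
# is typical, which suggests the calibration is in millimeters.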

Regards, Kaixuan

lc82111 commented 5 years ago

Thanks for the reply. The translation is in mm.

WANG-KX commented 5 years ago

Ok ... The default unit in the network is the meter. Also, there is very little texture in the environment, so this is a very difficult stereo problem for most solutions (not to mention motion-stereo solutions). You can try other stereo methods on these images; I think the background on the right part will be hard to estimate.

lc82111 commented 5 years ago

Thanks for the advice.

The default unit is the meter in the network.

Does that mean I should re-calibrate my stereo camera in meters to use the network?

WANG-KX commented 5 years ago

You don't need to recalibrate; just divide the translation values by 1000 if the calibration is correct and in mm. Also, if you are using stereo cameras, I recommend using stereo matching methods to get better results.
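
Concretely, in the code you posted it should be enough to rescale the translation before building KT (a sketch assuming the calibration is in mm):

left2right[0:3, 3] /= 1000.0          # translation from mm to meters
left_in_right_T = left2right[0:3, 3]  # now ~[-0.084, 0.0064, -0.0030] m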

lc82111 commented 5 years ago

Thanks for the reply.

if you are using stereo cameras, I recommend using stereo matching methods to get better results.

My task is to reconstruct the 3D human surface from stereo cameras without structured light. The difficulty lies in the textureless human body, so I turned to this project for help. Any suggested stereo matching methods? Thanks.

WANG-KX commented 5 years ago

For stereo methods, you can look at the KITTI stereo and Middlebury stereo benchmarks for more information.
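
As a quick classical baseline on rectified stereo pairs, OpenCV's semi-global matcher is a reasonable starting point. The sketch below is only illustrative; the image names and matcher parameters (numDisparities, blockSize, etc.) are placeholders that need tuning for the actual rig.

import cv2
import numpy as np

left_gray = cv2.imread("left.jpeg", cv2.IMREAD_GRAYSCALE)
right_gray = cv2.imread("right.jpeg", cv2.IMREAD_GRAYSCALE)

# numDisparities must be a multiple of 16; P1/P2 follow the usual 8*c*b^2 / 32*c*b^2 rule
matcher = cv2.StereoSGBM_create(minDisparity=0, numDisparities=128, blockSize=5,
                                P1=8 * 5 * 5, P2=32 * 5 * 5)
disparity = matcher.compute(left_gray, right_gray).astype(np.float32) / 16.0  # fixed-point -> pixels
# metric depth (rectified pair): depth = fx * baseline / disparity, for disparity > 0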