YvanYin / Metric3D

The repo for "Metric3D: Towards Zero-shot Metric 3D Prediction from A Single Image" and "Metric3Dv2: A Versatile Monocular Geometric Foundation Model..."
https://jugghm.github.io/Metric3Dv2/
BSD 2-Clause "Simplified" License

How to improve the metric depth value? #161

Open MoAbbasid opened 2 months ago

MoAbbasid commented 2 months ago

Hi, I need to measure the distance to an object. I gathered a small dataset of outdoor images taken at varied distances to test with, and the model's results vary a lot. My questions are:

What is the best practice for improving the results? I have already calibrated the camera and have the intrinsics; what else can I do? Also, the model output is clipped to stay under a certain value to account for the sky, correct?

My images are w=3024, h=4032. The code I use to generate the depth and visualization is below.

[image]

The red dot is 4m from the camera, but I got 11.45 from the vit-small model and 20.764 from the vit-large model, which is obviously way off. Another test I ran at 2m produced 1.3 from vit_small and 1.5 from vit_large, which is still not ideal but workable.

import cv2
import torch

rgb_file = '/content/MG_5u_4m.jpg'
input_size = (616, 1064)  # (h, w) expected by the ViT models
intrinsic = [3000, 3000, 1529.95662, 1976.17563]  # [fx, fy, cx, cy] from calibration
padding_values = [123.675, 116.28, 103.53]  # ImageNet mean, used as pad color

# Load image and convert BGR -> RGB
rgb_origin = cv2.imread(rgb_file)[:, :, ::-1]

# Rescale so the image fits inside the model's input size
h, w = rgb_origin.shape[:2]
scale = min(input_size[0] / h, input_size[1] / w)
rgb = cv2.resize(rgb_origin, (int(w * scale), int(h * scale)), interpolation=cv2.INTER_LINEAR)

# Scale the intrinsics by the same factor
intrinsic = [intrinsic[0] * scale, intrinsic[1] * scale, intrinsic[2] * scale, intrinsic[3] * scale]

# Pad to the full input size
h, w = rgb.shape[:2]
pad_h = input_size[0] - h
pad_w = input_size[1] - w
pad_h_half = pad_h // 2
pad_w_half = pad_w // 2
rgb = cv2.copyMakeBorder(rgb, pad_h_half, pad_h - pad_h_half, pad_w_half, pad_w - pad_w_half, cv2.BORDER_CONSTANT, value=padding_values)
pad_info = [pad_h_half, pad_h - pad_h_half, pad_w_half, pad_w - pad_w_half]

# Normalize with ImageNet statistics
mean = torch.tensor([123.675, 116.28, 103.53]).float()[:, None, None]
std = torch.tensor([58.395, 57.12, 57.375]).float()[:, None, None]
rgb = torch.from_numpy(rgb.transpose((2, 0, 1))).float()
rgb = torch.div((rgb - mean), std)
rgb = rgb[None, :, :, :].cuda()

# Load model
model = torch.hub.load('yvanyin/metric3d', 'metric3d_vit_small', pretrain=True)
model.cuda().eval()

# Perform inference
with torch.no_grad():
    pred_depth, confidence, output_dict = model.inference({'input': rgb})

# Remove the padding
pred_depth = pred_depth.squeeze()
pred_depth = pred_depth[pad_info[0] : pred_depth.shape[0] - pad_info[1], pad_info[2] : pred_depth.shape[1] - pad_info[3]]

# Upsample back to the original resolution
pred_depth = torch.nn.functional.interpolate(pred_depth[None, None, :, :], rgb_origin.shape[:2], mode='bilinear').squeeze()

# De-canonical transform: the network predicts depth for a canonical camera
# with focal length 1000, so rescale by the real (scaled) focal length
canonical_to_real_scale = intrinsic[0] / 1000.0
pred_depth = pred_depth * canonical_to_real_scale  # now the depth is metric
pred_depth = torch.clamp(pred_depth, 0, 300)
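
For reference, a minimal sketch of how I read out the distance at the marked pixel after this pipeline; the (u, v) coordinates are hypothetical stand-ins for the red dot, and the numbers in the comments just trace the arithmetic of the script above.

# With h=4032, w=3024: scale = min(616/4032, 1064/3024) ≈ 0.153,
# so the scaled fx ≈ 3000 * 0.153 ≈ 458 and canonical_to_real_scale ≈ 0.46.
u, v = 1500, 2800  # hypothetical (x, y) pixel of the red dot, in original image coordinates
depth_m = pred_depth[v, u].item()  # pred_depth is (H, W) after upsampling to the original size
print(f'depth at ({u}, {v}): {depth_m:.2f} m')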

Any pointers for getting this close to real-world scale would be appreciated.

oywenjun11 commented 1 month ago

@MoAbbasid Hello, if you can only adjust the post-processing without changing the model, what I have been trying is adjusting fx and fy in the intrinsic matrix. [image]

MoAbbasid commented 1 month ago

Hi @oywenjun11, I obtained these values by calibrating with OpenCV using a chessboard pattern. Are you suggesting I just change these values arbitrarily, and keep cx and cy the same?

oywenjun11 commented 1 month ago


@MoAbbasid Hello, my suggestion is to try modifying fx and fy; they do not necessarily have to be the values from your OpenCV calibration, because this part of the post-processing simply rescales the predicted depth map. [image]
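
To spell out why the calibrated values are not sacred here, a sketch using the variables from the script above: the de-canonical step is a single multiplication, so the recovered depth is linear in whatever fx you plug in.

# depth_metric = depth_canonical * fx / 1000.0, so e.g. halving fx halves every depth value
fx_trial = 500.0  # an arbitrary trial focal length instead of the calibrated one
pred_depth_trial = pred_depth / canonical_to_real_scale * (fx_trial / 1000.0)  # re-scale the already de-canonicalized map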

sukasu403 commented 1 month ago

@MoAbbasid Hello, I have also used OpenCV's chessboard calibration to estimate the intrinsic parameters, but I found that its results fluctuate a lot (probably due to the strict quality requirements on the calibration images), making accurate calibration difficult. It can even perform worse than using the default parameters.

MoAbbasid commented 1 month ago

Hello @oywenjun11 @sukasu403, I tried adjusting the focal lengths fx and fy, but that doesn't work: the results are not consistent, i.e. the best value differs from image to image.

In the same image where the distance is 4m, I got the desired result at f=500:

[image]

But for the second picture, where the ground truth is actually 2m, I got the desired result at f=250: [image]

So it's not consistent across different pictures.

What else can I try in order to get uniform metric values? Could it be that the sky values are causing the depth results to be too large?

YvanYin commented 1 month ago

Most training images have larger widths than heights, so maybe try adjusting the height/width.
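
One literal reading of this suggestion, as a hedged sketch only (see the later reply from JUGGHM, who notes they normally do not transpose): rotate the portrait image to landscape before the preprocessing above and remap the intrinsics accordingly; the fx/fy swap and principal-point remap follow from the rotation.

h, w = rgb_origin.shape[:2]
rgb_origin = cv2.rotate(rgb_origin, cv2.ROTATE_90_CLOCKWISE)  # maps (x, y) -> (h - 1 - y, x)
fx, fy, cx, cy = intrinsic
intrinsic = [fy, fx, h - 1 - cy, cx]  # swap focal lengths and remap the principal point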

MoAbbasid commented 1 month ago

@YvanYin, so I flipped the height and width.

In the 2m GT image I got 2m at intrinsic = [500, 500, 1512, 2016]: [image]

In the 4m GT image I got 4m at intrinsic = [750, 750, 1512, 2016]: [image]

The results still vary for other distances and images as well. I'm using vit_large; any other suggestions? How big of a role do the sky values play?

JUGGHM commented 3 weeks ago


Oh, we normally do not transpose the height and width. As for the sky values, we suggest simply using the confidence map to filter them out.
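
A sketch of that filtering, assuming the confidence tensor returned by model.inference needs the same un-padding and upsampling as the depth and that higher values mean more reliable pixels (the 0.5 threshold is an arbitrary placeholder):

confidence = confidence.squeeze()
confidence = confidence[pad_info[0] : confidence.shape[0] - pad_info[1], pad_info[2] : confidence.shape[1] - pad_info[3]]
confidence = torch.nn.functional.interpolate(confidence[None, None, :, :], rgb_origin.shape[:2], mode='bilinear').squeeze()
mask = confidence > 0.5  # placeholder threshold; tune per scene
pred_depth = torch.where(mask, pred_depth, torch.zeros_like(pred_depth))  # zero out sky / low-confidence pixels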

For the inconsistency, what about center-cropping the first image and resizing it back to the original size? I think this inconsistency is largely caused by insufficient training data with large image sizes and small focal lengths.
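
A sketch of that experiment, assuming a centered crop of half the frame resized back to the full resolution; the crop shifts the principal point and the up-resize multiplies all four intrinsics by the zoom factor, which is what raises the effective focal length:

h, w = rgb_origin.shape[:2]
crop = 0.5  # keep the central 50% (an arbitrary choice)
ch, cw = int(h * crop), int(w * crop)
y0, x0 = (h - ch) // 2, (w - cw) // 2
rgb_crop = rgb_origin[y0:y0 + ch, x0:x0 + cw]
rgb_origin = cv2.resize(rgb_crop, (w, h), interpolation=cv2.INTER_LINEAR)
fx, fy, cx, cy = intrinsic
zoom = 1.0 / crop  # resizing back up scales all four intrinsics
intrinsic = [fx * zoom, fy * zoom, (cx - x0) * zoom, (cy - y0) * zoom]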