AnjieCheng / SpatialRGPT

[NeurIPS'24] This repository is the implementation of "SpatialRGPT: Grounded Spatial Reasoning in Vision Language Models"
https://www.anjiecheng.me/SpatialRGPT
Apache License 2.0

How to preprocess the depth input? Thanks! #2

Open AndyCao1125 opened 1 week ago

AndyCao1125 commented 1 week ago

Thanks for your great work!

I have a question: are the depth images generated before training, or generated on the fly by DepthAnything during training?

Thank you!

AnjieCheng commented 1 week ago

Hi, we pre-generate the depth data before training to speed up the overall process. As mentioned in the README, we use the following function to save the output from DepthAnything:

import numpy as np
import torch.nn.functional as F
from PIL import Image

def save_raw_16bit(depth, fpath, height, width):
    # Resize the raw depth map to the target image resolution
    depth = F.interpolate(depth[None, None], (height, width), mode='bilinear', align_corners=False)[0, 0]
    # Min-max normalize to [0, 255] so the values match the 8-bit RGB range
    depth = (depth - depth.min()) / (depth.max() - depth.min()) * 255.0
    depth = depth.cpu().numpy().astype(np.uint8)
    # Replicate the single channel three times to mimic a 3-channel RGB image
    colorized_depth = np.stack([depth, depth, depth], axis=-1)

    depth_image = Image.fromarray(colorized_depth)
    depth_image.save(fpath)
AndyCao1125 commented 1 week ago

Thanks for your prompt reply!

AndyCao1125 commented 3 days ago

Dear authors,

Thank you! It seems that this work feeds the depth images into a frozen image encoder that was pretrained on RGB data, and I'd like to ask for some clarification on the preprocessing steps.

In particular, how should I normalize depth images? Since the encoder is frozen and designed for RGB inputs, is it important for the depth image values to be within a similar range as RGB images? Or is there a specific normalization strategy for depth data in this case?

Thanks!

AnjieCheng commented 2 days ago

Hi,

Yes, you’re right. We need to normalize the depth to match the RGB range, as shown in the save_raw_16bit function from the previous response:

depth = (depth - depth.min()) / (depth.max() - depth.min()) * 255.0
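As a minimal NumPy sketch of that one line (illustrative only; the variable names are mine, and the interpolation and file-saving steps from the full function are omitted), the min-max normalization maps any depth map into the same 0-255 range as an 8-bit RGB image, and replicating the channel makes it acceptable to a frozen RGB encoder:

```python
import numpy as np

# Hypothetical raw depth map, e.g. in meters (any positive range works)
raw_depth = np.array([[0.5, 2.0], [4.0, 8.5]], dtype=np.float32)

# Min-max normalize to [0, 255], matching the 8-bit RGB value range
norm = (raw_depth - raw_depth.min()) / (raw_depth.max() - raw_depth.min()) * 255.0
depth_u8 = norm.astype(np.uint8)

# Replicate the single channel to 3 channels so an RGB-pretrained encoder accepts it
depth_rgb = np.stack([depth_u8, depth_u8, depth_u8], axis=-1)

print(depth_u8.min(), depth_u8.max())   # 0 255
print(depth_rgb.shape)                  # (2, 2, 3)
```

Note that this normalization is per-image: the nearest and farthest points always map to 0 and 255, so absolute scale is not preserved across images.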