YvanYin / Metric3D

The repo for "Metric3D: Towards Zero-shot Metric 3D Prediction from A Single Image" and "Metric3Dv2: A Versatile Monocular Geometric Foundation Model..."
https://jugghm.github.io/Metric3Dv2/
BSD 2-Clause "Simplified" License

Prediction shape mismatch with GroundTruth #156

Open saadehmd opened 2 months ago

saadehmd commented 2 months ago

Hi, thanks a lot for sharing your work and the instructions for training and inference. I have been trying to train 'dino_vit_small_reg.dpt_raft' on a mini-dataset of my own. I modeled it almost like KITTI, except of course for the differences in original image size, focal_length, and metric_scale. I also don't have any semantic maps, so those are just left empty. There are also no pre-computed normals, so those are just arrays of zeros (the way the base dataset initializes them from None).

This is the relevant part of the dataset config: [image] The depth map I use has all depth values in metric space, and 8 m is the maximum detection range.

The process_depth part: [image] I clip between 0.3 and 3.5 m and normalize to (0, 1) for the canonical transformation.
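For concreteness, a minimal sketch of the clipping and normalization just described (the function, the argument names, and the use of -1 to mark invalid pixels are illustrative assumptions, not the repo's actual code):

```python
import numpy as np

def process_depth(depth_m, d_min=0.3, d_max=3.5):
    """Clip metric depth to [d_min, d_max] and normalize valid pixels to (0, 1)."""
    depth = depth_m.astype(np.float32)
    valid = (depth > d_min) & (depth < d_max)
    out = np.full_like(depth, -1.0)                        # mark out-of-range pixels as invalid
    out[valid] = (depth[valid] - d_min) / (d_max - d_min)  # normalize valid range to (0, 1)
    return out
```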

Now, I get that this might not be the most desirable dataset given the small image sizes, so I wasn't expecting impressive results. But I was at least expecting to train on it without introducing errors in data loading / pre-processing. I can't figure out why, but a single forward() pass through the network generates predictions that do not have the same shape as the ground truth, so the loss calculation, and hence the training, fails at the very beginning.

Here are some debug prints of the prediction and GT shapes: [image]

I have also tried disabling almost all the augmentations except HorizontalFlip: [image]

ZachL1 commented 1 month ago

Since some operations in the network involve splitting into patches and up/down sampling, you need to ensure that the input's width and height are divisible by 28.

First, the ViT encoder splits the image into 14×14 patches, so the width and height should be divisible by 14; otherwise they will be padded. Second, the encoded tokens will be further encoded into features at four resolutions of the original image (1/14, 1/14, 1/7, 1/4) to provide to the RAFT decoder, so the padded width and height should also be divisible by 4; otherwise they will be truncated.

When your input size is (132, 176), this is what happens: $\lfloor \lceil 132/14 \rceil \cdot 14 / 4 \rfloor \cdot 4 = 140$ and $\lfloor \lceil 176/14 \rceil \cdot 14 / 4 \rfloor \cdot 4 = 180$.
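As a quick illustration of that calculation (my own sketch, not code from the repo), the effective prediction size can be computed like this:

```python
import math

def effective_size(x, patch=14, stride=4):
    padded = math.ceil(x / patch) * patch  # pad up to a multiple of the 14-pixel patch size
    return (padded // stride) * stride     # truncate down to a multiple of 4 for the decoder

print(effective_size(132), effective_size(176))  # -> 140 180
```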

Therefore, the recommended practice is to always include RandomCrop in the pipeline to ensure the input always complies (if the crop size is larger than the image size, padding will be applied; our implementation will automatically handle padding and unpadding). Additionally, given that ViT is sensitive to input size, and our training uses (616, 1064) as input, if you plan to fine-tune based on the pre-trained model, it's recommended to maintain consistent input sizes for better performance.
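Illustratively, the crop stage in a dataset pipeline could look something like the sketch below; the dict-style pipeline entry and the 'RandomCrop' type name are assumptions modeled loosely on the repo's existing dataset configs, so check those configs for the exact field names:

```python
# Hypothetical config excerpt; the type name and fields are illustrative,
# not copied from the repo's configs.
crop_size = (616, 1064)  # the training input size mentioned above

train_pipeline = [
    dict(type='RandomCrop', crop_size=crop_size),  # pads first if the image is smaller
    # ... other augmentations ...
]
```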

saadehmd commented 1 month ago

Thanks, the random cropping resolves the problem. But the losses don't seem right, especially this early in training: [image]

BTW, can the model even be trained without GT normals and GT semantic masks? Also, I'm training on single-channel images (I just clone the single channel into the three RGB channels). Would it still be a good idea to fine-tune from the pre-trained model "metric_depth_vit_small_800k" (in this case), or to train from scratch?

saadehmd commented 1 month ago

Never mind, I was using the wrong clipping range, so the GT depth was all -1. Fixed that, and the losses are now non-zero. But the other questions are still valid:

  1. Should I use fine-tuning or train from scratch?
  2. Is the absence of any GT normals or GT semantic masks OK?
  3. Would I get worse results using a greyscale image as RGB?
  4. Should I use the same data-normalization mean and std as in most dataset configs: mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375]?
  5. Since I have no sky masks, does it make sense to turn off the sky regularization?
  6. Should I use all of the following recommended masks for ViT.Raft.small? [image]

ZachL1 commented 1 month ago
  1. Should I use fine-tuning or train from scratch?
  2. Is the absence of any GT normals or GT semantic masks OK?
  3. Would I get worse results using a greyscale image as RGB?
  4. Should I use the same data-normalization mean and std as in most dataset configs: mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375]?
  5. Since I have no sky masks, does it make sense to turn off the sky regularization?
  6. Should I use all of the following recommended masks for ViT.Raft.small? [image]

Here are my thoughts:

  1. I suggest you try fine-tuning the pre-trained model first. You can refer to the finetuning protocols in Section 1.2 of the Metric3Dv2 paper appendix.
  2. The GT semantic mask is used for sky regularization, and GT normals are used for optimizing normals. It's okay if you don't have GT semantic masks and normals; the training pipeline will then only optimize depth. However, if none of your training data has GT normals, consider whether it's necessary to use the depth-normal consistency loss for joint optimization. You should experiment to find the best practice.
  3. I think it should be.
  4. Yes, if you use RGB input.
  5. If you don't have sky masks, whether you enable it or not, there will be no additional regularization of the sky, and this loss will be 0.
  6. The configuration looks okay. However, as discussed in 2, whether to use DeNoConsistencyLoss is worth considering and experimenting with.
oywenjun11 commented 1 month ago

@ZachL1 Hello, thank you for answering my questions earlier. Following the KITTI data structure requirements from your training example, I want to fine-tune your model for my use case:

  1. If my dataset has GT depth maps, RGB images, and GT surface-normal maps, which dataset class should I refer to for fine-tuning?
  2. I want to try testing scenes from a bird's-eye view (because you said the model lacks bird's-eye-view data), like the corner view of an elevator; do I still need GT surface-normal maps in that case? Thank you in advance for your answer!

saadehmd commented 1 month ago

@ZachL1 So, after training for 30 epochs on my dataset, I do get a decent drop in loss. [image]

I get roughly the same accuracy on test images. But when I save the output images using 'do_test.py', I get output like this: [image] Top: greyscale image, middle: prediction, bottom: GT depth.

I looked at the value ranges of the GT and predicted depth. It seems the network takes as input GT depth normalized between 0 and 1, while the prediction it outputs is normalized between 0 and 200. Does this immediately indicate some mistake I might be making, or is this by design?

ZachL1 commented 1 month ago

Hi, @oywenjun11

  1. All dataset classes inherit from BaseDataset. You can check how it loads data and inherit from it to implement your own Dataset class. If you do this, you can refer to Matterport3DDataset or any other derived class. Basically, you just need to ensure that load_batch() loads data as expected (note that you need to appropriately override some methods). Alternatively, you can build your own Dataset from scratch in the way you prefer.
  2. As I mentioned in my previous comment, GT normal is not necessary; you can optimize the model by directly relying on GT depth.

Hi, @saadehmd

I think this might primarily be a visualization issue. The lower left corner seems to be incorrectly predicted as sky. Referring to some existing discussions, you can try clipping the predicted depth first, and then see how the visualization results look.

... the visualization doesn't look very good due to the very large depth values in the sky. You can set the part exceeding a certain threshold to zero before color mapping. The pre-trained Metric3D should predict the sky depth as 200m or 150m as I recall, and in general setting the threshold to 100m should work for most scenes. Originally posted by @ZachL1 in https://github.com/YvanYin/Metric3D/issues/109#issuecomment-2159812386
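As a concrete illustration of that suggestion, a minimal sketch (my own, not code from the repo) that zeroes out depths above a threshold before colour-mapping:

```python
import matplotlib.pyplot as plt

def save_depth_vis(pred_depth, path, max_depth=100.0):
    """pred_depth: (H, W) float array of predicted metric depth."""
    depth = pred_depth.copy()
    depth[depth > max_depth] = 0.0           # suppress sky / far outliers
    plt.imsave(path, depth, cmap="viridis")  # color-map and write to disk

# save_depth_vis(pred_depth, "pred_vis.png")
```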

oywenjun11 commented 1 month ago

@ZachL1 Thank you so much for taking the time to answer my questions; that resolved my doubts. I have recently started using my own dataset to fine-tune your model, and I have the following questions:

  1. The data I use is synthetic, so the RGB images of the dataset are not very realistic (the depth images, of course, are more standard). I don't know whether this kind of RGB image has a big impact on fine-tuning the model.
  2. When training on the dataset described above, formatted like KITTI, I started from the large (ViT) pre-trained model (about 1.7 GB) with max_iters=40000 and found that the checkpoint size grew to 4.7 GB, which is a bigger change than I expected. Is this due to a problem with my dataset's parameter settings, or because I trained for too many iterations?
  3. When I run inference with the fine-tuned large model, the results are far worse than your original small model. I wonder if this is due to the poor quality of the RGB images in my dataset, the small amount of data (about 1k synthetic images), or overfitting, since the loss bounces back and forth in the later stages of training. [image] The good results come from the large and small pre-trained models, the bad one from the fine-tuned (large) model; the RGB image shows a person standing outside a door, holding a calibration board. [images: 2_small, 2_large, 2_step00040010] Thank you very much for answering my newbie questions. I don't know if I can get your contact information; if you agree, you can send it to my email (1335840035@qq.com) or keep in touch by email. Thank you in advance!

ZachL1 commented 1 month ago

Hi, @oywenjun11

  1. Synthetic data is also okay;
  2. The model weights will not increase during training; they should always remain constant;
  3. I will contact you via email for more details.
MoAbbasid commented 1 month ago

Hi, I'm trying to use your model on an outdoor scene to get metric depth and eventually a sense of scale at a specific location, but the depth values are too varied. Can you help me with this? More details here.

saadehmd commented 1 month ago

@ZachL1 Are we supposed to leave 'depth_normalize' at its default of (0.1, 200) for custom datasets? Or is this something we should change according to the expected depth range of the custom data? [image]

saadehmd commented 1 month ago

[GIF: metric3d]

I did finally manage to train it on my dataset without any scale ambiguity in the output, and I tried visualizing the data as a point cloud (instead of depth images) to make more sense of it; see the GIF attached above. The green points are the GT points in one of the test samples and the red ones are the predicted points. While there's quite a good overlap among the true positives (the planar patches in the GT are also present in the predicted cloud), there's also a lot of noise, i.e. scattered false-positive points around those patches.

I am thinking the following could be the reasons (what are your suggestions on these?):

  1. Since I am not actually training on RGB images but on a single-channel grey image, the mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375] are probably creating problems rather than properly normalizing the three channels. (Do you think skipping this RGB normalization would help? See the sketch after this list.)

  2. Maybe more training epochs, and the larger ViT, would help.

  3. I have removed almost all of the resize, crop, and distortion augmentations from the current training, keeping just the bare minimum for the canonical resize and padding. Maybe training with more augmentations would help?
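For point 1, here is a minimal sketch (my own assumption of how this might be handled, not the repo's code) of cloning a grey image into three channels and applying the per-channel mean/std; a single shared mean/std would be the alternative to compare against:

```python
import numpy as np

mean = np.array([123.675, 116.28, 103.53], dtype=np.float32)
std = np.array([58.395, 57.12, 57.375], dtype=np.float32)

def grey_to_normalized_rgb(grey):
    """grey: (H, W) uint8 image -> (H, W, 3) float32, normalized per channel."""
    rgb = np.repeat(grey[..., None].astype(np.float32), 3, axis=-1)
    return (rgb - mean) / std

# Alternative to experiment with: one shared mean/std for the cloned channels,
# e.g. (rgb - mean.mean()) / std.mean()
```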

[image] Also, I don't understand why it's predicting non-zero depths for areas of the image that are completely blacked out.

saadehmd commented 1 month ago

[image] I trained again with the dark regions of the RGB image masked out and applied some outlier removal to the point cloud (projected from the predicted depth image), and it looks much better now. Thanks for all the help :)
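For anyone reproducing this, a minimal sketch of the kind of outlier removal described, using Open3D's statistical outlier filter (the parameter values are illustrative, not the ones actually used above):

```python
import numpy as np
import open3d as o3d

# "points" stands in for the (N, 3) cloud back-projected from the predicted
# depth map; random data here just keeps the sketch self-contained.
points = np.random.rand(1000, 3)

pcd = o3d.geometry.PointCloud()
pcd.points = o3d.utility.Vector3dVector(points)

# Drop points whose mean distance to their 20 nearest neighbours deviates by
# more than 2 standard deviations from the global average distance.
filtered, inlier_idx = pcd.remove_statistical_outlier(nb_neighbors=20, std_ratio=2.0)
```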