saadehmd opened this issue 2 months ago
Since some operations in the network involve splitting into patches and up/down sampling, you need to ensure that the input's width and height are divisible by 28.
First, the ViT encoder splits the image into 14×14 patches, so the width and height should be divisible by 14, otherwise they will be padded. Second, the encoded tokens are further encoded into features at four resolutions of the original image (1/14, 1/14, 1/7, 1/4), which are fed to the RAFT decoder. Therefore, the padded width and height should also be divisible by 4, otherwise they will be truncated.
When your input size is (132, 176), this is what happens:
$\lfloor \lceil 132/14 \rceil \cdot 14 / 4 \rfloor \cdot 4 = 140$
$\lfloor \lceil 176/14 \rceil \cdot 14 / 4 \rfloor \cdot 4 = 180$
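A quick way to sanity-check that arithmetic for any input size (plain Python, illustrative only; this is not the repository's padding code):

```python
import math

def effective_size(h, w, patch=14, down=4):
    """Pad each side up to a multiple of the ViT patch size (14),
    then truncate to a multiple of the decoder stride (4)."""
    def one_dim(x):
        padded = math.ceil(x / patch) * patch   # padding step
        return (padded // down) * down          # truncation step
    return one_dim(h), one_dim(w)

print(effective_size(132, 176))    # (140, 180), matching the example above
print(effective_size(616, 1064))   # (616, 1064): multiples of 28 pass through unchanged
```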
Therefore, the recommended practice is to always include RandomCrop in the pipeline to ensure the input always complies with this constraint (if the crop size is larger than the image size, padding will be applied; our implementation automatically handles padding and unpadding).
Additionally, given that ViT is sensitive to input size, and our training uses (616, 1064) as input, if you plan to fine-tune based on the pre-trained model, it's recommended to maintain consistent input sizes for better performance.
Thanks, the random cropping resolves the problem. But the losses don't seem right, especially this early in training:-
By the way, can the algorithm even be trained without GT normals and GT semantic masks? Also, I'm training on images that are single channel (i.e. I just clone the single channel into the RGB channels). Would it still be a good idea to fine-tune the pre-trained model "metric_depth_vit_small_800k" (in this case), or to train from scratch?
Never mind, I was using the wrong clipping range, so the GT depth was all -1. Fixed that, and now the losses are non-zero. But the other questions are still valid:
- Should I use fine-tuning or train from scratch?
- Is the absence of any GT normals or GT semantic masks OK?
- Would I get worse results using a greyscale image as RGB?
- Should I use the same data-normalization mean and std as used in most dataset configs: mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375]?
- Since I have no sky masks, does it make sense to turn off the sky regularization?
- Should I use all of the following recommended masks for ViT.Raft.small:-
Here are my thoughts:
- I suggest you first try fine-tuning the pre-trained model. You can refer to the fine-tuning protocol in Section 1.2 of the appendix of the Metric3Dv2 paper.
- GT semantic masks are used for sky regularization, and GT normals are used to supervise the normals. If you have neither GT semantic masks nor GT normals, that's fine; the training pipeline will simply only optimize depth. However, if none of your training data has GT normals, consider whether joint optimization with the depth-normal consistency loss is still necessary. You should experiment to find the best practice.
- I think it should be fine.
- Yes, if you use RGB input.
- If you don't have sky masks, no extra regularization is applied to the sky whether or not you enable it, and this loss will be 0 (see the sketch after this list).
- The config looks fine. However, as discussed in 2, whether to use DeNoConsistencyLoss is worth considering and experimenting with.
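To illustrate the sky-regularization point, here is a schematic sketch (not the repository's actual loss code; the 200 m sky-depth constant is just an assumed placeholder) of why a masked regularizer contributes nothing when the mask is empty:

```python
import torch

def sky_regularization(pred_depth, sky_mask, sky_depth=200.0):
    """Schematic: push predicted depth toward a large constant inside the sky mask.
    With no sky labels the mask is empty, so the term is simply 0."""
    if sky_mask.sum() == 0:
        return pred_depth.new_tensor(0.0)
    return (pred_depth[sky_mask] - sky_depth).abs().mean()
```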
@ZachL1 Hello, thank you for answering my questions earlier about the data structure requirements of the KITTI data in your training example. I want to fine-tune your model to suit my use case:
@ZachL1 So, after training for 30 epochs on the dataset I have, I do get a decent drop in loss, apparently.
I get roughly the same accuracy on test images. But then I save the output images using 'do_test.py' and I get output like this:- Top: greyscale image, middle: prediction, bottom: GT depth.
I looked at the ranges of values in the GT and predicted depth. It seems like the network takes as input GT depth normalized between 0 and 1, while it outputs a prediction that is normalized between 0 and 200. Does this immediately indicate some mistake I might be making, or is this by design?
Hi, @oywenjun11
You can check how BaseDataset loads data and inherit from it to implement your own Dataset class. If you do this, you can refer to Matterport3DDataset or any other derived class. Basically, you just need to ensure that load_batch() loads data as expected (note that you need to appropriately override some methods). Alternatively, you can build your own Dataset from scratch in the way you prefer.
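A hypothetical skeleton of that first approach (every import path and signature below is an assumption made for illustration; check the actual BaseDataset in the repository for the real interface):

```python
# Hypothetical sketch only -- names and signatures are assumed, not copied from the repo.
from mono.datasets import BaseDataset  # import path assumed

class MyCustomDataset(BaseDataset):
    def __init__(self, cfg, phase, **kwargs):
        super().__init__(cfg, phase, **kwargs)
        # e.g. read your own annotation file / build file lists here

    def load_batch(self, *args, **kwargs):
        # Override so that RGB, depth, intrinsics (and optional normals /
        # semantic masks) are loaded from your own storage format, while
        # returning the same structure the base class and training pipeline expect.
        raise NotImplementedError
```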
Hi, @saadehmd
I think this might primarily be a visualization issue. The lower-left corner seems to be incorrectly predicted as sky. Referring to some existing discussions, you can try clipping the predicted depth first, and then see how the visualization results look.
... the visualization doesn't look very good due to the very large depth values in the sky. You can set the part exceeding a certain threshold to zero before color mapping. The pre-trained Metric3D should predict the sky depth as 200m or 150m as I recall, and in general setting the threshold to 100m should work for most scenes. Originally posted by @ZachL1 in https://github.com/YvanYin/Metric3D/issues/109#issuecomment-2159812386
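For reference, a minimal sketch of that clipping step before color mapping (the threshold, function name, and output path here are placeholders, not the repository's visualization code):

```python
import numpy as np
import matplotlib.pyplot as plt

def visualize_depth(pred_depth, max_depth=100.0, out_path="pred_depth_vis.png"):
    """Zero out implausibly large depths (e.g. sky predicted around 150-200 m)
    before mapping to colors, so the rest of the scene keeps its contrast."""
    depth = np.asarray(pred_depth, dtype=np.float64).copy()
    depth[depth > max_depth] = 0.0        # or np.clip(depth, 0.0, max_depth)
    plt.imsave(out_path, depth, cmap="turbo")

# usage: pred_depth is an HxW array in meters
# visualize_depth(pred_depth, max_depth=100.0)
```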
@ZachL1 Thank you so much for taking the time out of your busy schedule to answer my questions; that resolved my doubts. I have recently started using my own dataset to try to fine-tune your model, and I have the following questions:
Hi, @oywenjun11
Hi, I'm trying to use your model on an outdoor scene to get metric depth, to eventually get a sense of scale at a specific location, but the depth values are too varied. Can you help me with this? More details here.
@ZachL1 Are we supposed to leave 'depth_normalize' at its default (0.1, 200) for custom datasets? Or is this something we should change according to the expected depth range of the custom data?
I did manage to finally train it on my dataset without any scale ambiguities in the output, and I tried to visualize the data as a point cloud (instead of depth images) to make more sense of it. In the attached GIF, the green points are the GT points for one of the test samples and the red points are the predicted points. While there is quite a good overlap for the true positives (the plane patches in the GT are also present in the predicted cloud), there is also a lot of noise, i.e. scattered false-positive points around those patches.
I am thinking the following could be the reasons (what are your suggestions on these?):-
- Since I am not actually training on RGB images but on a single-channel grey image, the mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375] are probably causing problems instead of properly normalizing the three channels. (Do you think skipping this RGB normalization would help? See the sketch after this list.)
- Maybe more epochs of training, and with the larger ViT, might help.
- I have removed almost all the resize, crop, and distortion augmentations from the current training, keeping just the bare minimum for the canonical resize and padding. Maybe training with more augmentations would help?
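On the first point, here is a hedged sketch of computing your own single-channel statistics and reusing them for all three cloned channels, instead of the ImageNet RGB values (the directory path and file pattern are assumptions):

```python
import numpy as np
import cv2
from pathlib import Path

# Accumulate mean/std over the training images in the 0-255 range,
# matching the convention of the mean/std values in the configs.
pix_sum, pix_sq_sum, count = 0.0, 0.0, 0
for p in Path("train_images").glob("*.png"):            # path/pattern assumed
    img = cv2.imread(str(p), cv2.IMREAD_GRAYSCALE).astype(np.float64)
    pix_sum += img.sum()
    pix_sq_sum += (img ** 2).sum()
    count += img.size

mean = pix_sum / count
std = float(np.sqrt(pix_sq_sum / count - mean ** 2))
print([round(mean, 3)] * 3, [round(std, 3)] * 3)  # same value for each cloned channel
```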
Also, I don't seem to understand why it's predicting non-zero depths for areas of the image that are totally blacked out.
I trained again, masking out the dark regions in the RGB image and applying some outlier removal to the point cloud (projected from the predicted depth image), and it looks much better now. Thanks for all the help :)
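For reference, roughly what that post-processing can look like, sketched with Open3D (the dark threshold, pinhole back-projection, and filter parameters are assumptions, not values from this thread):

```python
import numpy as np
import open3d as o3d

def mask_and_filter(pred_depth, gray_img, fx, fy, cx, cy, dark_thresh=10):
    """Drop depth where the input image is (nearly) black, back-project the rest
    with a pinhole model, and remove statistical outliers from the point cloud."""
    depth = pred_depth.copy()
    depth[gray_img < dark_thresh] = 0.0                 # mask blacked-out regions

    v, u = np.nonzero(depth)                            # pixel coordinates with valid depth
    z = depth[v, u]
    pts = np.stack([(u - cx) * z / fx, (v - cy) * z / fy, z], axis=1)

    pcd = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(pts))
    pcd, _ = pcd.remove_statistical_outlier(nb_neighbors=20, std_ratio=2.0)
    return pcd
```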
Hi, thanks a lot for sharing your work and the instructions for training and inference. I have been trying to train 'dino_vit_small_reg.dpt_raft' on a mini-dataset of my own. I basically modeled it almost like KITTI, except of course for the differences in original image size, focal_length, and metric_scale. I also don't have any semantic maps, so those are basically just left empty. There are also no pre-calculated normals, so those are just arrays of zeros (the way the base dataset initializes them from None).
This is the relevant part of the dataset config:- The depth map I use basically has all the depth values in metric space, and 8 m is the max detection range.
The process_depth part:- I clip between 0.3 and 3.5 m and normalize to (0, 1) for the canonical transformation.
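Roughly what that clipping and normalization step does, as a numpy sketch (one possible reading of "normalize to (0, 1)"; the -1 invalid value follows the symptom described earlier, and the exact convention should match your metric_scale / depth_normalize settings):

```python
import numpy as np

def process_depth(depth_m, d_min=0.3, d_max=3.5):
    """Illustrative only: keep depths inside the valid range, scale them to (0, 1),
    and mark everything else as invalid (-1)."""
    valid = (depth_m > d_min) & (depth_m < d_max)
    out = np.full_like(depth_m, -1.0, dtype=np.float64)
    out[valid] = (depth_m[valid] - d_min) / (d_max - d_min)
    return out, valid
```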
Now, I get that this might not be the most desirable dataset with such small images, so I wasn't expecting any impressive results. But at least I was expecting to train on it without introducing any errors in data loading / pre-processing. I can't figure out why, but a single forward() pass through the network generates predictions that are not the same shape as the ground truth, so the loss calculation, and hence the training, fails at the very beginning.
Here are a bunch of debug prints of the prediction and GT shapes:-
I have also tried turning off almost all the augmentations except HorizontalFlip:-