Closed choyingw closed 7 months ago
For our foundation models, we normalize the images with ImageNet mean/std because the pre-trained encoder also normalizes its inputs in this manner. As for the fine-tuned models, honestly, I did not pay special attention to the mean and std. According to my experiments, both the ImageNet mean/std and the 0.5 mean/std (used by MiDaS and ZoeDepth) work well. The only important thing is to use the same mean/std for both training and inference.
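To make the two schemes mentioned above concrete, here is a minimal sketch of both normalizations, assuming a uint8 RGB image as input (the function names are illustrative, not from the repository):

```python
import numpy as np

# Standard ImageNet channel statistics (RGB order).
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def normalize_imagenet(img_uint8):
    """Scale to [0, 1], then apply ImageNet mean/std per channel."""
    x = img_uint8.astype(np.float32) / 255.0
    return (x - IMAGENET_MEAN) / IMAGENET_STD

def normalize_half(img_uint8):
    """Scale to [0, 1], then map to roughly [-1, 1] with mean/std = 0.5
    (the MiDaS / ZoeDepth convention)."""
    x = img_uint8.astype(np.float32) / 255.0
    return (x - 0.5) / 0.5

# Dummy mid-gray image to show the two value ranges.
img = np.full((2, 2, 3), 128, dtype=np.uint8)
print(normalize_imagenet(img)[0, 0])  # small positive values near 0
print(normalize_half(img)[0, 0])      # ~0.004 per channel
```

Either is fine in practice; the key point from the answer above is consistency — whichever scheme the model was trained with must also be applied at inference time.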
I tried both the foundation model and the fine-tuned metric depth model. I noticed a slight difference: the foundation models are trained and run with color normalization applied to the input images, but the fine-tuned metric depth models are tuned and run without it (i.e., the pixel values are only divided by 255.0). Why is there such a difference?
Thanks for the help!