Your surface is highly non-Lambertian, which can itself explain the poor convergence. The occlusion areas also don't help convergence, and the texture is quite self-similar, which makes it easy to get stuck in a local minimum.
Other than that, I have never played with that kind of dataset, so there could be multiple reasons.
Thanks for replying. What you have mentioned really makes sense. I ran an ablation study that showed the explainability mask is really important for my dataset. But in this dataset, the illumination in distant and near regions is very different (in the picture above, for example, the center is darker than the edge).
I just wonder why the network can't learn this feature to estimate the depth.
Hi, sorry about the late answer.
A few pieces of advice to make it work on your data: dismiss the black corners, as they only add noise to your depth because they move with the camera. It should be fairly easy, by modifying the function here: https://github.com/ClementPinard/SfmLearner-Pytorch/blob/master/inverse_warp.py#L186 Instead of only checking that warped pixels fall outside the regular image crop (x and y outside of [-1, 1]), also check whether they fall outside the valid image area, which would take your round corners into account.
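For reference, a minimal sketch of what that check could look like (the function name, tensor shapes and `corner_radius_frac` value are my assumptions, not code from the repo):

```python
import torch

def valid_points_with_corners(pixel_coords, corner_radius_frac=0.95):
    # pixel_coords: [B, H, W, 2] warped coordinates normalized to [-1, 1]
    # Standard check: the warped point lands inside the image rectangle
    in_frame = pixel_coords.abs().max(dim=-1)[0] <= 1
    # Extra check (assumption): the warped point also lands inside a centered
    # circle approximating the round field of view, so black corners are dropped
    in_fov = pixel_coords.norm(dim=-1) <= corner_radius_frac
    return in_frame & in_fov
```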
Another thing you should take into account is the reflected light. Again, the colors it induces are not consistent with camera displacement and depth, and should thus be ignored. For each image, try to detect the reflected light. The simplest classifier that comes to mind is to dismiss points that are too bright, as they probably come from reflected light and won't have much texture anyway.
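A minimal sketch of that brightness-based classifier, assuming images normalized to [0, 1] (the helper name and threshold are hypothetical and should be tuned on your data):

```python
import torch

def reflection_mask(img, brightness_thresh=0.95):
    # img: [B, 3, H, W] in [0, 1]. Returns a [B, 1, H, W] mask that is 0 where
    # pixels are likely specular reflections (too bright) and 1 elsewhere;
    # multiply it into the per-pixel photometric error.
    brightness = img.max(dim=1, keepdim=True)[0]  # per-pixel max over RGB
    return (brightness < brightness_thresh).float()
```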
Lastly, you need to take occlusions into account. The code from this repo, apart from the explainability mask, explicitly doesn't. You can try to adapt the smooth loss into a more edge-aware smooth loss: very contrastive areas have a greater chance of being occlusions, because the color distribution changes abruptly from one surface to another. See the one they use in monodepth2, among others.
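For illustration, here is an edge-aware smoothness term in the spirit of the one used in monodepth2 (my sketch, not code taken from either repo): disparity gradients are down-weighted where the image itself has strong gradients, so sharp boundaries are penalized less.

```python
import torch

def edge_aware_smooth_loss(disp, img):
    # disp: [B, 1, H, W] predicted disparity, img: [B, 3, H, W] reference image
    # Normalize disparity so the loss is invariant to its overall scale
    norm_disp = disp / (disp.mean(dim=(2, 3), keepdim=True) + 1e-7)

    grad_disp_x = (norm_disp[:, :, :, :-1] - norm_disp[:, :, :, 1:]).abs()
    grad_disp_y = (norm_disp[:, :, :-1, :] - norm_disp[:, :, 1:, :]).abs()

    grad_img_x = (img[:, :, :, :-1] - img[:, :, :, 1:]).abs().mean(1, keepdim=True)
    grad_img_y = (img[:, :, :-1, :] - img[:, :, 1:, :]).abs().mean(1, keepdim=True)

    # Down-weight disparity gradients where the image gradient is strong
    grad_disp_x = grad_disp_x * torch.exp(-grad_img_x)
    grad_disp_y = grad_disp_y * torch.exp(-grad_img_y)

    return grad_disp_x.mean() + grad_disp_y.mean()
```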
If none of this works, you can try a totally different method like COLMAP to get the depth maps. That's not to say only COLMAP will work, but since COLMAP is far more sophisticated than current self-supervised methods like this one (at the price of much longer computing time), if it doesn't work, nothing will.
Clément
Thanks a lot for your advice, it's really helpful. I have tried these tricks and evaluated the model. The edge-aware smooth loss really improves the result, while the other two do not (maybe the explainability mask already plays a similar role). But since the baseline doesn't achieve high accuracy, the model is not good enough for application. I will try more SOTA monocular depth estimation models. Really appreciate your sharing and work!
You might want to contact Josué Ruano, who happened to work in my lab on similar problems of depth estimation for colonoscopy videos. https://www.researchgate.net/profile/Josue_Ruano
IIRC, he ended up constructing a synthetic dataset for supervised learning with Blender, and it worked nicely on most videos.
Clément
Thanks for your information!
Like the following picture: the network seems to learn the structure well, but the distance is not learned well. The depth in the distance seems to be similar to that at close range. What might be causing this? I use the pretrained model to fine-tune. Hyperparameters are all default except m=0.2 and epochs=20. And I added a corner mask to photometric_reconstruction_loss, since the four corners in my dataset are all black and meaningless.
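For what it's worth, the corner masking described here could look roughly like this (a sketch under my own naming and shapes, not the actual modification to photometric_reconstruction_loss):

```python
import torch

def masked_photometric_error(ref_img, warped_img, corner_mask):
    # ref_img, warped_img: [B, 3, H, W]; corner_mask: [1, 1, H, W] with 1 inside
    # the circular field of view and 0 in the black corners
    diff = (ref_img - warped_img).abs() * corner_mask
    # Average only over valid pixels so the masked corners don't shrink the loss
    return diff.sum() / (corner_mask.expand_as(diff).sum() + 1e-7)
```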