KaiqiangXiong / CL-MVSNet

[ICCV2023] CL-MVSNet: Unsupervised Multi-view Stereo with Dual-level Contrastive Learning
MIT License

Why only training on DTU dataset #3

Closed CanCanZeng closed 4 months ago

CanCanZeng commented 4 months ago

Hi, thank you for sharing this great work! I'm new to ML-based MVS, so my question may sound a bit naive. Since the method is unsupervised, why train only on the DTU dataset instead of expanding the training data as much as possible? I thought the point of researching unsupervised methods was to obtain more training data at low cost and improve the model's generalization ability; training on DTU alone cannot demonstrate that advantage. Looking forward to your guidance. Thank you in advance.

KaiqiangXiong commented 4 months ago

Thanks for your attention. The unsupervised MVS community has not yet discussed in depth whether training on a wider range of data improves generalization, so let me share my own viewpoint.

Unsupervised MVS methods usually rely on a photometric consistency loss, which breaks down in scenes with occlusions and reflections; in those regions the loss provides a misleading supervision signal. Because of this, introducing more unlabeled data does not necessarily yield significant improvements. I tried fine-tuning my DTU-trained model on the BlendedMVS dataset, but the performance gain was very limited. I suspect this is because BlendedMVS contains larger scenes with more occlusions and reflections than the small object-level scenes in DTU, so such training does not help the model much.

Based on these observations, simply adding a large amount of unlabeled data may not lead to noticeable improvements in unsupervised MVS; the limitation of the photometric consistency loss remains the main bottleneck. I hope this addresses your question. If there's anything you'd like to add or if you have further questions, please let me know.
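To make the occlusion problem concrete, here is a toy sketch of a masked photometric consistency loss (not from the CL-MVSNet code; the function and argument names are made up for illustration). Pixels that are occluded or warped outside the source view must be masked out, otherwise they contribute exactly the misleading supervision signal described above:

```python
import numpy as np

def photometric_loss(ref, src_warped, valid_mask):
    """Masked L1 photometric consistency loss (illustrative sketch).

    ref, src_warped: (H, W, 3) float arrays; src_warped is the source
    image already warped into the reference view (e.g., via plane-sweep
    homographies at the predicted depth).
    valid_mask: (H, W) bool array, False where the warp falls outside
    the source image or the pixel is detected as occluded.
    """
    diff = np.abs(ref - src_warped).mean(axis=-1)  # per-pixel L1 error
    # Occluded / out-of-view pixels violate photometric consistency,
    # so averaging only over valid pixels avoids misleading gradients.
    return float(diff[valid_mask].mean()) if valid_mask.any() else 0.0
```

Reflective surfaces are harder: the pixels are geometrically valid but their colors change with viewpoint, so no binary mask fully fixes the loss there.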

CanCanZeng commented 4 months ago

I understand what you mean now, thank you for sharing your insights. I have also considered that the photometric consistency assumption does not hold in many cases. I am wondering whether it is possible to use the distance between deep features as the loss instead of the photometric error. Another idea is to predict an occlusion map alongside the depth map to handle occlusions, but I haven't figured out how to design the loss for the occlusion map yet.
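The feature-distance idea can be sketched as follows (a hypothetical illustration, not from any of the repos mentioned here): compare per-pixel deep features of the reference view and the warped source view instead of raw RGB. Normalizing the features makes the loss less sensitive to the per-view brightness and reflection changes that break raw pixel matching:

```python
import numpy as np

def feature_metric_loss(feat_ref, feat_src_warped, eps=1e-8):
    """L1 distance between L2-normalized deep features (illustrative sketch).

    feat_ref, feat_src_warped: (H, W, C) feature maps from a CNN backbone;
    the source features are warped into the reference view, just like the
    images in a photometric loss.
    """
    def l2_normalize(f):
        # Normalize each pixel's feature vector to unit length so the
        # loss is invariant to per-view feature magnitude changes.
        return f / (np.linalg.norm(f, axis=-1, keepdims=True) + eps)

    return float(np.abs(l2_normalize(feat_ref) - l2_normalize(feat_src_warped)).mean())
```

A design question left open by this sketch is whether the feature extractor is frozen (e.g., a pretrained network) or trained jointly, which affects whether the loss can collapse.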

KaiqiangXiong commented 4 months ago

I believe this is a meaningful direction for improvement. It is worth noting that using deep feature distance as the loss (a feature-metric loss) has already been applied in KD-MVS. Cues for handling occlusion have also been discussed in some works (e.g., PVSNet, Vis-MVSNet). Perhaps you can explore further and build on those works.
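In the spirit of the visibility/occlusion cues discussed above (a rough sketch only; this is not the actual Vis-MVSNet or PVSNet formulation), one simple way to use a predicted occlusion map is to turn it into per-view weights and aggregate the photometric errors from multiple source views, so that views in which a pixel is occluded contribute little to the supervision:

```python
import numpy as np

def visibility_weighted_loss(per_view_errors, vis_weights, eps=1e-8):
    """Aggregate per-source-view errors with visibility weights (sketch).

    per_view_errors: (V, H, W) photometric error maps, one per source view.
    vis_weights: (V, H, W) predicted visibility in [0, 1]; near 0 where
    the pixel is occluded in that source view.
    """
    # Normalize weights across views per pixel; eps guards against a
    # pixel that is predicted occluded in every source view.
    w = vis_weights / (vis_weights.sum(axis=0, keepdims=True) + eps)
    return float((w * per_view_errors).sum(axis=0).mean())
```

How the visibility map itself is supervised (the question raised above) is the hard part; in published work it typically comes from matching uncertainty or cross-view consistency rather than a direct occlusion label.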

CanCanZeng commented 4 months ago

Your guidance is really helpful, thank you very much!

zhz120 commented 4 months ago

> The limitation of photometric consistency loss is still the main bottleneck of the unsupervised MVS.

The latest traditional methods perform well in challenging scenarios, so I am currently pessimistic about the unsupervised MVS field.