nkinnaird closed this issue 3 months ago
We highlighted in the paper that only the DINOv2-Giant-based teacher model is capable enough to generalize from synthetic images to real images, while the other encoders are not. That is why we use this teacher model for pseudo-labeling.
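For readers following the thread, the three-stage flow discussed here (train a teacher on exactly labeled synthetic data, pseudo-label unlabeled real data with it, train a student on those pseudo-labels) can be sketched with a toy example. This is only an illustration under loud assumptions, not the authors' code: the "images" are single numbers, the "models" are least-squares lines, and `fit_line`, `predict`, and all data below are hypothetical stand-ins.

```python
import random

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a*x + b; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var = sum((x - mean_x) ** 2 for x in xs)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = cov / var
    return a, mean_y - a * mean_x

def predict(model, xs):
    """Apply a fitted (a, b) line to a list of inputs."""
    a, b = model
    return [a * x + b for x in xs]

rng = random.Random(0)

# Stage 1: train the teacher on synthetic data with exact labels.
# (Toy assumption: the true "depth" is 2*x + 1.)
synthetic_x = [rng.uniform(0.0, 1.0) for _ in range(200)]
synthetic_y = [2.0 * x + 1.0 for x in synthetic_x]
teacher = fit_line(synthetic_x, synthetic_y)

# Stage 2: the teacher pseudo-labels unlabeled "real" inputs.
real_x = [rng.uniform(0.0, 1.0) for _ in range(1000)]
pseudo_y = predict(teacher, real_x)

# Stage 3: the student is trained only on the pseudo-labeled real data.
student = fit_line(real_x, pseudo_y)
```

In this sketch the student can only be as good as the labels from stage 2, which is why the teacher's ability to generalize to real inputs is the deciding factor in the discussion above.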
Ahh okay, so you used the power of the largest encoder which was able to generalize. Thank you, appreciate the response.
Hi there, wonderful work and paper! Thank you for open-sourcing the models.
I read the paper and have a question about the quality of the pseudo-labels used to train the student model. From my reading, you've switched to training the teacher model purely on synthetic images so that it can benefit from their perfect depth map labels. You mention in the paper that a model trained on such synthetic images would not generalize to real images, for a variety of reasons. However, you then use that teacher model to pseudo-label the many real images, and train the student model on those pseudo-labeled images.
I don't quite understand then how the pseudo-labels are good enough to train the student model. If the teacher model is not generalizable to the real images, then wouldn't the quality of the pseudo-labels be too poor to train a well-performing student model?
The model obviously performs very well, so I'm sure I've missed something, but I wasn't able to find it in the paper. Perhaps it is obvious, but I'd love it if you could share any insight.
Thanks!