BeileiCui / SurgicalDINO

[IPCAI'2024 (IJCARS special issue)] Surgical-DINO: Adapter Learning of Foundation Models for Depth Estimation in Endoscopic Surgery
difference between Surgical-DINO SSL and Surgical-DINO Supervised #2

Closed by RemaDaher 4 months ago

RemaDaher commented 4 months ago

Hello,

It is unclear to me, both in the paper and here, what the difference is between Surgical-DINO SSL and Surgical-DINO Supervised. I assume that when you say Surgical-DINO in the paper you mean the supervised one. What is unclear is which training and testing procedures are used for each. When you say in the paper "trained our proposed model in a Self-Supervised Learning (SSL) manner with the baseline of AF-SfMLearner [17]. We replace the encoder in AF-SfMLearner with Surgical-DINO and resize the image to 224 × 224 pixels to fit the patch size of DINOv2", did you mean this only for Surgical-DINO SSL? If so, how is Surgical-DINO trained?

Also, some results in the README here differ from those in the paper, and the results show that DINOv2, DINOv2 fine-tuned, or AF-SfMLearner sometimes outperform the Surgical-DINO SSL method. Could you explain why, or whether the results in the paper are outdated?

Thank you !

BeileiCui commented 4 months ago

Hi, thanks for your question. Yes, the passage "trained our proposed model…… patch size of DINOv2" applies only to Surgical-DINO SSL. Surgical-DINO is trained in a supervised manner on the SCARED dataset with the losses provided in the code and the paper. I noticed some small differences in the results; please treat the results in the paper as the confirmed ones, because I may have copied some values incorrectly into the table here.
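
For readers wondering why the quoted passage resizes to 224 × 224: a minimal sketch of the arithmetic, assuming the 14 × 14 patch size used by the DINOv2 ViT backbones. The helper name `patch_grid` is hypothetical and is not part of this repository.

```python
# Assumption: DINOv2 ViT backbones split the image into 14x14 pixel patches.
PATCH_SIZE = 14


def patch_grid(height, width, patch_size=PATCH_SIZE):
    """Return (rows, cols) of patches, or raise if the image does not tile evenly."""
    if height % patch_size or width % patch_size:
        raise ValueError(
            f"{height}x{width} is not a multiple of patch size {patch_size}"
        )
    return height // patch_size, width // patch_size


# 224 = 16 * 14, so a 224x224 input tiles into a 16x16 grid of patches.
rows, cols = patch_grid(224, 224)
num_tokens = rows * cols
print(rows, cols, num_tokens)
```

An input whose sides are not multiples of 14 would not tile evenly, which is presumably why the image is resized before it reaches the encoder.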

As for the results, I could not find any metric on which DINOv2 (zero-shot) or AF-SfMLearner is better than Surgical-DINO SSL. DINOv2 (fine-tuned) outperforms Surgical-DINO SSL on some metrics, probably because DINOv2 (fine-tuned) is trained with full supervision, where ground-truth depth is used during training. Surgical-DINO SSL uses only the reprojection difference as the supervision signal for depth estimation; ground-truth depth is not involved. So, naturally, the fully supervised approach should perform better than the self-supervised one.
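
To make the distinction concrete, here is a minimal sketch of the two kinds of supervision signal. It is not the code from this repository: the scale-invariant log loss and the plain L1 photometric error are generic stand-ins, the function names are hypothetical, and the warping of the source frame (which needs predicted depth and camera pose) is assumed to have happened already. The actual losses are the ones given in the paper and code.

```python
import math


def supervised_silog_loss(pred_depth, gt_depth, lam=0.5):
    """Scale-invariant log loss. Requires ground-truth depth for every pixel."""
    diffs = [math.log(p) - math.log(g) for p, g in zip(pred_depth, gt_depth)]
    n = len(diffs)
    mean_sq = sum(d * d for d in diffs) / n
    mean = sum(diffs) / n
    return mean_sq - lam * mean * mean


def photometric_l1_loss(target_pixels, warped_source_pixels):
    """Self-supervised signal: mean absolute intensity difference between the
    target frame and a source frame already warped into the target view.
    No ground-truth depth appears anywhere in this loss."""
    n = len(target_pixels)
    return sum(abs(t - s) for t, s in zip(target_pixels, warped_source_pixels)) / n


# Toy values, made up purely for illustration.
pred = [2.0, 4.0, 8.0]   # predicted depth
gt = [1.0, 2.0, 4.0]     # ground-truth depth (only available when supervised)
print(supervised_silog_loss(pred, gt))

target = [0.2, 0.5, 0.9]    # target-frame intensities
warped = [0.25, 0.45, 0.9]  # source frame warped using predicted depth and pose
print(photometric_l1_loss(target, warped))
```

The supervised loss compares the prediction directly against measured depth, while the photometric loss only checks whether the predicted depth makes two images agree. That indirect signal is weaker, which is consistent with the fully supervised DINOv2 (fine-tuned) beating the SSL variant on some metrics.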