bowang-lab / MedSAM

Segment Anything in Medical Images
https://www.nature.com/articles/s41467-024-44824-z
Apache License 2.0

Question about internal validation protocol #192

Closed Zrrr1997 closed 10 months ago

Zrrr1997 commented 10 months ago

Dear authors,

Thank you for the great effort towards a medical foundation model! I was wondering how you performed the internal validation. From the paper, I understand that you split the data 80/10/10 into train/validation/test after gathering all the datasets from the various modalities and tasks.

Do you then report the performance on the 10% test split as "internal validation", e.g. in Tables 5-8 of the supplementary? And how is this 10% random split computed: is it 10% of all the aggregated datasets, or 10% of each individual dataset, then combined to form the final internal validation set?

I am asking because I cannot see any results for the Abdomen-US dataset in Table 7. Table 3 shows that it has 60 samples and several targets, but in Table 7 these could only fall under the "Kidney" row, or they are not part of the results there at all. Can I assume that each row in Table 7 aggregates all datasets with a given target and modality? For example, does the kidney row combine results from both Abdomen-US and CT2US, or is each row computed over a single dataset?

Thanks in advance!

Best, Zdravko

JunMa11 commented 10 months ago

Hi @Zrrr1997 ,

Sorry for my late reply. Thank you very much for your insightful questions.

  1. For the internal validation set, we randomly selected 10% of each individual dataset; these subsets were then aggregated to form the composite internal validation set. Specifically, for endoscopy frames and pathology images, the 10% validation subset was determined based on distinct video sequences or patients rather than individual images, to prevent information leakage (a rough sketch follows after this list).

  2. Regarding the Abdomen-US dataset, we faced a challenge due to the limited number of samples available for each segmentation target (gallbladder: 12, kidney: 15, liver: 40, spleen: 6, vessel: 16). This small sample size makes statistical significance testing unreliable, so we used the dataset entirely in the training set instead of splitting it for validation or testing. This is why its results are not explicitly reported in Table 7. Each row in Table 7 aggregates results from all datasets sharing a common target and modality (see the second sketch below).
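Here is a rough Python sketch of the splitting strategy in point 1 (not the exact code we used; the `split_dataset` helper and the toy dataset names are only illustrative):

```python
import random

random.seed(0)  # fixed seed so the held-out subset is reproducible

def split_dataset(samples, groups=None, val_fraction=0.1):
    """Hold out ~val_fraction of one dataset for internal validation.

    If `groups` maps each sample to its video sequence or patient ID
    (endoscopy frames, pathology images), whole groups are held out so
    that no sequence/patient appears in both splits; otherwise
    individual images are sampled.
    """
    if groups is None:
        n_val = max(1, round(len(samples) * val_fraction))
        held_out = set(random.sample(samples, n_val))
    else:
        unique_groups = sorted({groups[s] for s in samples})
        n_val = max(1, round(len(unique_groups) * val_fraction))
        val_groups = set(random.sample(unique_groups, n_val))
        held_out = {s for s in samples if groups[s] in val_groups}
    train = [s for s in samples if s not in held_out]
    val = [s for s in samples if s in held_out]
    return train, val

# Toy example: a CT dataset split per image, an endoscopy dataset split per video.
ct_images = [f"ct_{i:03d}" for i in range(100)]
endo_frames = [f"vid{v}-frame{f}" for v in range(10) for f in range(20)]
endo_groups = {s: s.split("-")[0] for s in endo_frames}

internal_val = []
for samples, groups in [(ct_images, None), (endo_frames, endo_groups)]:
    _, val = split_dataset(samples, groups)
    internal_val.extend(val)  # aggregate per-dataset validation subsets
print(len(internal_val))  # 10 CT images + all 20 frames of one held-out video = 30
```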
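And a small sketch of how a Table 7 row pools results across datasets (again only illustrative; the dataset names and DSC values are made up):

```python
import pandas as pd

# Made-up per-image validation scores from two US datasets that segment
# the same target; a Table 7 row pools all such datasets together.
results = pd.DataFrame({
    "dataset":  ["dataset_A", "dataset_A", "dataset_B"],
    "modality": ["US", "US", "US"],
    "target":   ["kidney", "kidney", "kidney"],
    "dsc":      [0.91, 0.88, 0.93],
})

# One row per (modality, target), pooled over all contributing datasets.
row = results.groupby(["modality", "target"])["dsc"].agg(["median", "count"])
print(row)
```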

I hope this clarifies your concerns. We appreciate your thorough review and are happy to provide any further clarifications needed.

BTW, we will be launching a CVPR challenge on MedSAM on Laptop. Welcome to join us :) https://www.codabench.org/competitions/1847/

Best regards, Jun

JunMa11 commented 10 months ago

We just released LiteMedSAM here: https://github.com/bowang-lab/MedSAM/tree/LiteMedSAM

It is 10x faster than MedSAM. Any comments are welcome.

Zrrr1997 commented 9 months ago

Hi @JunMa11,

Thank you so much for the thorough explanation! It seems my initial guess was correct :) I suppose the goal was to show how MedSAM performs per modality and target rather than on individual datasets. Thank you also for sharing the challenge and the new LiteMedSAM model. I am excited to try it out!

Best, Zdravko