Project-MONAI / research-contributions

Implementations of recent research prototypes/demonstrations using MONAI.
https://monai.io/
Apache License 2.0

Validation Accuracy on UNETR and Swin UNETR #69

Closed · hanoonaR closed this 2 years ago

hanoonaR commented 2 years ago

Hi, thank you for sharing your wonderful work. Could you please provide some clarity on the evaluation accuracy on the BTCV validation dataset?

1) For UNETR, when the provided pretrained models are evaluated on the BTCV validation set, the accuracy is 77.64, while the test accuracy reported in the paper is 85.3. Is this large gap expected, and is it due to the difference between the validation and test datasets?

2) For Swin UNETR, when the pretrained models are evaluated on the BTCV validation set, the accuracy is 81.56. The table in the Swin UNETR README reports 81.86, compared to the 91.8 test accuracy reported in the paper. Are the numbers in the README table validation accuracies?
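
For reference, this is roughly how I computed the validation Dice (a minimal sketch: the checkpoint path is a placeholder, and `val_loader` stands in for a DataLoader over the BTCV validation split built with the repo's transforms):

```python
import torch
from monai.data import decollate_batch
from monai.inferers import sliding_window_inference
from monai.metrics import DiceMetric
from monai.networks.nets import UNETR
from monai.transforms import AsDiscrete

# 14 classes (13 organs + background) for BTCV; checkpoint name is a placeholder.
model = UNETR(in_channels=1, out_channels=14, img_size=(96, 96, 96))
model.load_state_dict(torch.load("unetr_btcv_checkpoint.pt", map_location="cpu"))
model.eval()

dice_metric = DiceMetric(include_background=False, reduction="mean")
post_pred = AsDiscrete(argmax=True, to_onehot=14)
post_label = AsDiscrete(to_onehot=14)

with torch.no_grad():
    # `val_loader` is assumed to yield dicts with "image" and "label" keys,
    # built from the BTCV validation split with the repo's transforms.
    for batch in val_loader:
        logits = sliding_window_inference(
            batch["image"], roi_size=(96, 96, 96), sw_batch_size=4, predictor=model
        )
        preds = [post_pred(p) for p in decollate_batch(logits)]
        labels = [post_label(l) for l in decollate_batch(batch["label"])]
        dice_metric(y_pred=preds, y=labels)

print("mean Dice:", dice_metric.aggregate().item())
dice_metric.reset()
```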

Thank you.

lbf4616 commented 2 years ago

I see a similar issue on another dataset (similar to BTCV): both of these models perform even worse than a 3D UNet.

ahatamiz commented 2 years ago

Hi @hanoonaR

Thanks for the comments. The validation accuracy reported in the repo is for a single fold, obtained by splitting the publicly available training data. For the leaderboard, however, following the approach of the previous SOTA for both models, we used an ensemble of 20 models and extra private training data, which increases the number of training samples to 80 volumes. This is described in Section 4.3 of the UNETR paper.
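
For illustration, probability-level ensembling over several checkpoints looks roughly like the sketch below (the checkpoint names, ROI size, and model arguments are placeholders, not the exact leaderboard configuration):

```python
import torch
from monai.inferers import sliding_window_inference
from monai.networks.nets import UNETR

# Placeholder checkpoint names -- the leaderboard ensemble used 20 models
# trained on the extended (80-volume) dataset, which is not public.
ckpt_paths = [f"model_fold{i}.pt" for i in range(20)]

model = UNETR(in_channels=1, out_channels=14, img_size=(96, 96, 96))

@torch.no_grad()
def ensemble_predict(image, roi_size=(96, 96, 96)):
    """Average the softmax probabilities of every checkpoint, then argmax."""
    probs = None
    for path in ckpt_paths:
        model.load_state_dict(torch.load(path, map_location="cpu"))
        model.eval()
        logits = sliding_window_inference(image, roi_size, 4, model)
        p = torch.softmax(logits, dim=1)
        probs = p if probs is None else probs + p
    return (probs / len(ckpt_paths)).argmax(dim=1, keepdim=True)
```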

I hope this addresses the concerns.

Best,

ahatamiz commented 2 years ago

Hi @lbf4616

Thanks for the comment. In our experience with both models, each task requires careful tuning to achieve optimal performance. For instance, the choice of optimizer (AdamW in our case), the learning rate scheduler, and the number of epochs can all play an important role in reaching the best performance.
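
As a rough sketch of what such a setup can look like, here is AdamW with a linear-warmup cosine schedule (the hyper-parameter values are illustrative placeholders, not our tuned settings, which live in the training scripts):

```python
import math
import torch
from monai.networks.nets import UNETR

model = UNETR(in_channels=1, out_channels=14, img_size=(96, 96, 96))

# Illustrative values only; tune per task.
max_epochs, warmup_epochs = 5000, 50
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-5)

def warmup_cosine(epoch):
    # Linear warmup for the first epochs, then cosine decay toward zero.
    if epoch < warmup_epochs:
        return (epoch + 1) / warmup_epochs
    t = (epoch - warmup_epochs) / max(1, max_epochs - warmup_epochs)
    return 0.5 * (1.0 + math.cos(math.pi * t))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_cosine)

# Typical loop: optimizer.step() per batch, scheduler.step() once per epoch.
```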

Overall, on most MSD tasks and on BTCV, our models consistently outperform UNet and nnU-Net when carefully tuned.

Best,