facebookresearch / dino

PyTorch code for Vision Transformers training with the Self-Supervised learning method DINO
Apache License 2.0

Performance on custom medical data #123

Open MariaKokshaikina opened 2 years ago

MariaKokshaikina commented 2 years ago

Hello, thank you for this great repository. I used it to test DINO's performance in detecting pneumothorax in chest X-rays and segmenting tissues in histopathological images. Together with a colleague, I tried different hyperparameters and configurations; however, in all our experiments the activation maps showed no signs of recognising the disease.

We published the results of our experiments in this blog post: https://medium.com/@mllabucu/transformer-based-self-supervised-learning-for-medical-images-41395d069829

The tested changes include:

We wonder whether the poor performance stems solely from the nature of our dataset (described in the blog post), or whether there is anything we could have missed when training DINO.
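For reference, the maps discussed here can be obtained from the CLS token's last-layer self-attention, as in the repo's visualization script. A minimal sketch of the reshaping step, using a dummy attention tensor in place of `model.get_last_selfattention(img)` and assuming ViT-S/8 shapes (6 heads, 8×8 patches, 224×224 input):

```python
import torch

# Assumed shapes for a DINO ViT-S/8 on a 224x224 input:
# 6 heads, a 28x28 grid of 8x8 patches, plus one CLS token.
num_heads, grid, patch = 6, 28, 8
tokens = grid * grid + 1

# Dummy last-layer attention tensor (batch, heads, tokens, tokens);
# in the DINO repo this would come from model.get_last_selfattention(img).
attn = torch.rand(1, num_heads, tokens, tokens).softmax(dim=-1)

# Keep only the CLS token's attention to the patch tokens and
# reshape each head into a (grid, grid) spatial map.
cls_attn = attn[0, :, 0, 1:]                    # (heads, grid*grid)
maps = cls_attn.reshape(num_heads, grid, grid)  # (heads, grid, grid)

# Upsample to pixel resolution for overlay on the input image.
maps = torch.nn.functional.interpolate(
    maps.unsqueeze(0), scale_factor=patch, mode="nearest"
)[0]                                            # (heads, 224, 224)
print(maps.shape)
```

Each head yields its own map, and in medical images it is worth inspecting all heads separately rather than only their mean, since individual heads can attend to very different structures.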

woctezuma commented 2 years ago

Maybe relevant:

In this work we conduct large-scale pre-training on large source datasets of either natural (ImageNet-21k/1k) or medical chest X-Ray images and compare full and few-shot transfer using different target datasets from both natural and medical imaging domains. Our observations provide evidence that while pre-training and transfer on closely related datasets do show clear benefit of increasing model and data size during pre-training, such benefits are not clearly visible when source and target datasets are further apart. These observations hold across both full and few-shot transfer and indicate that scaling laws pointing to improvement of generalization and transfer with increasing model and data size are incomplete and should be revised by taking into account the type and proximity of the source and target data, to correctly predict the effect of model and data scale during pre-training on transfer.

from:

Cherti, Mehdi, and Jenia Jitsev. "Effect of large-scale pre-training on full and few-shot transfer learning for natural and medical images." arXiv preprint arXiv:2106.00116 (2021). (code)

psteinb commented 2 years ago

Hi, great to see this post being circulated. :+1:

I also ran experiments with DINO. In my case, I trained on materials science images from a synchrotron light source.

Besides obstacles in getting the code to run and establishing stable training (I am not sure why open PRs from the community are not being merged), I also made observations that run counter to one of the paper's claims.

Quoting from the paper's abstract:

first, self-supervised ViT features contain explicit information about the semantic segmentation of an image, which does not emerge as clearly with supervised ViTs, nor with convnets.

While the claim may be true in a narrow sense from my point of view (the aforementioned semantic information might be encoded somewhere in the trained network), I am starting to doubt that it has any practical use or relevance for the trustworthiness of the predictions.
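One way to check whether that information is at least linearly accessible, rather than merely "somewhere in the net", is a linear probe on frozen patch features: if a linear classifier can recover per-patch labels, the claim has some practical bite. A minimal synthetic sketch, with random vectors standing in for frozen DINO ViT-S patch embeddings (384-d) and toy foreground/background labels; none of these names come from the repo:

```python
import torch

torch.manual_seed(0)

# Synthetic stand-ins for frozen DINO patch embeddings (384-d for ViT-S)
# and per-patch foreground/background labels. The toy labels are made
# linearly decodable on purpose, so the probe should succeed here.
n_patches, dim = 2000, 384
feats = torch.randn(n_patches, dim)
labels = (feats[:, 0] > 0).long()

# Linear probe: a single linear layer trained with cross-entropy,
# while the "backbone" features stay frozen.
probe = torch.nn.Linear(dim, 2)
opt = torch.optim.Adam(probe.parameters(), lr=1e-2)
for _ in range(200):
    opt.zero_grad()
    loss = torch.nn.functional.cross_entropy(probe(feats), labels)
    loss.backward()
    opt.step()

acc = (probe(feats).argmax(dim=1) == labels).float().mean().item()
print(f"probe accuracy: {acc:.2f}")
```

On real data, a probe accuracy near chance would support the doubt above: the segmentation information may exist in the weights without being usable in any straightforward way.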

I'd appreciate more reports about what people do find using this approach or feedback on the statements above. Maybe it's the lack of inductive bias in my images? Or the size of the structures? :shrug: