
EndoViT

Large-scale Self-supervised Pre-training of Vision Transformers (ViT) on endoscopic images.


Official codebase of the paper: EndoViT: pretraining vision transformers on a large collection of endoscopic images

An earlier arXiv version (without semantic segmentation) can be found here: Whether and When does Endoscopy Domain Pretraining Make Sense?

Authors: Dominik Batić, Felix Holm, Ege Özsoy, Tobias Czempiel, Nassir Navab

@article{batic2023whether,
  title={Whether and When does Endoscopy Domain Pretraining Make Sense?},
  author={Bati{\'c}, Dominik and Holm, Felix and {\"O}zsoy, Ege and Czempiel, Tobias and Navab, Nassir},
  journal={arXiv preprint arXiv:2303.17636},
  year={2023}
}

Quick-Start

Check out our 🤗 Hugging Face page for a guide on using EndoViT as a feature extractor (either frozen or as a backbone to be fine-tuned). Alternatively, you can take a look at endovit_demo.py.
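For orientation, here is a minimal sketch (not the official demo) of loading an EndoViT checkpoint into a timm ViT-B/16 and using it as a frozen feature extractor. The checkpoint file name and the normalization statistics are placeholders; refer to the Hugging Face guide or endovit_demo.py for the exact values.

```python
# Minimal sketch: EndoViT as a frozen feature extractor (file names and
# normalization statistics are placeholders, not the official values).
import torch
import timm
from torchvision import transforms
from PIL import Image

# ViT-B/16 encoder without a classification head (num_classes=0).
model = timm.create_model("vit_base_patch16_224", pretrained=False, num_classes=0)

# Hypothetical local path to one of the checkpoints from the table below.
ckpt = torch.load("endovit_seg.pth", map_location="cpu")
state_dict = ckpt.get("model", ckpt)  # MAE-style checkpoints often nest weights under "model"
msg = model.load_state_dict(state_dict, strict=False)
print("missing:", msg.missing_keys, "unexpected:", msg.unexpected_keys)

model.eval()                          # freeze the backbone
for p in model.parameters():
    p.requires_grad = False

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    # Placeholder statistics -- replace with the Endo700k mean/std from the HF guide.
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

img = preprocess(Image.open("frame.png").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    tokens = model.forward_features(img)  # token features (shape depends on timm version)
    features = model(img)                 # (1, 768) pooled image embedding
```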

Pre-trained EndoViT Checkpoints

To prevent data leakage in our evaluation, we excluded the respective test sets of the segmentation, action triplet detection, and surgical phase recognition tasks from backbone pre-training. You can find the weights for each of these backbone versions below.

| Excluded Data (Test Sets) | Checkpoint |
|:---|:---|
| CholecSeg8k (Segmentation) | EndoViT_Seg |
| CholecT45 (Action Triplet Detection) | EndoViT_ATD |
| Cholec80 (Surgical Phase Recognition) | EndoViT_SPR |

Use these checkpoints if you wish to skip EndoViT's pre-training.
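Below is a minimal sketch of the fine-tuning route, assuming a timm ViT-B/16 backbone, a hypothetical local checkpoint file endovit_spr.pth, and a plain linear head (Cholec80 distinguishes 7 surgical phases). The repository's actual fine-tuning pipelines are described in the Usage section below.

```python
# Minimal fine-tuning sketch (assumption, not the repository's training code):
# attach a linear head to the EndoViT encoder and fine-tune end-to-end.
import torch
import torch.nn as nn
import timm

backbone = timm.create_model("vit_base_patch16_224", pretrained=False, num_classes=0)
ckpt = torch.load("endovit_spr.pth", map_location="cpu")   # hypothetical file name
backbone.load_state_dict(ckpt.get("model", ckpt), strict=False)

head = nn.Linear(backbone.num_features, 7)                 # 7 surgical phases in Cholec80
model = nn.Sequential(backbone, head)

optimizer = torch.optim.AdamW([
    {"params": backbone.parameters(), "lr": 1e-5},         # smaller lr for the pre-trained encoder
    {"params": head.parameters(), "lr": 1e-4},
], weight_decay=0.05)
criterion = nn.CrossEntropyLoss()

# One illustrative training step on dummy data.
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, 7, (8,))
loss = criterion(model(images), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```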

Introduction

The development of novel Computer Vision (CV) methods in the medical field has been largely constrained by the lack of publicly available annotated data. Patient data and recorded surgical procedures are hard to obtain: they are considered highly sensitive information and are therefore protected by numerous laws. Even the annotation procedure is complicated, often requiring the involvement of multiple medical experts.

Consequently, public medical datasets are scarce, and the existing ones contain far fewer annotated images than the CV datasets used for the same task. Pre-training has been shown as a viable strategy to mitigate the downsides of training on small datasets. However, most medical works use models pre-trained on natural images, creating a domain gap between pre-training and fine-tuning.

In this work, we explore pre-training models specifically for use in the endoscopic domain. To this end, we turn to Vision Transformers. Given the very large number of parameters they contain, a large amount of data is needed to train them properly. Therefore, self-supervised pre-training strategies were developed, splitting the use of Transformers into two stages. First, a Transformer is pre-trained on a large collection of raw, unlabelled data to produce a model with a general understanding of the underlying domain. Afterwards, the resulting model is fine-tuned for a specific downstream task, which can now be done with significantly less labelled data.

Project Description

The fact that Vision Transformers can be pre-trained on raw data alone prompted us to combine existing smaller medical datasets into one larger collection. To this end, we introduce Endo700k, a collection of 9 publicly available endoscopic datasets comprising more than 700,000 unlabelled images. An overview of the included datasets is given in the table below, followed by a short indexing sketch.

Endo700k dataset collection

| # | Dataset | # Images |
|:---:|:---|---:|
| 1 | HeiCo | 347,257 |
| 2 | Cholec80 | 184,498 |
| 3 | PSI-AVA | 73,618 |
| 4 | ESAD | 49,544 |
| 5 | LapGyn4 (v1.2) | 38,192 |
| 6 | hSDB-instrument | 35,576 |
| 7 | DSAD | 13,195 |
| 8 | GLENDA (v1.0) | 1,083 |
| 9 | SurgicalActions160 | 761 |
| - | Total | 743,724 |
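As an illustration only (the directory layout and file extensions below are placeholders, not the repository's structure), such a collection can be indexed into a single flat list of unlabelled frames for pre-training:

```python
# Hypothetical sketch: gather every frame from the individual dataset folders
# into one flat list of unlabelled images. Paths are placeholders.
from pathlib import Path

DATASETS = ["HeiCo", "Cholec80", "PSI-AVA", "ESAD", "LapGyn4",
            "hSDB-instrument", "DSAD", "GLENDA", "SurgicalActions160"]
root = Path("./datasets")

image_paths = []
for name in DATASETS:
    for ext in ("*.png", "*.jpg"):
        image_paths.extend(sorted((root / name).rglob(ext)))

print(f"Endo700k index: {len(image_paths)} unlabelled images")
```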

Using Endo700k, we pre-train a Vision Transformer model following the Masked Autoencoder (MAE) approach. An input image is divided into equally sized patches, and a large proportion of them (75%) is masked out. The Transformer is then tasked with reconstructing the missing input. Although conceptually simple, this is a challenging self-supervised task that induces a comprehensive understanding of the observed objects and scenes. Afterwards, the pre-trained ViT model can be fine-tuned as a feature-extraction backbone on various downstream tasks. We visualize the pre-training and fine-tuning procedure in the following image.

EndoViT_model
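For intuition, here is a simplified sketch of MAE-style random masking (an illustration only, not the repository's exact implementation): the image is split into 16x16 patches and 75% of them are hidden, so only the visible quarter would be passed to the encoder.

```python
# Simplified MAE-style random masking on patchified images.
import torch

def random_masking(patches: torch.Tensor, mask_ratio: float = 0.75):
    """patches: (batch, num_patches, patch_dim). Returns visible patches and a binary mask."""
    B, N, D = patches.shape
    len_keep = int(N * (1 - mask_ratio))

    noise = torch.rand(B, N)                      # random score per patch
    ids_shuffle = torch.argsort(noise, dim=1)     # low score = keep
    ids_keep = ids_shuffle[:, :len_keep]

    visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))

    mask = torch.ones(B, N)                       # 1 = masked, 0 = visible
    mask.scatter_(1, ids_keep, 0)
    return visible, mask

# A 224x224 RGB image yields 14x14 = 196 patches of 16*16*3 = 768 values each.
imgs = torch.randn(2, 3, 224, 224)
patches = imgs.unfold(2, 16, 16).unfold(3, 16, 16)        # (2, 3, 14, 14, 16, 16)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(2, 196, 768)

visible, mask = random_masking(patches)
print(visible.shape, mask.sum(dim=1))             # torch.Size([2, 49, 768]), 147 masked per image
```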

Finally, we evaluated EndoViT's performance on three downstream tasks: semantic segmentation, action triplet detection, and surgical phase recognition.

We primarily compare EndoViT's performance to its ImageNet pre-trained ViT counterpart.


Usage

1) Clone the repository:

git clone https://github.com/DominikBatic/EndoViT.git endovit
cd endovit

3) Download Cholec80 (GitHub) (LICENSE) (Request Form)

4) Download and Prepare the Other Datasets

5) Prepare Cholec80

python ./datasets/Cholec80/prepare_cholec80.py

6) Download ImageNet Pre-trained Weights

a) ImageNet weights for pre-training (encoder-decoder weights):

wget -O ./pretraining/mae/ImageNet_pretrained_models/mae_pretrain_vit_base_full.pth https://dl.fbaipublicfiles.com/mae/pretrain/mae_pretrain_vit_base_full.pth

b) ImageNet weights for fine-tuning (encoder weights):

wget -O ./pretraining/mae/ImageNet_pretrained_models/mae_pretrain_vit_base.pth https://dl.fbaipublicfiles.com/mae/pretrain/mae_pretrain_vit_base.pth

7) Run the pre-training

a) Run Pre-training for Semantic Segmentation

source ./pretraining/pretrained_endovit_models/EndoViT_for_Segmentation/pretrain_script

b) Run Pre-training for Action Triplet Detection

source ./pretraining/pretrained_endovit_models/EndoViT_for_ActionTripletDetection/pretrain_script

c) Run Pre-training for Surgical Phase Recognition

source ./pretraining/pretrained_endovit_models/EndoViT_for_SurgicalPhaseRecognition/pretrain_script

Pre-trained Checkpoints: the resulting weights correspond to the EndoViT_Seg, EndoViT_ATD, and EndoViT_SPR checkpoints listed in the Pre-trained EndoViT Checkpoints section above.



Fine-tune EndoViT:


Semantic Segmentation:

8) Download CholecSeg8k dataset (3 GB)

9) Pre-process CholecSeg8k dataset

10) Create a Relative Paths file (RP_file)

python ./datasets/CholecSeg8k/utils/create_RP_file_for_CholecSeg8k.py \
    --data_dir ./datasets/CholecSeg8k/data_preprocessed \
    --output_dir ./datasets/CholecSeg8k/data_preprocessed

11) Full Dataset Experiments

a) Low Res - EndoViT's pre-training resolution (224 x 224)

-------------- EndoViT -------------
source ./finetuning/semantic_segmentation/output_dir/low_res/full_dataset/EndoViT/hyperparam_script

------------- ImageNet -------------
source ./finetuning/semantic_segmentation/output_dir/low_res/full_dataset/ImageNet/hyperparam_script

----------- NoPretraining ----------
source ./finetuning/semantic_segmentation/output_dir/low_res/full_dataset/NoPretraining/hyperparam_script

b) High Res - resolution used in the benchmark paper (256 x 448)

-------------- EndoViT -------------
source ./finetuning/semantic_segmentation/output_dir/high_res/full_dataset/EndoViT/hyperparam_script

------------- ImageNet -------------
source ./finetuning/semantic_segmentation/output_dir/high_res/full_dataset/ImageNet/hyperparam_script

----------- NoPretraining ----------
source ./finetuning/semantic_segmentation/output_dir/high_res/full_dataset/NoPretraining/hyperparam_script

We report the following Full Dataset Results (metric: mean IoU):

|          | ViT NoPretraining | ViT ImageNet   | EndoViT            |
|:--------:|:-----------------:|:--------------:|:------------------:|
| Low Res  | 51.70% ± 0.54%    | 62.45% ± 0.90% | **65.05% ± 0.67%** |
| High Res | 53.18% ± 1.20%    | 63.40% ± 0.81% | **65.32% ± 0.56%** |

Qualitative Full Dataset Results:

Compared to the ImageNet pre-trained model, EndoViT produces more globally consistent outputs (highlighted in black). Furthermore, it is significantly better at segmenting the instruments' tips (highlighted in red).

SegResultsImg1

Compared to the results reported in the following benchmark, EndoViT outperforms other Transformer-based architectures (UNETR) as well as various CNN architectures (including the best-performing U-Net++).

SegResultsImg1

12) Few-shot Learning Experiments

a) Low Res - EndoViT's pre-training resolution (224 x 224)

-------------- EndoViT -------------
source ./finetuning/semantic_segmentation/output_dir/low_res/less_training_data/EndoViT/1_vid_only/hyperparam_script
source ./finetuning/semantic_segmentation/output_dir/low_res/less_training_data/EndoViT/2_vids_only/hyperparam_script
source ./finetuning/semantic_segmentation/output_dir/low_res/less_training_data/EndoViT/4_vids_only/hyperparam_script

------------- ImageNet -------------
source ./finetuning/semantic_segmentation/output_dir/low_res/less_training_data/ImageNet/1_vid_only/hyperparam_script
source ./finetuning/semantic_segmentation/output_dir/low_res/less_training_data/ImageNet/2_vids_only/hyperparam_script
source ./finetuning/semantic_segmentation/output_dir/low_res/less_training_data/ImageNet/4_vids_only/hyperparam_script

----------- NoPretraining ----------
source ./finetuning/semantic_segmentation/output_dir/low_res/less_training_data/NoPretraining/1_vid_only/hyperparam_script
source ./finetuning/semantic_segmentation/output_dir/low_res/less_training_data/NoPretraining/2_vids_only/hyperparam_script
source ./finetuning/semantic_segmentation/output_dir/low_res/less_training_data/NoPretraining/4_vids_only/hyperparam_script

b) High Res - resolution used in the benchmark paper (256 x 448)

-------------- EndoViT -------------
source ./finetuning/semantic_segmentation/output_dir/high_res/less_training_data/EndoViT/1_vid_only/hyperparam_script
source ./finetuning/semantic_segmentation/output_dir/high_res/less_training_data/EndoViT/2_vids_only/hyperparam_script
source ./finetuning/semantic_segmentation/output_dir/high_res/less_training_data/EndoViT/4_vids_only/hyperparam_script

------------- ImageNet -------------
source ./finetuning/semantic_segmentation/output_dir/high_res/less_training_data/ImageNet/1_vid_only/hyperparam_script
source ./finetuning/semantic_segmentation/output_dir/high_res/less_training_data/ImageNet/2_vids_only/hyperparam_script
source ./finetuning/semantic_segmentation/output_dir/high_res/less_training_data/ImageNet/4_vids_only/hyperparam_script

----------- NoPretraining ----------
source ./finetuning/semantic_segmentation/output_dir/high_res/less_training_data/NoPretraining/1_vid_only/hyperparam_script
source ./finetuning/semantic_segmentation/output_dir/high_res/less_training_data/NoPretraining/2_vids_only/hyperparam_script
source ./finetuning/semantic_segmentation/output_dir/high_res/less_training_data/NoPretraining/4_vids_only/hyperparam_script

We report the following Few-shot Learning Results (metric: mean IoU):

| **Low Res**   | **ViT NoPretraining** | **ViT ImageNet** | **EndoViT**         |
|:-------------:|:---------------------:|:----------------:|:-------------------:|
| 1 Video Only  | 29.11% ± 2.94%        | 38.35% ± 8.27%   | **40.95% ± 10.32%** |
| 2 Videos Only | 36.28% ± 5.06%        | 50.36% ± 2.71%   | **54.02% ± 4.18%**  |
| 4 Videos Only | 43.29% ± 0.96%        | 54.17% ± 2.35%   | **57.87% ± 2.70%**  |
| **High Res**  | **ViT NoPretraining** | **ViT ImageNet** | **EndoViT**         |
| 1 Video Only  | 26.66% ± 6.64%        | 39.06% ± 5.17%   | **41.16% ± 10.75%** |
| 2 Videos Only | 35.69% ± 4.45%        | 50.14% ± 4.48%   | **56.05% ± 5.73%**  |
| 4 Videos Only | 44.16% ± 0.75%        | 56.22% ± 1.52%   | **59.81% ± 3.27%**  |

Action Triplet Detection:

13) Download CholecT45 dataset (150 GB)

14) Full Dataset Experiments

##########  ViT Backbone  ##########

-------------- EndoViT -------------
source ./finetuning/action_triplet_detection/output_dir/full_dataset/ViT_backbone/EndoViT/training_script

------------- ImageNet -------------
source ./finetuning/action_triplet_detection/output_dir/full_dataset/ViT_backbone/ImageNet/training_script

----------- NoPretraining ----------
source ./finetuning/action_triplet_detection/output_dir/full_dataset/ViT_backbone/NoPretraining/training_script

##########  CNN Backbone  ##########

------------- ResNet50 -------------
source ./finetuning/action_triplet_detection/output_dir/full_dataset/ResNet50_backbone/training_script

We report the following Full Dataset Results (metric: mean Average Precision - mAP):

| ResNet50       | ViT NoPretraining | ViT ImageNet   | EndoViT            |
|:--------------:|:-----------------:|:--------------:|:------------------:|
| 22.13% ± 1.37% | 13.93% ± 0.43%    | 27.84% ± 0.39% | **30.17% ± 0.01%** |

15) Few-shot Learning Experiments

##########  ViT Backbone  ##########

-------------- EndoViT -------------
source ./finetuning/action_triplet_detection/output_dir/less_training_data/ViT_backbone/EndoViT/training_script

------------- ImageNet -------------
source ./finetuning/action_triplet_detection/output_dir/less_training_data/ViT_backbone/ImageNet/training_script

----------- NoPretraining ----------
-> Since training from scratch performed significantly worse in the Full Dataset Experiments, we skip it here.

##########  CNN Backbone  ##########

------------- ResNet50 -------------
source ./finetuning/action_triplet_detection/output_dir/less_training_data/ResNet50_backbone/training_script

We report the following Few-shot Learning Results (metric: mean Average Precision - mAP):

|               | **ResNet50**   | **ViT ImageNet** | **EndoViT**        |
|:-------------:|:--------------:|:----------------:|:------------------:|
| 2 Videos Only | 10.88% ± 0.50% | 12.22% ± 1.78%   | **17.59% ± 2.94%** |
| 4 Videos Only | 12.37% ± 1.78% | 14.27% ± 1.73%   | **18.52% ± 2.28%** |
| 8 Videos Only | 17.01% ± 1.75% | 19.71% ± 0.61%   | **21.91% ± 0.12%** |

Surgical Phase Recognition:

16) Full Dataset Experiments

##########  ViT Backbone  ##########

-------------- EndoViT -------------
source ./finetuning/surgical_phase_recognition/output_dir/full_dataset/ViT_backbone/EndoViT/training_script

------------- ImageNet -------------
source ./finetuning/surgical_phase_recognition/output_dir/full_dataset/ViT_backbone/ImageNet/training_script

----------- NoPretraining ----------
source ./finetuning/surgical_phase_recognition/output_dir/full_dataset/ViT_backbone/NoPretraining/training_script

##########  CNN Backbone  ##########

------------- ResNet50 -------------
source ./finetuning/surgical_phase_recognition/output_dir/full_dataset/ResNet50_backbone/training_script

We report the following Full Dataset Results (metric: Phase Accuracy):

|         | ResNet50       | ViT NoPretraining | ViT ImageNet       | EndoViT        |
|:-------:|:--------------:|:-----------------:|:------------------:|:--------------:|
| Stage 1 | 79.84% ± 0.30% | 59.21% ± 0.36%    | **82.94% ± 0.69%** | 82.60% ± 1.26% |
| Stage 2 | 87.84% ± 0.58% | 73.42% ± 0.70%    | **89.56% ± 0.65%** | 89.37% ± 0.95% |

17) Few-shot Learning Experiments

##########  ViT Backbone  ##########

-------------- EndoViT -------------
source ./finetuning/surgical_phase_recognition/output_dir/less_training_data/ViT_backbone/EndoViT/training_script

------------- ImageNet -------------
source ./finetuning/surgical_phase_recognition/output_dir/less_training_data/ViT_backbone/ImageNet/training_script

----------- NoPretraining ----------
-> Since training from scratch performed significantly worse in the Full Dataset Experiments, we skip it here.

##########  CNN Backbone  ##########

------------- ResNet50 -------------
source ./finetuning/surgical_phase_recognition/output_dir/less_training_data/ResNet50_backbone/training_script

We report the following Few-shot Learning Results (metric: Phase Accuracy):

| **Stage 1**   | **ResNet50**   | **ViT ImageNet**   | **EndoViT**        |
|:-------------:|:--------------:|:------------------:|:------------------:|
| 2 Videos Only | 47.51% ± 1.33% | 63.59% ± 1.07%     | **67.04% ± 2.92%** |
| 4 Videos Only | 57.80% ± 2.67% | 67.72% ± 0.90%     | **71.80% ± 0.49%** |
| 8 Videos Only | 63.71% ± 1.48% | **75.50% ± 0.32%** | 75.30% ± 1.83%     |
| **Stage 2**   | **ResNet50**   | **ViT ImageNet**   | **EndoViT**        |
| 2 Videos Only | 68.23% ± 1.10% | 77.05% ± 1.71%     | **78.89% ± 1.26%** |
| 4 Videos Only | 74.50% ± 1.76% | 80.00% ± 0.62%     | **80.28% ± 0.71%** |
| 8 Videos Only | 77.43% ± 1.68% | 84.10% ± 0.38%     | **84.68% ± 1.25%** |