Large-scale Self-supervised Pre-training of Vision Transformers (ViT) on endoscopic images.
Official codebase of the paper: EndoViT: pretraining vision transformers on a large collection of endoscopic images
An earlier arXiv version (without semantic segmentation) can be found here: Whether and When does Endoscopy Domain Pretraining Make Sense?
Authors: Dominik Batić, Felix Holm, Ege Özsoy, Tobias Czempiel, Nassir Navab
@article{batic2023whether,
title={Whether and When does Endoscopy Domain Pretraining Make Sense?},
author={Bati{\'c}, Dominik and Holm, Felix and {\"O}zsoy, Ege and Czempiel, Tobias and Navab, Nassir},
journal={arXiv preprint arXiv:2303.17636},
year={2023}
}
Check out our 🤗 Hugging Face page for a guide on using EndoViT as a feature extractor (either frozen or as a backbone to be fine-tuned).
Alternatively, you can take a look at endovit_demo.py.
To prevent data leakage in our evaluation, we excluded the respective test sets of our segmentation, action triplet recognition, and surgical phase recognition tasks when training the backbone. You can find the weights for each of these versions of the backbone below.
Excluded Data (Test Sets) | Checkpoint |
---|---|
CholecSeg8k (Segmentation) | EndoViT_Seg |
CholecT45 (Action Triplet Detection) | EndoViT_ATD |
Cholec80 (Surgical Phase Recognition) | EndoViT_SPR |
Use these checkpoints if you wish to skip EndoViT's pre-training.
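For reference, below is a minimal sketch of loading one of the downloaded checkpoints into a ViT-Base/16 encoder with timm. The checkpoint file name and the "model" key are assumptions; inspect the actual file and adapt as needed.

```python
import torch
import timm

# Build a ViT-Base/16 encoder matching the MAE/EndoViT configuration
# (patch size 16, 224x224 input, no classification head).
model = timm.create_model("vit_base_patch16_224", pretrained=False, num_classes=0)

# Load the downloaded EndoViT encoder weights.
# NOTE: the path and the "model" key are assumptions; print(checkpoint.keys())
# and adapt if the layout differs.
checkpoint = torch.load("endovit_seg.pth", map_location="cpu")
state_dict = checkpoint.get("model", checkpoint)

# strict=False tolerates decoder- or head-specific keys the encoder lacks.
missing, unexpected = model.load_state_dict(state_dict, strict=False)
print("missing keys:", missing)
print("unexpected keys:", unexpected)

model.eval()
with torch.no_grad():
    features = model(torch.randn(1, 3, 224, 224))  # pooled (1, 768) features
print(features.shape)
```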
The development of novel Computer Vision (CV) methods in the medical field has been largely constrained by the lack of publicly available annotated data. Patient data and recorded surgical procedures are hard to obtain. They are considered highly sensitive information and are therefore protected by numerous laws. Even the annotation procedure is complicated, often requiring the involvement of multiple medical experts.
Consequently, public medical datasets are scarce, and the existing ones contain far fewer annotated images than the CV datasets used for the same task. Pre-training has been shown to be a viable strategy to mitigate the downsides of training on small datasets. However, most medical works use models pre-trained on natural images, creating a domain gap between pre-training and fine-tuning.
In this work, we explore the possibilities of pre-training models specifically for use in the endoscopic domain. To this end, we turn to Vision Transformers. Given the large number of parameters they contain, a large amount of data is needed to train them properly. Therefore, self-supervised pre-training strategies were developed, splitting the use of Transformers into two stages. First, a Transformer is pre-trained on a large collection of raw, unlabelled data to produce a model with a general understanding of the underlying domain. Afterwards, the resulting model is fine-tuned for a specific downstream task. This can now be done with significantly less labelled data.
The fact that Vision Transformers can be pre-trained using only raw data prompted us to combine the existing smaller medical datasets into a larger collection. To this end, we introduce Endo700k, a collection of 9 publicly available endoscopic datasets comprising more than 700,000 unlabelled images. An overview of the included datasets is given in the table below.
# | Dataset | # Images |
---|---|---|
1 | HeiCo | 347,257 |
2 | Cholec80 | 184,498 |
3 | PSI-AVA | 73,618 |
4 | ESAD | 49,544 |
5 | LapGyn4 (v1.2) | 38,192 |
6 | hSDB-instrument | 35,576 |
7 | DSAD | 13,195 |
8 | GLENDA (v1.0) | 1,083 |
9 | SurgicalActions160 | 761 |
- | Total | 743,724 |
Using Endo700k, we pre-train a Vision Transformer model following the Masked Autoencoder (MAE) approach. An input image is divided into equally-sized patches, and a large proportion of them (75%) is masked out. The Transformer is then tasked with reconstructing the missing input. Although a simple concept, it represents a challenging self-supervised task that induces a comprehensive understanding of the observed objects and scenes. Afterwards, the pre-trained ViT model can be fine-tuned as a feature extraction backbone on various downstream tasks. We visualize the pre-training and fine-tuning procedure in the following image.
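To make the masking step concrete, here is a small illustrative sketch (not the repository's training code) of MAE-style random masking, where 75% of the patches are dropped per sample:

```python
import torch

def random_masking(patches: torch.Tensor, mask_ratio: float = 0.75):
    """Randomly mask a fraction of patches, MAE-style.

    patches: (batch, num_patches, patch_dim) tensor of flattened image patches.
    Returns the visible patches, the binary mask (1 = masked) and the
    indices needed to restore the original patch order.
    """
    B, N, D = patches.shape
    num_keep = int(N * (1 - mask_ratio))

    # Shuffle patches per sample by sorting random noise.
    noise = torch.rand(B, N)
    ids_shuffle = torch.argsort(noise, dim=1)
    ids_restore = torch.argsort(ids_shuffle, dim=1)

    # Keep the first `num_keep` patches of the shuffled order.
    ids_keep = ids_shuffle[:, :num_keep]
    visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))

    # Binary mask in the original patch order: 0 = visible, 1 = masked.
    mask = torch.ones(B, N)
    mask[:, :num_keep] = 0
    mask = torch.gather(mask, 1, ids_restore)
    return visible, mask, ids_restore

# Example: a 224x224 image with 16x16 patches -> 196 patches, 49 stay visible.
patches = torch.randn(2, 196, 16 * 16 * 3)
visible, mask, _ = random_masking(patches)
print(visible.shape, mask.sum(dim=1))  # torch.Size([2, 49, 768]), 147 masked per sample
```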
Finally, we evaluated EndoViT's performance on three downstream tasks: semantic segmentation, action triplet detection, and surgical phase recognition.
We primarily compare EndoViT's performance to its ImageNet pre-trained ViT counterpart.
1) Clone the repository:
git clone https://github.com/DominikBatic/EndoViT.git endovit
cd endovit
2) Create our "endovit" conda environment:
conda env create -f conda_environment.yml
conda activate endovit
3) Download Cholec80 (GitHub) (LICENSE) (Request Form)
python ./datasets/Cholec80/download_cholec80.py --data_rootdir ./datasets/
The dataset will be downloaded to ./datasets/Cholec80.
4) Download and Prepare the Other Datasets
python ./datasets/Endo700k/download_and_prepare_Endo700k.py --all --synapse_email YOUR_EMAIL --synapse_password YOUR_PASSWORD
The datasets will be downloaded and prepared in ./datasets/Endo700k.
5) Prepare Cholec80
python ./datasets/Cholec80/prepare_cholec80.py
Three pre-processed versions of Cholec80 will be placed in ./datasets/Endo700k under: Cholec80_for_Segmentation, Cholec80_for_ActionTripletDetection and Cholec80_for_SurgicalPhaseRecognition. A validation set will be placed at ./datasets/validation_dataset/Cholec80_for_Validation.
6) Download ImageNet Pre-trained Weights
a) ImageNet weights for pre-training (encoder-decoder weights):
wget -O ./pretraining/mae/ImageNet_pretrained_models/mae_pretrain_vit_base_full.pth https://dl.fbaipublicfiles.com/mae/pretrain/mae_pretrain_vit_base_full.pth
b) ImageNet weights for fine-tuning (encoder weights):
wget -O ./pretraining/mae/ImageNet_pretrained_models/mae_pretrain_vit_base.pth https://dl.fbaipublicfiles.com/mae/pretrain/mae_pretrain_vit_base.pth
7) Run the pre-training
a) Run Pre-training for Semantic Segmentation
source ./pretraining/pretrained_endovit_models/EndoViT_for_Segmentation/pretrain_script
The resulting pre-trained checkpoint can be found at ./pretraining/pretrained_endovit_models/EndoViT_for_Segmentation/endovit_seg.pth.
b) Run Pre-training for Action Triplet Detection
source ./pretraining/pretrained_endovit_models/EndoViT_for_ActionTripletDetection/pretrain_script
The resulting pre-trained checkpoint can be found at ./pretraining/pretrained_endovit_models/EndoViT_for_ActionTripletDetection/endovit_ATD.pth.
c) Run Pre-training for Surgical Phase Recognition
source ./pretraining/pretrained_endovit_models/EndoViT_for_SurgicalPhaseRecognition/pretrain_script
The resulting pre-trained checkpoint can be found at ./pretraining/pretrained_endovit_models/EndoViT_for_SurgicalPhaseRecognition/endovit_SPR.pth.
In all downstream tasks, we compare fine-tuning with EndoViT weights (EndoViT), ImageNet weights (ImageNet) and training from scratch (NoPretraining). Additionally, for Action Triplet Detection and Surgical Phase Recognition, we test the performance of a ResNet50 backbone pre-trained on ImageNet. First, we fine-tune the models on the full train set (Full Dataset Experiments). We then assess their performance when trained on only a subset of the train set (Few-shot Learning Experiments).
8) Download CholecSeg8k dataset (3 GB)
The dataset will be downloaded as archive.zip. Rename it to CholecSeg8k.zip and place it at ./datasets/CholecSeg8k. Then extract it into ./datasets/CholecSeg8k/data with the following command:
unzip -uqq ./datasets/CholecSeg8k/CholecSeg8k.zip -d ./datasets/CholecSeg8k/data
9) Pre-process CholecSeg8k dataset
We wrote a short description of the CholecSeg8k structure here, where we also discuss all the pre-processing details.
In short: a) We correct some mistakes in the dataset. b) Instead of using the original watershed and colour masks provided by CholecSeg8k, we create an additional "ground truth mask". c) To compare our results to other architectures, we follow the pre-processing and training procedure of this benchmark. Most importantly, we downsample the original 13 classes to 8 by combining several classes into one.
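As a hedged illustration of step (c), the snippet below remaps a 13-class mask to 8 classes via a lookup table; the grouping in CLASS_MAP is a placeholder and not the mapping actually used by our pre-processing script:

```python
import numpy as np

# Hypothetical mapping from the 13 original CholecSeg8k class IDs to 8 merged
# classes. The exact grouping is defined in the pre-processing script; these
# IDs are placeholders for illustration only.
CLASS_MAP = {
    0: 0, 1: 1, 2: 2, 3: 3, 4: 4,
    5: 5, 6: 5,                  # example: merge two tool classes into one
    7: 6, 8: 6,                  # example: merge two tissue classes into one
    9: 7, 10: 7, 11: 7, 12: 7,   # example: collapse rare classes into "other"
}

def remap_mask(mask: np.ndarray) -> np.ndarray:
    """Remap an (H, W) integer mask with 13 classes to the merged 8 classes."""
    lut = np.zeros(max(CLASS_MAP) + 1, dtype=np.uint8)
    for src, dst in CLASS_MAP.items():
        lut[src] = dst
    return lut[mask]

mask = np.random.randint(0, 13, size=(480, 854))
print(np.unique(remap_mask(mask)))  # values in 0..7
```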
Pre-process CholecSeg8k by running:
python ./datasets/CholecSeg8k/utils/preprocess_CholecSeg8k_multi_process.py \
--data_dir ./datasets/CholecSeg8k/data \
--output_dir ./datasets/CholecSeg8k/data_preprocessed \
--cpu_count 8
Since the pre-processing takes a long time, we wrote a multi-process script; --cpu_count is the number of processes to spawn.
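For intuition, the multi-process pre-processing roughly follows the pattern below; the function and file-naming details are illustrative and not taken from the repository's script:

```python
from multiprocessing import Pool
from pathlib import Path

def preprocess_frame(path: Path) -> None:
    """Placeholder for the per-frame work (mask correction, class merging, saving)."""
    # ... load the frame and its masks, apply the corrections, save the result ...
    pass

def preprocess_dataset(data_dir: str, cpu_count: int = 8) -> None:
    # The "*_endo.png" frame naming is an assumption for this sketch.
    frames = sorted(Path(data_dir).rglob("*_endo.png"))
    # Spread the per-frame work over `cpu_count` worker processes.
    with Pool(processes=cpu_count) as pool:
        pool.map(preprocess_frame, frames)

if __name__ == "__main__":
    preprocess_dataset("./datasets/CholecSeg8k/data", cpu_count=8)
```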
10) Create a Relative Paths file (RP_file)
This creates a .csv file with relative paths from CholecSeg8k's root folder to each of its images:
python ./datasets/CholecSeg8k/utils/create_RP_file_for_CholecSeg8k.py \
--data_dir ./datasets/CholecSeg8k/data_preprocessed \
--output_dir ./datasets/CholecSeg8k/data_preprocessed
The .csv file can now be found at: ./datasets/CholecSeg8k/data_preprocessed/RP_CholecSeg8k.csv
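Conceptually, generating the RP_file amounts to walking the pre-processed folder and writing each image path relative to the dataset root. A minimal sketch (the .png suffix and single-column layout are assumptions, not the actual script's format):

```python
import csv
from pathlib import Path

data_dir = Path("./datasets/CholecSeg8k/data_preprocessed")
out_file = data_dir / "RP_CholecSeg8k.csv"

# Collect every image path relative to the dataset root.
rel_paths = sorted(p.relative_to(data_dir).as_posix()
                   for p in data_dir.rglob("*.png"))

with out_file.open("w", newline="") as f:
    writer = csv.writer(f)
    for rp in rel_paths:
        writer.writerow([rp])

print(f"Wrote {len(rel_paths)} relative paths to {out_file}")
```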
11) Full Dataset Experiments
On the Semantic Segmentation task, we tested with both lower-resolution and higher-resolution inputs.
Each script will perform 3 runs on different seeds. We always use the same 3 fixed seeds: 1665, 8914 and 37.
To reproduce our results, run the following scripts:
a) Low Res - EndoViT's pre-training resolution (224 x 224)
-------------- EndoViT -------------
source ./finetuning/semantic_segmentation/output_dir/low_res/full_dataset/EndoViT/hyperparam_script
------------- ImageNet -------------
source ./finetuning/semantic_segmentation/output_dir/low_res/full_dataset/ImageNet/hyperparam_script
----------- NoPretraining ----------
source ./finetuning/semantic_segmentation/output_dir/low_res/full_dataset/NoPretraining/hyperparam_script
b) High Res - resolution used in the benchmark paper (256 x 448)
-------------- EndoViT -------------
source ./finetuning/semantic_segmentation/output_dir/high_res/full_dataset/EndoViT/hyperparam_script
------------- ImageNet -------------
source ./finetuning/semantic_segmentation/output_dir/high_res/full_dataset/ImageNet/hyperparam_script
----------- NoPretraining ----------
source ./finetuning/semantic_segmentation/output_dir/high_res/full_dataset/NoPretraining/hyperparam_script
Compared to the ImageNet pre-trained model, EndoViT has more globally consistent outputs (highlighted in black). Furthermore, it is significantly better at reconstructing instruments' tips (highlighted in red).
Compared to the results introduced in the following benchmark, EndoViT outperforms other Transformers (UNETR) as well as various CNN architectures (including the best performing U-Net++).
12) Few-shot Learning Experiments
Few-shot experiments are always performed by training on a fixed number of training videos: in the case of the CholecSeg8k dataset, on 1, 2 or 4 of the 13 training videos in total.
Each script will perform 3 runs on different video subsets. We always use the same fixed video subsets: Few-shot Learning Subsets
To reproduce our results, run the following scripts:
a) Low Res - EndoViT's pre-training resolution (224 x 224)
-------------- EndoViT -------------
source ./finetuning/semantic_segmentation/output_dir/low_res/less_training_data/EndoViT/1_vid_only/hyperparam_script
source ./finetuning/semantic_segmentation/output_dir/low_res/less_training_data/EndoViT/2_vids_only/hyperparam_script
source ./finetuning/semantic_segmentation/output_dir/low_res/less_training_data/EndoViT/4_vids_only/hyperparam_script
------------- ImageNet -------------
source ./finetuning/semantic_segmentation/output_dir/low_res/less_training_data/ImageNet/1_vid_only/hyperparam_script
source ./finetuning/semantic_segmentation/output_dir/low_res/less_training_data/ImageNet/2_vids_only/hyperparam_script
source ./finetuning/semantic_segmentation/output_dir/low_res/less_training_data/ImageNet/4_vids_only/hyperparam_script
----------- NoPretraining ----------
source ./finetuning/semantic_segmentation/output_dir/low_res/less_training_data/NoPretraining/1_vid_only/hyperparam_script
source ./finetuning/semantic_segmentation/output_dir/low_res/less_training_data/NoPretraining/2_vids_only/hyperparam_script
source ./finetuning/semantic_segmentation/output_dir/low_res/less_training_data/NoPretraining/4_vids_only/hyperparam_script
b) High Res - resolution used in the benchmark paper (256 x 448)
-------------- EndoViT -------------
source ./finetuning/semantic_segmentation/output_dir/high_res/less_training_data/EndoViT/1_vid_only/hyperparam_script
source ./finetuning/semantic_segmentation/output_dir/high_res/less_training_data/EndoViT/2_vids_only/hyperparam_script
source ./finetuning/semantic_segmentation/output_dir/high_res/less_training_data/EndoViT/4_vids_only/hyperparam_script
------------- ImageNet -------------
source ./finetuning/semantic_segmentation/output_dir/high_res/less_training_data/ImageNet/1_vid_only/hyperparam_script
source ./finetuning/semantic_segmentation/output_dir/high_res/less_training_data/ImageNet/2_vids_only/hyperparam_script
source ./finetuning/semantic_segmentation/output_dir/high_res/less_training_data/ImageNet/4_vids_only/hyperparam_script
----------- NoPretraining ----------
source ./finetuning/semantic_segmentation/output_dir/high_res/less_training_data/NoPretraining/1_vid_only/hyperparam_script
source ./finetuning/semantic_segmentation/output_dir/high_res/less_training_data/NoPretraining/2_vids_only/hyperparam_script
source ./finetuning/semantic_segmentation/output_dir/high_res/less_training_data/NoPretraining/4_vids_only/hyperparam_script
In the Action Triplet Detection task, the goal is to detect the actions performed by a surgeon in every frame of a surgical procedure. Importantly, the actions need to be expressed as triplets of the form < Instrument, Verb, Target >, where Verb is the action performed using a surgical Instrument on a Target anatomical structure.
We build our code upon the Rendezvous (RDV) model, designed specifically for this task. The task itself is described in more detail in the RDV repository.
In the end, we were not able to successfully integrate ViT into RDV. Therefore, we test a simple model consisting of a feature extraction backbone (either ResNet50 or ViT) followed by a single linear layer.
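A hedged sketch of that simple model is shown below; the class and argument names are ours, not the repository's, and the 100 outputs correspond to the CholecT45 triplet classes:

```python
import torch
import torch.nn as nn
import timm

class BackboneWithLinearHead(nn.Module):
    """Feature extraction backbone (ViT or ResNet50) + a single linear layer.

    Illustrative only: names and defaults are not taken from this repository.
    """
    def __init__(self, backbone_name: str = "vit_base_patch16_224",
                 num_classes: int = 100):  # 100 triplet classes in CholecT45
        super().__init__()
        # num_classes=0 makes timm return pooled features instead of logits.
        self.backbone = timm.create_model(backbone_name, pretrained=False,
                                          num_classes=0)
        self.head = nn.Linear(self.backbone.num_features, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.backbone(x))

model = BackboneWithLinearHead()
logits = model(torch.randn(2, 3, 224, 224))                  # (2, 100) logits
loss = nn.BCEWithLogitsLoss()(logits, torch.zeros(2, 100))   # multi-label loss
```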
13) Download CholecT45 dataset (150 GB)
Place the dataset at ./datasets/CholecT45. The CholecT45 folder should include a data subfolder containing the raw images, as well as the annotation subfolders: triplet, instrument, verb and target.
14) Full Dataset Experiments
Each script will perform 3 runs on different seeds. We always use the same 3 fixed seeds: 1665, 8914 and 37.
To reproduce our results, run the following scripts:
########## ViT Backbone ##########
-------------- EndoViT -------------
source ./finetuning/action_triplet_detection/output_dir/full_dataset/ViT_backbone/EndoViT/training_script
------------- ImageNet -------------
source ./finetuning/action_triplet_detection/output_dir/full_dataset/ViT_backbone/ImageNet/training_script
----------- NoPretraining ----------
source ./finetuning/action_triplet_detection/output_dir/full_dataset/ViT_backbone/NoPretraining/training_script
########## CNN Backbone ##########
------------- ResNet50 -------------
source ./finetuning/action_triplet_detection/output_dir/full_dataset/ResNet50_backbone/training_script
15) Few-shot Learning Experiments
Few-shot experiments are always performed by training on a fixed number of training videos: in the case of the CholecT45 dataset, on 2, 4 or 8 of the 31 training videos in total.
Each script will perform 3 runs on different video subsets. We always use the same fixed video subsets: Few-shot Learning Subsets
To reproduce our results, run the following scripts:
########## ViT Backbone ##########
-------------- EndoViT -------------
source ./finetuning/action_triplet_detection/output_dir/less_training_data/ViT_backbone/EndoViT/training_script
------------- ImageNet -------------
source ./finetuning/action_triplet_detection/output_dir/less_training_data/ViT_backbone/ImageNet/training_script
----------- NoPretraining ----------
-> Since the results of training from scratch in the Full Dataset Experiments were significantly worse, we skip this training.
########## CNN Backbone ##########
------------- ResNet50 -------------
source ./finetuning/action_triplet_detection/output_dir/less_training_data/ResNet50_backbone/training_script
16) Full Dataset Experiments
Each script will perform 3 runs on different seeds. We always use the same 3 fixed seeds: 1665, 8914 and 37.
To reproduce our results, run the following scripts:
########## ViT Backbone ##########
-------------- EndoViT -------------
source ./finetuning/surgical_phase_recognition/output_dir/full_dataset/ViT_backbone/EndoViT/training_script
------------- ImageNet -------------
source ./finetuning/surgical_phase_recognition/output_dir/full_dataset/ViT_backbone/ImageNet/training_script
----------- NoPretraining ----------
source ./finetuning/surgical_phase_recognition/output_dir/full_dataset/ViT_backbone/NoPretraining/training_script
########## CNN Backbone ##########
------------- ResNet50 -------------
source ./finetuning/surgical_phase_recognition/output_dir/full_dataset/ResNet50_backbone/training_script
17) Few-shot Learning Experiments
Few-shot experiments are always performed by training on a fixed number of training videos: in the case of the Cholec80 dataset, on 2, 4 or 8 of the 40 training videos in total.
Each script will perform 3 runs on different video subsets. We always use the same fixed video subsets: Few-shot Learning Subsets
To reproduce our results, run the following scripts:
########## ViT Backbone ##########
-------------- EndoViT -------------
source ./finetuning/surgical_phase_recognition/output_dir/less_training_data/ViT_backbone/EndoViT/training_script
------------- ImageNet -------------
source ./finetuning/surgical_phase_recognition/output_dir/less_training_data/ViT_backbone/ImageNet/training_script
----------- NoPretraining ----------
-> Since the results of training from scratch in the Full Dataset Experiments were significantly worse, we skip this training.
########## CNN Backbone ##########
------------- ResNet50 -------------
source ./finetuning/surgical_phase_recognition/output_dir/less_training_data/ResNet50_backbone/training_script