Chenxin Li1 Hengyu Liu1 Yifan Liu1* Brandon Y. Feng2 Wuyang Li1 Xinyu Liu1 Zhen Chen3 Jing Shao4 Yixuan Yuan1†
1CUHK 2MIT CSAIL 3CAS CAIR 4Shanghai AI Lab
* Equal Contributions. † Corresponding Author.
```bash
git clone https://github.com/XGGNet/Endora.git
cd Endora

conda create -n Endora python=3.10
conda activate Endora

pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
```
Tips A: We tested the framework with PyTorch 2.1.2 compiled against CUDA 11.8. Other versions will likely work as well, but are not fully verified.
Tips B: A GPU with 24 GB of memory (or more) is recommended for video sampling with Endora inference, and 48 GB (or more) for Endora training.
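A quick way to verify that your environment matches Tips A/B (a minimal sanity-check sketch):

```python
# Sanity-check the installed PyTorch/CUDA versions and available GPU memory.
import torch

print(torch.__version__)   # expect 2.1.2
print(torch.version.cuda)  # expect 11.8
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    # >= 24 GB recommended for sampling, >= 48 GB for training (Tips B)
    print(f"{props.name}: {props.total_memory / 1024**3:.1f} GB")
```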
Colonoscopic: The dataset from the original paper can be found here. You can directly use the video data already processed by Endo-FM without further preprocessing.
Kvasir-Capsule: The dataset from the original paper can be found here. You can directly use the video data already processed by Endo-FM without further preprocessing.
CholecTriplet: The dataset from the original paper can be found here. You can directly use the video data already processed by Endo-FM without further preprocessing.
First, run process_data.py and process_list.py to extract the frames and build the corresponding file lists:
```bash
CUDA_VISIBLE_DEVICES=gpu_id python process_data.py -s /path/to/datasets -t /path/to/save/video/frames
CUDA_VISIBLE_DEVICES=gpu_id python process_list.py -f /path/to/video/frames -t /path/to/save/text
```
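For reference, below is a minimal sketch of the kind of frame splitting process_data.py performs (illustrative only; the real script's options and naming may differ):

```python
# Illustrative frame splitting: write each video's frames as numbered .jpg files,
# mirroring the folder layout shown below. Approximates process_data.py only.
import glob
import os

import cv2  # pip install opencv-python

def split_videos(src_dir: str, dst_dir: str) -> None:
    for vid_path in sorted(glob.glob(os.path.join(src_dir, "*.mp4"))):
        name = os.path.splitext(os.path.basename(vid_path))[0]
        out_dir = os.path.join(dst_dir, name)
        os.makedirs(out_dir, exist_ok=True)
        cap = cv2.VideoCapture(vid_path)
        idx = 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            cv2.imwrite(os.path.join(out_dir, f"{idx:05d}.jpg"), frame)
            idx += 1
        cap.release()

split_videos("/path/to/datasets", "/path/to/save/video/frames")
```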
The resulting file structure is as follows.
```
data
├── CholecT45
│   ├── 00001.mp4
│   └── ...
├── Colonoscopic
│   ├── 00001.mp4
│   └── ...
├── Kvasir-Capsule
│   ├── 00001.mp4
│   └── ...
├── CholecT45_frames
│   ├── train_128_list.txt
│   ├── 00001
│   │   ├── 00000.jpg
│   │   └── ...
│   └── ...
├── Colonoscopic_frames
│   ├── train_128_list.txt
│   ├── 00001
│   │   ├── 00000.jpg
│   │   └── ...
│   └── ...
└── Kvasir-Capsule_frames
    ├── train_128_list.txt
    ├── 00001
    │   ├── 00000.jpg
    │   └── ...
    └── ...
```
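To check the preprocessing output, a small reader can verify the list file against the frame folders (a sketch; it assumes each line of train_128_list.txt names one clip folder):

```python
# Verify that every entry in train_128_list.txt has a matching frame folder.
# Assumes each line of the list names one clip folder (e.g., "00001").
import os

def check_list(frames_root: str, list_name: str = "train_128_list.txt") -> None:
    with open(os.path.join(frames_root, list_name)) as f:
        clips = [line.strip() for line in f if line.strip()]
    missing = [c for c in clips if not os.path.isdir(os.path.join(frames_root, c))]
    print(f"{len(clips)} clips listed, {len(missing)} missing folders")

check_list("data/Colonoscopic_frames")
```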
You can directly sample endoscopy videos from a checkpoint model. For quick usage of our pre-trained models, run sample.py via the following scripts; arguments such as the number of sampling steps can be customized there.
Simple sampling to generate a video:

```bash
bash sample/col.sh
bash sample/kva.sh
bash sample/cho.sh
```
DDP (multi-GPU) sampling:

```bash
bash sample/col_ddp.sh
bash sample/kva_ddp.sh
bash sample/cho_ddp.sh
```
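To generate samples for all three datasets in one go, a small driver can call the scripts above in sequence (a minimal sketch; it assumes the repository root as the working directory):

```python
# Run the single-GPU sampling script for each dataset in sequence.
# Assumes this is launched from the repository root.
import subprocess

for script in ("sample/col.sh", "sample/kva.sh", "sample/cho.sh"):
    subprocess.run(["bash", script], check=True)  # stop on the first failure
```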
The pre-trained DINO weights can be found here; our implementation uses ViT-B/8 when training Endora. The saved weight path needs to be edited in ./configs.
Train Endora at 128x128 resolution with N GPUs on the Colonoscopic dataset:
```bash
torchrun --nnodes=1 --nproc_per_node=N train.py \
  --config ./configs/col/col_train.yaml \
  --port PORT \
  --mode type_cnn \
  --prr_weight 0.5 \
  --pretrained_weights /path/to/pretrained/DINO
```
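The --prr_weight flag controls the strength of the prior guidance from the frozen DINO features. For intuition only, here is a minimal sketch of what such a prior-regularization term could look like (the feature shapes, the cosine formulation, and the function name are assumptions, not the repo's actual code):

```python
# Illustrative prior-regularization term: encourage the generator's intermediate
# features to align with frozen DINO features. The shapes and the cosine
# formulation are assumptions; see the repo for the actual implementation.
import torch
import torch.nn.functional as F

def prior_regularization(model_feats: torch.Tensor,
                         dino_feats: torch.Tensor,
                         prr_weight: float = 0.5) -> torch.Tensor:
    # model_feats, dino_feats: (batch, tokens, dim); DINO features stay frozen.
    sim = F.cosine_similarity(model_feats, dino_feats.detach(), dim=-1)
    return prr_weight * (1.0 - sim).mean()

# total_loss = diffusion_loss + prior_regularization(model_feats, dino_feats, prr_weight=0.5)
```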
Alternatively, run Endora training with the scripts in ./train_scripts:

```bash
bash train_scripts/col/train_col.sh
bash train_scripts/kva/train_kva.sh
bash train_scripts/cho/train_cho.sh
```
We first split the generated videos into frames and then use the code from StyleGAN-V to evaluate the model in terms of FVD, FID, and IS. Test with process_data.py and the code in stylegan-v:
```bash
CUDA_VISIBLE_DEVICES=gpu_id python process_data.py -s /path/to/generated/video -t /path/to/video/frames
cd /path/to/stylegan-v
CUDA_VISIBLE_DEVICES=gpu_id python ./src/scripts/calc_metrics_for_dataset.py \
  --fake_data_path /path/to/video/frames \
  --real_data_path /path/to/dataset/frames
```
Alternatively, test with the script test.sh:

```bash
bash test.sh
```
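To chain frame splitting and metric computation in one step, a small driver like the following can help (a sketch; all paths are placeholders to adapt to your layout):

```python
# Split generated videos into frames, then score them with the StyleGAN-V metric
# script against the real dataset frames. All paths below are placeholders.
import subprocess

gen_videos = "/path/to/generated/video"
fake_frames = "/path/to/video/frames"
real_frames = "/path/to/dataset/frames"

subprocess.run(["python", "process_data.py", "-s", gen_videos, "-t", fake_frames], check=True)
subprocess.run(
    ["python", "./src/scripts/calc_metrics_for_dataset.py",
     "--fake_data_path", fake_frames,
     "--real_data_path", real_frames],
    cwd="/path/to/stylegan-v",  # the metric script lives in the stylegan-v checkout
    check=True,
)
```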
We provide training and testing scripts for the compared methods on endoscopy video generation (see Table 1, Quantitative Comparison, in the paper). Please enter Other-Methods/ for more details. We will keep cleaning up the code.
The pre-trained weights for all compared methods are available here.
Here is an overview of performance & checkpoints on the Colonoscopic dataset.

| Method | FVD↓ | FID↓ | IS↑ | Checkpoints |
|---|---|---|---|---|
| StyleGAN-V | 2110.7 | 226.14 | 2.12 | Link |
| LVDM | 1036.7 | 96.85 | 1.93 | Link |
| MoStGAN-V | 468.5 | 53.17 | 3.37 | Link |
| Endora (Ours) | 460.7 | 13.41 | 3.90 | Link |
We also provide training for other variants of Endora (see Table 3, Ablation Studies, in the paper). Training and sampling scripts are in train_scripts/ablation and sample/ablation, respectively.

```bash
bash train_scripts/ablation/train_col_ablation{i}.sh  # e.g., i=1 runs the first-row ablation experiment
bash sample/ablation/col_ddp_ablation{i}.sh           # e.g., i=1 samples from the first-row ablation model
```
| Modified Diffusion | Spatiotemporal Encoding | Prior Guidance | FVD↓ | FID↓ | IS↑ | Checkpoints |
|---|---|---|---|---|---|---|
| ❌ | ❌ | ❌ | 611.9 | 22.44 | 3.61 | Link |
| ✅ | ❌ | ❌ | 593.7 | 17.75 | 3.65 | Link |
| ✅ | ✅ | ❌ | 493.5 | 13.88 | 3.89 | Link |
| ✅ | ✅ | ✅ | 460.7 | 13.41 | 3.90 | Link |
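To reproduce every ablation row above in sequence, a simple loop over the row index works (a sketch; it assumes the four scripts indexed 1-4, matching Table 3):

```python
# Train and then sample each ablation variant from Table 3 (rows 1-4 assumed).
import subprocess

for i in range(1, 5):
    subprocess.run(["bash", f"train_scripts/ablation/train_col_ablation{i}.sh"], check=True)
    subprocess.run(["bash", f"sample/ablation/col_ddp_ablation{i}.sh"], check=True)
```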
We provide steps for reproducing the results of extending Endora to downstream applications (see Section 3.3 in the paper).
Please follow the steps:
1. Run `bash semi_baseline.sh` to obtain the supervised-only lower bound for semi-supervised disease diagnosis.
2. Run `bash semi_gen.sh` for semi-supervised disease diagnosis using the augmented unlabeled data.

| Method | Colonoscopic | CholecTriplet |
|---|---|---|
| Supervised-only | 74.5 | 74.5 |
| LVDM | 76.2 | 78.0 |
| Endora (Ours) | 87.0 | 82.0 |
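Conceptually, semi_gen.sh enlarges the unlabeled pool with Endora-generated clips before semi-supervised training. A minimal sketch of that data mixing (the paths and directory layout are assumptions, not the repo's actual code):

```python
# Illustrative: merge real unlabeled clips with Endora-generated clips into one
# unlabeled pool for semi-supervised training. Paths are placeholders.
import glob

real_unlabeled = sorted(glob.glob("/path/to/unlabeled/frames/*"))
generated = sorted(glob.glob("/path/to/endora/generated/frames/*"))
unlabeled_pool = real_unlabeled + generated  # generated clips act as extra unlabeled data
print(f"{len(real_unlabeled)} real + {len(generated)} generated = {len(unlabeled_pool)} unlabeled clips")
```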
Videos of rendered RGB & rendered depth:
We greatly appreciate the tremendous effort behind the following projects!
If you find this work helpful for your project, please consider citing the following paper:
```bibtex
@article{li2024endora,
  author  = {Chenxin Li and Hengyu Liu and Yifan Liu and Brandon Y. Feng and Wuyang Li and Xinyu Liu and Zhen Chen and Jing Shao and Yixuan Yuan},
  title   = {Endora: Video Generation Models as Endoscopy Simulators},
  journal = {arXiv preprint arXiv:2403.11050},
  year    = {2024}
}
```