
ViLa-MIL

Dual-scale Vision-Language Multiple Instance Learning for Whole Slide Image Classification, CVPR 2024.

Jiangbo Shi, Chen Li, Tieliang Gong, Yefeng Zheng, Huazhu Fu

[Open Access Version] | [Cite]

Abstract: Multiple instance learning (MIL)-based frameworks have become the mainstream approach for processing whole slide images (WSIs), which have giga-pixel size and hierarchical image context, in digital pathology. However, these methods heavily depend on a substantial number of bag-level labels and learn solely from the original slides, which are easily affected by variations in data distribution. Recently, vision-language model (VLM)-based methods introduced a language prior by pre-training on large-scale pathological image-text pairs. However, the previous text prompts lack consideration of pathological prior knowledge and therefore do not substantially boost the model's performance. Moreover, collecting such pairs and the pre-training process are very time-consuming and resource-intensive. To solve the above problems, we propose a dual-scale vision-language multiple instance learning (ViLa-MIL) framework for whole slide image classification. Specifically, we propose a dual-scale visual descriptive text prompt based on a frozen large language model (LLM) to effectively boost the performance of the VLM. To transfer the VLM to process WSIs efficiently, for the image branch we propose a prototype-guided patch decoder that aggregates patch features progressively by grouping similar patches into the same prototype; for the text branch, we introduce a context-guided text decoder that enhances the text features by incorporating multi-granular image contexts. Extensive studies on three multi-cancer and multi-center subtyping datasets demonstrate the superiority of ViLa-MIL.

1. Pre-requisites

Python (3.7.7), h5py (2.10.0), matplotlib (3.1.1), numpy (1.18.1), opencv-python (4.1.1), openslide-python (1.1.1), openslide (3.4.1), pandas (1.1.3), pillow (7.0.0), PyTorch (1.6.0), scikit-learn (0.22.1), scipy (1.4.1), tensorboardx (1.9), torchvision (0.7.0), captum (0.2.0), shap (0.35.0), clip (1.0).
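
If convenient, the Python packages above can be collected into a requirements file; a minimal sketch is given below (exact pins may need adjusting to the wheels available for your platform, e.g., opencv-python uses four-part version numbers):

h5py==2.10.0
matplotlib==3.1.1
numpy==1.18.1
opencv-python~=4.1.1
openslide-python==1.1.1
pandas==1.1.3
pillow==7.0.0
torch==1.6.0
torchvision==0.7.0
scikit-learn==0.22.1
scipy==1.4.1
tensorboardx==1.9
captum==0.2.0
shap==0.35.0
# clip (1.0) is OpenAI's CLIP, usually installed from its GitHub repository (git+https://github.com/openai/CLIP.git)
# openslide (3.4.1) is the native library and is installed with the system package manager, not pip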

2. Download Dataset

The two public TCGA-RCC and TCGA-Lung datasets can be downloaded from the NIH Genomic Data Commons (GDC) Data Portal. For the downloading tool, please refer to the GDC Data Transfer Tool.

3. Prepare Dataset File

For each dataset, a .csv file in the following format is needed; put it into the dataset_csv folder:

Header: 'case_id, slide_id, label'
Each line: 'patient_0, TCGA-UZ-A9PS-01Z-00-DX1.3CAF5087-CDAB-42B5-BE7C-3D5643AEAA6D, CCRCC'
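
A minimal sketch of building such a file with pandas (the output file name and the slide-to-label mapping below are hypothetical; fill them in from your own slide inventory):

# Sketch: assemble dataset_csv/tcga_rcc.csv from a {slide_id: (case_id, label)} mapping.
# The mapping and the output file name are hypothetical examples.
import pandas as pd

slides = {
    "TCGA-UZ-A9PS-01Z-00-DX1.3CAF5087-CDAB-42B5-BE7C-3D5643AEAA6D": ("patient_0", "CCRCC"),
    # add the remaining slides here
}

df = pd.DataFrame(
    [(case, slide, label) for slide, (case, label) in slides.items()],
    columns=["case_id", "slide_id", "label"],
)
df.to_csv("dataset_csv/tcga_rcc.csv", index=False)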

4. Preprocess Data

4.1 Generate Patch Coordinate

Use the sliding-window method to generate patch coordinates from each WSI for later cropping.

python create_patches_fp.py \
--source SOURCE_FOLDER \
--slide_name_file SLIDE_NAME_FILE \
--preset tcga.csv \
--save_dir SAVE_FOLDER \
--patch_size 1024 \
--step_size 1024 \
--seg \
--patch \
--uuid_name_file UUID_NAME_FILE

The list of parameters is as follows:
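
Conceptually, this step records the top-left coordinate of every tile visited by the sliding window; a simplified sketch of the idea follows (tissue segmentation and the other options of create_patches_fp.py are omitted):

# Simplified sketch of sliding-window patch-coordinate generation.
# create_patches_fp.py additionally performs tissue segmentation and stores the
# coordinates in .h5 files; this only illustrates the sliding-window idea.
import openslide

def sliding_window_coords(slide_path, patch_size=1024, step_size=1024, level=0):
    slide = openslide.OpenSlide(slide_path)
    width, height = slide.level_dimensions[level]
    coords = []
    for y in range(0, height - patch_size + 1, step_size):
        for x in range(0, width - patch_size + 1, step_size):
            coords.append((x, y))  # top-left corner of one patch
    return coords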

4.2 Crop Patches

Use the patch coordinate files generated above to crop the patches.

cd feature_extraction
python patch_generation.py

Note: the paths on lines 12-17 of patch_generation.py need to be defined.

Parameter Descriptions:
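
As a rough illustration of what this step does, assuming CLAM-style .h5 coordinate files with an (N, 2) 'coords' dataset (patch_generation.py itself reads its inputs from the paths defined at the top of the script):

# Rough sketch of cropping patches from a WSI given precomputed coordinates.
# Assumes an .h5 file with an (N, 2) "coords" dataset; adjust keys and paths to your setup.
import os
import h5py
import openslide

def crop_patches(slide_path, coord_h5_path, out_dir, patch_size=1024, level=0):
    os.makedirs(out_dir, exist_ok=True)
    slide = openslide.OpenSlide(slide_path)
    with h5py.File(coord_h5_path, "r") as f:
        coords = f["coords"][:]
    for i, (x, y) in enumerate(coords):
        patch = slide.read_region((int(x), int(y)), level, (patch_size, patch_size)).convert("RGB")
        patch.save(os.path.join(out_dir, f"{i}_{x}_{y}.png"))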

4.3 Extract Patch Features

Use a pretrained encoder (e.g., the CLIP-pretrained ResNet-50) to extract patch features.

python patch_extraction.py \
--patches_path 'PATCHES_PATH' \
--library_path 'LIBRARY_PATH' \
--model_name 'clip_RN50' \
--batch_size 64

Parameter Descriptions:
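
A minimal sketch of CLIP-ResNet50 feature extraction for a folder of patch images (assuming OpenAI's clip package; patch_extraction.py handles batching and feature storage itself):

# Minimal sketch of extracting patch features with a CLIP-pretrained ResNet-50.
# PATCHES_PATH is a placeholder for the cropped-patch folder.
import glob
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("RN50", device=device)

features = []
with torch.no_grad():
    for path in sorted(glob.glob("PATCHES_PATH/*.png")):
        image = preprocess(Image.open(path)).unsqueeze(0).to(device)
        features.append(model.encode_image(image).cpu())  # 1 x 1024 for RN50
features = torch.cat(features, dim=0)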

5. Split Datasets

5.1 Generate k dataset splits with different seeds.

python create_splits_seq.py \
--label_frac 1 \
--k 5 \
--task 'TASK' \
--val_frac VAL_FRAC \
--test_frac TEST_FRAC \
--dataset DATASET

Parameter Descriptions:
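
Conceptually, each split is a label-stratified train/val/test partition drawn with a fixed seed; a simplified sketch follows (the dataset file name is hypothetical, and create_splits_seq.py writes per-fold split files rather than keeping frames in memory):

# Simplified sketch of one label-stratified split for a single seed;
# create_splits_seq.py repeats this for k seeds and saves the split .csv files.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("dataset_csv/tcga_rcc.csv")  # hypothetical file name
val_frac, test_frac, seed = 0.1, 0.2, 1

train_val, test = train_test_split(df, test_size=test_frac, stratify=df["label"], random_state=seed)
train, val = train_test_split(
    train_val,
    test_size=val_frac / (1.0 - test_frac),
    stratify=train_val["label"],
    random_state=seed,
)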

5.2 Build the few-shot dataset split.

python create_splits_fewshot.py

Note: the parameters on lines 5-8 of create_splits_fewshot.py need to be defined.

Parameter Descriptions:
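
The idea of the few-shot split is to keep only a small number of training slides per class; an illustrative sketch is below (the shot count, column names, and file paths are assumptions; the script configures them via the variables noted above):

# Illustrative sketch of building a few-shot training split:
# keep at most `shot` training slides per class, sampled with a fixed seed.
import pandas as pd

def make_fewshot(train_csv, out_csv, shot=16, seed=1):
    train = pd.read_csv(train_csv)
    fewshot = (
        train.groupby("label", group_keys=False)
        .apply(lambda g: g.sample(n=min(shot, len(g)), random_state=seed))
    )
    fewshot.to_csv(out_csv, index=False)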

6. Prepare Text Prompt

The text prompts are generated automatically by querying a frozen LLM (e.g., GPT-3.5) with the following question:

Q: What are the visually descriptive characteristics of {class name} at low and high resolution in the whole slide image?

The text prompt files generated with GPT-3.5 for the TCGA-RCC and TCGA-Lung datasets are stored in the text_prompt folder.
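
A small sketch of how the per-class queries can be assembled before sending them to the LLM (the class names below are examples for TCGA-RCC; the exact storage format of the answers follows the files in text_prompt):

# Sketch: build the LLM query for each class; send each query to GPT-3.5 (or another
# LLM) and save the returned low-/high-resolution descriptions as the text prompts.
classes = [
    "clear cell renal cell carcinoma",      # example class names for TCGA-RCC
    "papillary renal cell carcinoma",
    "chromophobe renal cell carcinoma",
]

queries = {
    c: "What are the visually descriptive characteristics of "
       f"{c} at low and high resolution in the whole slide image?"
    for c in classes
}

for c, q in queries.items():
    print(q)  # paste into GPT-3.5 or call an LLM API, then store the answer for class c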

7. Training Model

Run the following script to train ViLa-MIL.

export CUDA_VISIBLE_DEVICES=0
python main.py \
--seed 1 \
--drop_out \
--early_stopping \
--lr 1e-4 \
--k 5 \
--label_frac 1 \
--bag_loss ce \
--task 'TASK' \
--results_dir 'RESULT_DIR' \
--exp_code 'EXP_CODE' \
--model_type ViLa_MIL \
--mode transformer \
--log_data \
--data_root_dir 'DATA_ROOT_DIR' \
--data_folder_s 'DATA_FOLDER_S' \
--data_folder_l 'DATA_FOLDER_L' \
--split_dir 'SPLIT_DIR' \
--text_prompt_path 'TEXT_PROMPT_PATH' \
--prototype_number 16

Parameter Descriptions:

8. Eval Model

Evaluate the trained model using the following script.

export CUDA_VISIBLE_DEVICES=0
python eval.py \
--drop_out \
--k 5 \
--k_start 0 \
--k_end 5 \
--task 'TASK' \
--results_dir 'RESULTS_DIR' \
--models_exp_code 'MODELS_EXP_CODE' \
--save_exp_code 'SAVE_EXP_CODE' \
--model_type ViLa_MIL \
--mode transformer \
--splits_dir 'SPLITS_DIR' \
--data_root_dir 'DATA_ROOT_DIR' \
--data_folder_s 'DATA_FOLDER_S' \
--data_folder_l 'DATA_FOLDER_L' \
--text_prompt_path 'TEXT_PROMPT_PATH' \
--prototype_number 16

Parameter Descriptions:

Note that the symbols $N_l$ and $N$ in Eqs. (6) and (7) of the main manuscript should be revised to $N_p$.

Citation

If you find our work useful in your research, please consider citing our paper:

@inproceedings{shi2024vila,
 title={{ViLa-MIL}: Dual-scale Vision-Language Multiple Instance Learning for Whole Slide Image Classification},
 author={Shi, Jiangbo and Li, Chen and Gong, Tieliang and Zheng, Yefeng and Fu, Huazhu},
 booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
 pages={11248--11258},
 year={2024}
}

Acknowledgement

This project is based on CLAM, CoOp, and Self-Supervised-ViT-path. Many thanks to these awesome projects.