dangnh0611 / kaggle_rsna_breast_cancer

1st place solution of RSNA Screening Mammography Breast Cancer Detection competition on Kaggle: https://www.kaggle.com/competitions/rsna-breast-cancer-detection
MIT License

1st place solution for RSNA Screening Mammography Breast Cancer Detection competition on Kaggle

Solution write up: https://www.kaggle.com/competitions/rsna-breast-cancer-detection/discussion/392449

[Figure: overall pipeline]

Notes:

Please download the trained models and put them in assets/trained/:

# this assumes that the Kaggle API is installed: https://github.com/Kaggle/kaggle-api
kaggle datasets download -d dangnh0611/rsna-breast-cancer-detection-best-ckpts -p assets/trained
# -p places the zip inside assets/trained, so unzip it from there
unzip assets/trained/rsna-breast-cancer-detection-best-ckpts.zip -d assets/trained/
rm assets/trained/rsna-breast-cancer-detection-best-ckpts.zip

TABLE OF CONTENTS

1. ARCHIVE CONTENTS


SETTINGS.json defines base paths for IO:

2. HARDWARE

The following machine was used to create the final solution: NVIDIA DGX A100. Most of my experiments can be run on 1-3 A100 GPUs. However, the final results can easily be reproduced on a single A100 GPU (40GB GPU memory).

3. DATA SETUP

Refer to docs/DATASETS.md for details on how to set up the datasets correctly.

4. SOLUTION PIPELINE

Reproducing the entire solution takes several stages. I briefly describe them here to make the rest of this document easier to follow.

  1. Train a YOLOX model on a subset of the competition images for breast ROI detection
    • Convert competition DICOM files to 8-bit PNG images
    • Convert detection labels from YOLOv5 format to COCO format (YOLOX accepts COCO format without any modifications)
    • Train a YOLOX-nano 416x416 model on those images (521 train images, 50 val images)
    • Convert the trained YOLOX model from Torch to a TensorRT engine
  2. Use the trained YOLOX TensorRT engine to crop the breast ROI and save it to disk as 8-bit PNGs
    • Clean and restructure the raw datasets (competition data + external data) in a unified way (standardize the format/structure)
    • DICOM decoding --> ROI detection (YOLOX) --> ROI crop --> normalization --> save to disk
  3. Train Convnext-small models for classification using those saved ROI images
    • Do a 4-fold split on the competition data
    • Train 4 Convnext-small models, one per fold
    • Select the best checkpoint for each fold
    • Convert those models from Torch to TensorRT
  4. Inference on test data (submission)
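The label conversion in stage 1 can be illustrated with a minimal sketch. YOLOv5 labels store each box as normalized (x_center, y_center, width, height), while COCO stores absolute [x_min, y_min, width, height] in pixels. The function name and the example image size are illustrative only and are not taken from the repository's own conversion script.

```python
def yolo_to_coco_bbox(x_center, y_center, width, height, img_w, img_h):
    """Convert one normalized YOLOv5 box to an absolute-pixel COCO box.

    YOLOv5: (x_center, y_center, width, height), all in [0, 1].
    COCO:   [x_min, y_min, width, height], in pixels.
    """
    box_w = width * img_w
    box_h = height * img_h
    x_min = x_center * img_w - box_w / 2.0
    y_min = y_center * img_h - box_h / 2.0
    return [x_min, y_min, box_w, box_h]


if __name__ == "__main__":
    # Hypothetical 400x800 image with a centered box.
    print(yolo_to_coco_bbox(0.5, 0.5, 0.5, 0.25, img_w=400, img_h=800))
    # → [100.0, 300.0, 200.0, 200.0]
```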

5. SOLUTION REPRODUCING

All the following instructions assume that the datasets (competition + external data) are already set up. There are 3 options to reproduce the solution:

  1. Use trained models

    • No training, just use trained models in assets/trained to make predictions
  2. Do not re-train YOLOX, fully reproduce Convnext-small classification models

    • Skip re-training YOLOX; use (my) trained YOLOX for the following steps
    • Re-train the 4 Convnext-small classification models. This part is 100% reproducible: it gives identical models, training logs, and results, with no randomness.
    • This option should give 100% identical scores on CV, LB, and PB
  3. Re-train all parts (reproduce from scratch)

    • Use none of (my) trained models; re-train all of them from scratch
    • This may not give 100% identical results/scores, because YOLOX training cannot be reproduced to give EXACTLY the same model as the one used in the winning submission. More details here
    • Note that the dataset used to train the Convnext-small classification models is generated based on YOLOX's predictions, so changes in YOLOX cause changes in the Convnext-small classification models --> the Convnext-small classification models will also not be 100% reproducible.
    • But in general, it should give nearly identical results/scores within a reasonable margin.
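One way to check the "identical models" claim of option 2 is to compare a re-trained file against the provided one byte for byte. The sketch below hashes two files with SHA-256; the helper names are hypothetical, and a byte-level match is a stricter test than matching weights (a checkpoint that embeds timestamps or paths could differ even when the weights agree).

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_identical(path_a, path_b):
    """True if both files have exactly the same bytes."""
    return sha256_of_file(path_a) == sha256_of_file(path_b)
```

For example, `files_identical(provided_ckpt, retrained_ckpt)` could be called with the path of a provided checkpoint and the path of the matching re-trained one.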

5.1. Use trained models to make predictions

5.1.1. Convert trained YOLOX to TensorRT

A YOLOX-nano 416 engine optimized for NVIDIA A100 is provided at assets/trained/yolox_nano_416_roi_trt_a100.pth. However, the recommended way is to convert the Torch checkpoint to a TensorRT engine optimized for your own environment/hardware:

PYTHONPATH=$(pwd)/src/roi_det/YOLOX:$PYTHONPATH python3 src/roi_det/YOLOX/tools/trt.py \
    -expn trained_yolox_nano_416_to_tensorrt \
    -f src/roi_det/YOLOX/exps/projects/rsna/yolox_nano_bre_416.py \
    -c assets/trained/yolox_nano_416_roi_torch.pth \
    --save-path assets/trained/yolox_nano_416_roi_trt.pth \
    -b 1

Behaviors:

5.1.2. Convert trained 4 x Convnext-small models to TensorRT

PYTHONPATH=$(pwd)/src/pytorch-image-models/:$PYTHONPATH python3 src/tools/convert_convnext_tensorrt.py --mode trained

Behaviours: Saves a 4-fold combined TensorRT engine to ./assets/trained/best_ensemble_convnext_small_batch2_fp32.engine.

The conversion takes 5-10 minutes on Kaggle's P100 GPU, but about 1 hour on an A100 GPU (my case).
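How the 4 folds are combined inside the engine is not spelled out in this section. One common way to combine fold models, assumed here purely for illustration, is to average their per-image probabilities. The function name and input layout are hypothetical.

```python
def average_fold_probabilities(fold_probs):
    """Average per-image probabilities across folds.

    fold_probs: list with one entry per fold; each entry is a list holding
    one probability per image, in the same image order for every fold.
    """
    if not fold_probs:
        raise ValueError("need at least one fold")
    n_images = len(fold_probs[0])
    if any(len(probs) != n_images for probs in fold_probs):
        raise ValueError("all folds must score the same images")
    n_folds = len(fold_probs)
    return [sum(probs[i] for probs in fold_probs) / n_folds for i in range(n_images)]
```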

5.1.3. Submission

PYTHONPATH=$(pwd)/src/pytorch-image-models/:$PYTHONPATH python3 src/submit/submit.py --mode trained --trt

Behaviours:


5.2. Keep trained YOLOX, re-train Convnext-small classification models

5.2.1. Convert trained YOLOX to TensorRT

A YOLOX-nano 416 engine optimized for NVIDIA A100 is provided at assets/trained/yolox_nano_416_roi_trt_a100.pth. However, the recommended way is to convert the Torch checkpoint to a TensorRT engine optimized for your own environment/hardware:

PYTHONPATH=$(pwd)/src/roi_det/YOLOX:$PYTHONPATH python3 src/roi_det/YOLOX/tools/trt.py \
    -expn trained_yolox_nano_416_to_tensorrt \
    -f src/roi_det/YOLOX/exps/projects/rsna/yolox_nano_bre_416.py \
    -c assets/trained/yolox_nano_416_roi_torch.pth \
    --save-path assets/trained/yolox_nano_416_roi_trt.pth \
    -b 1

Behaviors:

5.2.2. Prepare datasets to train classification models

python3 src/tools/prepair_classification_dataset.py --num-workers 8 --roi-yolox-engine-path assets/trained/yolox_nano_416_roi_trt.pth
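The "ROI crop --> normalization" part of this step can be sketched in plain Python. This is an assumption-heavy illustration, not the script's actual logic: the image is a 2D list of pixel values, the box is given in integer pixels, and min-max scaling to 0..255 stands in for whatever normalization the real pipeline applies.

```python
def crop_and_normalize(image, box):
    """Crop box=(x_min, y_min, x_max, y_max) from a 2D list and rescale to 0..255.

    The box is clipped to the image bounds; x_max and y_max are exclusive.
    Min-max scaling is an assumption made for this sketch.
    """
    height, width = len(image), len(image[0])
    x0, y0 = max(0, box[0]), max(0, box[1])
    x1, y1 = min(width, box[2]), min(height, box[3])
    if x0 >= x1 or y0 >= y1:
        raise ValueError("box does not overlap the image")
    roi = [row[x0:x1] for row in image[y0:y1]]
    lo = min(min(row) for row in roi)
    hi = max(max(row) for row in roi)
    scale = 255.0 / (hi - lo) if hi > lo else 0.0
    return [[int(round((value - lo) * scale)) for value in row] for row in roi]
```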

Behaviors:

5.2.3. Perform 4-folds splitting on competition data

python3 src/tools/cv_split.py

Behaviors: Creates a new directory and saves CSV files in {PROCESSED_DATA_DIR}/rsna-breast-cancer-detection/cv/v2/
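Placeholders such as {PROCESSED_DATA_DIR} refer to base paths defined in SETTINGS.json. A minimal sketch of how such a template could be resolved is shown below; the helper names are hypothetical, the example value is made up, and the repository's own settings loader may work differently.

```python
import json
from pathlib import Path


def load_settings(settings_path="SETTINGS.json"):
    """Read the base paths from a JSON settings file into a dict."""
    with open(settings_path, "r", encoding="utf-8") as f:
        return json.load(f)


def resolve_path(template, settings):
    """Fill placeholders such as {PROCESSED_DATA_DIR} from the settings dict."""
    return Path(template.format(**settings))


if __name__ == "__main__":
    # Hypothetical value; the real one comes from your SETTINGS.json.
    settings = {"PROCESSED_DATA_DIR": "/data/processed"}
    print(resolve_path("{PROCESSED_DATA_DIR}/rsna-breast-cancer-detection/cv/v2/", settings))
```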

5.2.4. Training 4 x Convnext-small classification models

python3 src/tools/make_train_bash_script.py --mode fully_reproduce

This saves a file named _train_script_auto_generated.sh in the current directory, which contains the commands and instructions to train the Convnext-small classification models. To reproduce on a single GPU, simply run

sh ./_train_script_auto_generated.sh

This could take 8 days to finish training (around 2 days for each fold).

Alternatively, if you have multiple GPUs and want to speed up training, follow the instructions in the generated train script _train_script_auto_generated.sh and run each command in parallel on a different GPU. For more details on the training process, see part 4.3 (Training) of my write-up.
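Running one command per GPU can be scripted. The sketch below pins each command to one device through the CUDA_VISIBLE_DEVICES environment variable and waits for all of them. The function names are hypothetical, and the commands themselves must be copied from the generated script; none are invented here.

```python
import os
import subprocess


def plan_gpu_jobs(commands, gpu_ids):
    """Pair each command with its own GPU id (one command per GPU)."""
    if len(commands) > len(gpu_ids):
        raise ValueError("need at least one GPU per command")
    return list(zip(gpu_ids, commands))


def launch_jobs(jobs):
    """Start every job at once, each restricted to its GPU, and wait for all.

    Returns the exit codes in the same order as the jobs.
    """
    procs = []
    for gpu_id, command in jobs:
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu_id))
        procs.append(subprocess.Popen(command, shell=True, env=env))
    return [proc.wait() for proc in procs]
```

With 4 GPUs, `launch_jobs(plan_gpu_jobs(fold_commands, [0, 1, 2, 3]))` would run the four fold commands side by side, where `fold_commands` is a list of command strings taken from the generated script.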

Behaviours:

5.2.5. Checkpoints selection

python3 src/tools/select_classification_best_ckpts.py --mode fully_reproduce
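The selection criterion used by this script is not described in this section. As a generic illustration only, the sketch below keeps the checkpoint with the highest validation score in each fold; the record layout ('fold', 'checkpoint', 'score') and the function name are hypothetical.

```python
def select_best_checkpoints(records):
    """Pick the highest-scoring checkpoint per fold.

    records: iterable of dicts with keys 'fold', 'checkpoint', 'score'.
    Returns {fold: checkpoint}, ordered by fold.
    """
    best = {}
    for record in records:
        fold = record["fold"]
        if fold not in best or record["score"] > best[fold]["score"]:
            best[fold] = record
    return {fold: record["checkpoint"] for fold, record in sorted(best.items())}
```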

Behaviours:

5.2.6. Convert selected best Convnext models to TensorRT

PYTHONPATH=$(pwd)/src/pytorch-image-models/:$PYTHONPATH python3 src/tools/convert_convnext_tensorrt.py --mode reproduce

Behaviours: Saves a 4-fold combined TensorRT engine to {MODEL_FINAL_SELECTION_DIR}/best_ensemble_convnext_small_batch2_fp32.engine.

The conversion takes 5-10 minutes on Kaggle's P100 GPU, but about 1 hour on an A100 GPU (my case).

5.2.7. Submission

PYTHONPATH=$(pwd)/src/pytorch-image-models/:$PYTHONPATH python3 src/submit/submit.py --mode partial_reproduce --trt

Behaviours:


5.3. Re-train all parts from scratch

5.3.1. Prepare dataset for training YOLOX ROI detector

python3 src/tools/prepair_roi_det_dataset.py --num-workers 4

Behaviors:

5.3.2. Retrain YOLOX for breast ROI detection

sh src/tools/train_and_convert_yolox_trt.sh

Behaviors:

5.3.3. Prepare datasets to train classification models

This uses the YOLOX model newly trained in the previous step as the breast ROI extractor.

python3 src/tools/prepair_classification_dataset.py --num-workers 8

Behaviors:

5.3.4. Perform 4-folds splitting on competition data

python3 src/tools/cv_split.py

Behaviors: Creates a new directory and saves CSV files in {PROCESSED_DATA_DIR}/rsna-breast-cancer-detection/cv/v2/
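For readers unfamiliar with fold splitting, here is a minimal sketch of a deterministic 4-fold assignment. It assumes folds are assigned per patient so that one patient's images never land in two folds, and it does no stratification by label; both are assumptions of this sketch, and cv_split.py may do it differently. The function name and seed are hypothetical.

```python
import random


def assign_folds(patient_ids, n_folds=4, seed=42):
    """Map each unique patient id to a fold index in [0, n_folds).

    Ids are sorted before shuffling with a fixed seed, so the result does
    not depend on the input order.
    """
    unique_ids = sorted(set(patient_ids))
    rng = random.Random(seed)
    rng.shuffle(unique_ids)
    return {pid: index % n_folds for index, pid in enumerate(unique_ids)}
```

Every image row would then take the fold of its patient, and fold k serves as the validation set while the other three folds are used for training.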

5.3.5. Training 4 x Convnext-small classification models

python3 src/tools/make_train_bash_script.py --mode fully_reproduce

This saves a file named _train_script_auto_generated.sh in the current directory, which contains the commands and instructions to train the Convnext-small classification models. To reproduce on a single GPU, simply run

sh ./_train_script_auto_generated.sh

This could take 8 days to finish training (around 2 days for each fold).

Alternatively, if you have multiple GPUs and want to speed up training, follow the instructions in the generated train script _train_script_auto_generated.sh and run each command in parallel on a different GPU. For more details on the training process, see part 4.3 (Training) of my write-up.

Behaviours:

5.3.6. Checkpoints selection

python3 src/tools/select_classification_best_ckpts.py --mode fully_reproduce

Behaviours:

5.3.7. Convert selected best Convnext models to TensorRT

PYTHONPATH=$(pwd)/src/pytorch-image-models/:$PYTHONPATH python3 src/tools/convert_convnext_tensorrt.py --mode reproduce

Behaviours: Saves a 4-fold combined TensorRT engine to {MODEL_FINAL_SELECTION_DIR}/best_ensemble_convnext_small_batch2_fp32.engine.

The conversion takes 5-10 minutes on Kaggle's P100 GPU, but about 1 hour on an A100 GPU (my case).

5.3.8. Submission

PYTHONPATH=$(pwd)/src/pytorch-image-models/:$PYTHONPATH python3 src/submit/submit.py --mode reproduce --trt

Behaviours: