dewenzeng / positional_cl

Code for the paper Positional Contrastive Learning for Volumetric Medical Image Segmentation

cannot reproduce the reported results #3

Closed licaizi closed 2 years ago

licaizi commented 2 years ago

There is a very large gap between the reported results and mine. Did I miss any important tricks? The details of my experiments are as follows.

1. Fix two bugs: (screenshot of the two code fixes was attached)

2. Prepare the ACDC dataset via generate_acdc.py.

3. Prepare the running scripts:

(1) From scratch: python train_supervised.py --device cuda:0 --batch_size 10 --epochs 200 --data_dir ./dir_for_labeled_data --lr 5e-4 --min_lr 5e-6 --dataset acdc --patch_size 352 352 --experiment_name supervised_acdc_random_sample6 --initial_filter_size 48 --classes 4 --enable_few_data --sampling_k 6;

(2) Contrastive learning (a rough sketch of what --slice_threshold and --temp control follows after these commands): python train_contrast.py --device cuda:0 --batch_size 32 --epochs 300 --data_dir ./dir_for_unlabeled_data --lr 0.01 --do_contrast --dataset acdc --patch_size 352 352 --experiment_name contrast_acdc_pcl_temp01thresh035 --slice_threshold 0.35 --temp 0.1 --initial_filter_size 48 --classes 512 --contrastive_method pcl;

python train_contrast.py --device cuda:0 --batch_size 32 --epochs 300 --data_dir ./dir_for_unlabeled_data --lr 0.01 --do_contrast --dataset acdc --patch_size 352 352 --experiment_name contrast_acdc_gcl_temp01thresh035 --slice_threshold 0.35 --temp 0.1 --initial_filter_size 48 --classes 512 --contrastive_method gcl;

python train_contrast.py --device cuda:0 --batch_size 32 --epochs 300 --data_dir ./dir_for_unlabeled_data --lr 0.01 --do_contrast --dataset acdc --patch_size 352 352 --experiment_name contrast_acdc_simclr_temp01thresh035 --slice_threshold 0.35 --temp 0.1 --initial_filter_size 48 --classes 512 --contrastive_method simclr;

(3) Fine-tuning: python train_supervised.py --device cuda:0 --batch_size 10 --epochs 100 --data_dir ./dir_for_labeled_data --lr 5e-5 --min_lr 5e-6 --dataset acdc --patch_size 352 352 --experiment_name supervised_acdc_simclr_sample6 --initial_filter_size 48 --classes 4 --enable_few_data --sampling_k 6 --restart --pretrained_model_path ./results/contrast_acdc_simclr_temp01_thresh035_2021-12-05_09-43-38/model/latest.pth;

python train_supervised.py --device cuda:1 --batch_size 10 --epochs 100 --data_dir ./dir_for_labeled_data --lr 5e-5 --min_lr 5e-6 --dataset acdc --patch_size 352 352 --experiment_name supervised_acdc_gcl_sample6 --initial_filter_size 48 --classes 4 --enable_few_data --sampling_k 6 --restart --pretrained_model_path ./results/contrast_acdc_gcl_temp01_thresh035_2021-12-04_03-46-35/model/latest.pth;

python train_supervised.py --device cuda:1 --batch_size 10 --epochs 100 --data_dir ./dir_for_labeled_data --lr 5e-5 --min_lr 5e-6 --dataset acdc --patch_size 352 352 --experiment_name supervised_acdc_pcl_sample6 --initial_filter_size 48 --classes 4 --enable_few_data --sampling_k 6 --restart --pretrained_model_path ./results/contrast_acdc_pcl_temp01_thresh035_2021-12-02_21-48-13/model/latest.pth;
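As an aside on step 3(2): below is a minimal, self-contained sketch of a positional-style contrastive loss, assuming that slices count as positives when their normalized positions within their volumes differ by less than a threshold (cf. --slice_threshold) and that similarities are scaled by a temperature (cf. --temp). The function and variable names are illustrative only; this is not the repository's actual implementation.

```python
# Illustrative sketch of a positional contrastive loss (not the repository's code).
import torch
import torch.nn.functional as F

def positional_contrastive_loss(features, positions, threshold=0.35, temp=0.1):
    """features: (N, D) slice embeddings; positions: (N,) normalized slice positions in [0, 1]."""
    features = F.normalize(features, dim=1)
    sim = features @ features.t() / temp                      # (N, N) temperature-scaled similarities
    pos_diff = (positions[:, None] - positions[None, :]).abs()
    pos_mask = (pos_diff < threshold).float()                  # positives: slices at nearby positions
    self_mask = torch.eye(len(features), device=features.device)
    pos_mask = pos_mask * (1 - self_mask)                      # exclude self-pairs from the positives
    logits = sim - 1e9 * self_mask                             # mask out self-similarity in the softmax
    log_prob = F.log_softmax(logits, dim=1)
    denom = pos_mask.sum(dim=1).clamp(min=1)
    loss = -(pos_mask * log_prob).sum(dim=1) / denom           # average over each anchor's positive set
    return loss.mean()

# toy usage with random embeddings and random slice positions
feats = torch.randn(32, 512)
pos = torch.rand(32)
print(positional_contrastive_loss(feats, pos).item())
```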

4. Experimental results (Ubuntu 16.04, PyTorch 1.9, NVIDIA 2080Ti * 2, Dice metric, taking sample_k=6 as an example): (results screenshot was attached)

dewenzeng commented 2 years ago

@CaiziLee Thanks for pointing out the bugs. I guess the batchgenerators package has been updated and some of its functions have been moved to other places. It looks like none of the pre-trained models are working in your case; have you checked the contrastive learning loss? My suggestion is to use a larger initial learning rate for contrastive learning, such as 0.1. Also, during fine-tuning, set the learning rate to the same value as training from scratch; starting from 5e-5 seems too small. Here are a contrastive learning loss example and a fine-tuning result example of mine: (screenshots of the loss curve and fine-tuning results were attached)
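Applying these suggestions to the commands above roughly means changing only the learning-rate flags (everything else stays as you posted it):

contrastive pre-training: python train_contrast.py ... --lr 0.1 ... (instead of --lr 0.01)

fine-tuning: python train_supervised.py ... --lr 5e-4 --min_lr 5e-6 ... (instead of --lr 5e-5, i.e. the same learning rate as training from scratch)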

licaizi commented 2 years ago

Hi, thanks for your reply. Resetting the fine-tuning learning rate works; my best mean Dice over 5 folds is 0.8421 (0.025). I think that's within normal variation. Thanks again for your wonderful work.

licaizi commented 2 years ago

But with only the labeled dataset (100 patients), the best mean Dice (sample_k=6) is 0.7858, which shows no improvement over the baseline (0.7883).

1. It seems the dataset scale has a large impact on performance.
2. Considering you used all data, including the test data, isn't that to some extent data leakage? Even though you did not use the manual labels during pre-training, the test data has already been used in the pre-training stage with self-defined labels.

dewenzeng commented 2 years ago

> But with only the labeled dataset (100 patients), the best mean Dice (sample_k=6) is 0.7858, which shows no improvement over the baseline (0.7883). 1. It seems the dataset scale has a large impact on performance. 2. Considering you used all data, including the test data, isn't that to some extent data leakage? Even though you did not use the manual labels during pre-training, the test data has already been used in the pre-training stage with self-defined labels.

@CaiziLee For your questions:

  1. Yes, contrastive learning does rely on the dataset scale. I remember seeing a small improvement over the vanilla baseline when using only the labeled ACDC data; maybe some hyperparameters are not the same. I would have to rerun some of the experiments to check.
  2. Yes, there is probably some data leakage. I think a better way to evaluate this is to train CL on the training set and test only on the test set. Or, if using cross-validation, it's better to pre-train a CL model on each cross-validation partition, although that makes things a little more complicated. But in any case, the baselines use the same data as PCL, so I guess the comparison still says something. Also, the transfer learning results do not have this problem. A rough sketch of such a leakage-free partition setup is given below.
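To make the cross-validation option concrete, here is a minimal sketch (not code from this repository; the patient-ID format and split seed are made up) of building per-fold partitions so that the contrastive pre-training for each fold only sees that fold's training patients:

```python
# Hypothetical sketch: leakage-free cross-validation partitions for CL pre-training.
from sklearn.model_selection import KFold

# ACDC has 100 patients; the ID naming here is only illustrative.
patient_ids = [f"patient{i:03d}" for i in range(1, 101)]

kf = KFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(kf.split(patient_ids)):
    train_patients = [patient_ids[i] for i in train_idx]
    test_patients = [patient_ids[i] for i in test_idx]
    # 1) contrastive pre-training (train_contrast.py) on train_patients only
    # 2) fine-tuning (train_supervised.py) on the labeled subset of train_patients
    # 3) evaluate Dice on test_patients, which are never seen during pre-training
    print(f"fold {fold}: {len(train_patients)} train / {len(test_patients)} test patients")
```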