Ysc-shark opened this issue 8 months ago
Sometimes the training process seems reasonable
I incorporated training/testing into the same pipeline in the latest commit. You can set --eval_scheme=5-fold-cv-standalone-test, which will perform a train/valid/test split like this:
A standalone test set consisting of 20% of the samples is reserved, and the remaining 80% of the samples are used for a 5-fold cross-validation. For each fold, the best model and the corresponding threshold are saved. After the 5-fold cross-validation, the 5 best models and their optimal thresholds are used to run inference on the reserved test set; the final prediction for a test sample is the majority vote of the 5 models. For binary classification, accuracy and balanced accuracy are computed; for multi-label classification, hamming loss (smaller is better) and subset accuracy are computed.
You can also simply run a 5-fold CV with --eval_scheme=5-fold-cv.
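For illustration, here is a minimal sketch of the majority-vote inference described above (binary case). It assumes each saved model returns a single bag-level logit; models, thresholds, and bag_feats are placeholder names, not the repository's exact interface.

import torch

# Sketch: majority vote over the 5 best fold models for one test slide.
# `models` and `thresholds` come from the 5-fold CV; `bag_feats` holds the
# instance features of a single bag. The single-logit output is an assumption.
def majority_vote_predict(models, thresholds, bag_feats):
    votes = []
    with torch.no_grad():
        for model, thr in zip(models, thresholds):
            model.eval()
            prob = torch.sigmoid(model(bag_feats)).item()
            votes.append(int(prob >= thr))
    # Positive only if at least 3 of the 5 models vote positive.
    return int(sum(votes) > len(votes) // 2)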
There were some issues with the testing script when loading pretrained weights (i.e., sometimes the weights are not fully loaded or some weights are missing; loading with strict=False and inspecting the returned keys can reveal the problem). I will fix this in a couple of days.
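In the meantime, the mismatch can be checked like this; this is standard PyTorch behaviour, and 'checkpoint.pth' is only a placeholder path for whichever weights file is being loaded:

import torch

# Load the checkpoint non-strictly and print which keys did not match.
# `milnet` stands for the model instance built by the testing script.
state_dict = torch.load('checkpoint.pth', map_location='cpu')
result = milnet.load_state_dict(state_dict, strict=False)
print('missing keys:   ', result.missing_keys)      # expected by the model, absent from the file
print('unexpected keys:', result.unexpected_keys)   # present in the file, unused by the model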
Hi, thanks for sharing your great work; it has helped me a lot.
I am trying to replicate the results of DSMIL and make some modifications on top of it. However, I have encountered some problems when training on the Camelyon16 dataset. I would greatly appreciate any guidance and advice, as my experience with training deep learning models and analyzing pathology images is limited.
Firstly, I randomly divided the official training set into 5 folds with an even distribution of labels and then ran 5-fold cross-validation: for each split, the model was trained on 4 folds, with the remaining fold serving as the validation set, and the best model was then used to predict on the official test set. The final results of the 5-fold cross-validation were averaged. Is this approach correct?
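For reference, a split with an even label distribution like the one described above can be produced with scikit-learn; bag_labels is a placeholder array of per-slide labels, not a variable from the DSMIL code.

import numpy as np
from sklearn.model_selection import StratifiedKFold

# Sketch: 5 label-stratified folds over the official training slides.
# `bag_labels` is a placeholder: one binary label (tumor / normal) per slide.
bag_labels = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, valid_idx) in enumerate(skf.split(np.zeros(len(bag_labels)), bag_labels)):
    # Train on the 4 folds in train_idx, validate on valid_idx, then evaluate
    # the best checkpoint of this fold on the official Camelyon16 test set.
    print(f'fold {fold}: {len(train_idx)} train bags, {len(valid_idx)} valid bags')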
Secondly, when training an MIL model of my own design, I sometimes experience unstable training on some folds, or even a steadily increasing val_loss, as shown in the figure below. However, when I select the model with the highest AUC on the validation set, the final results on the test set are not too bad. I wonder whether these training issues arise from a flaw in my model design or from improper training parameter settings. I have adopted the training parameter settings from the DSMIL code:
criterion = nn.BCEWithLogitsLoss()  # bag-level binary cross-entropy on logits
optimizer = torch.optim.Adam(milnet.parameters(), lr=0.0001, betas=(0.5, 0.9), weight_decay=5e-3)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=200, eta_min=0.000005)  # anneal over 200 epochs
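One thing worth checking for the unstable folds is that checkpoint selection really tracks validation AUC rather than the rising validation loss; a minimal sketch of that selection loop follows, where train_one_epoch, evaluate, train_loader, and valid_loader are placeholder helpers, not functions from the DSMIL code.

import copy
from sklearn.metrics import roc_auc_score

# Sketch: per-fold model selection by validation AUC (not by val_loss).
best_auc, best_state = 0.0, None
for epoch in range(200):
    train_one_epoch(milnet, train_loader, criterion, optimizer)   # placeholder training step
    scheduler.step()
    labels, scores = evaluate(milnet, valid_loader)                # placeholder: returns true labels and predicted scores
    auc = roc_auc_score(labels, scores)
    if auc > best_auc:
        best_auc = auc
        best_state = copy.deepcopy(milnet.state_dict())            # keep the best-AUC weights
# Restore the best-AUC checkpoint before predicting on the official test set.
milnet.load_state_dict(best_state)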
Third, when training with the Camelyon16 features downloaded via the provided script, I achieve an exceptionally high AUC of 0.97, as mentioned in #49. I then downloaded the 'c16-multiscale-features' provided by the authors and used the first 512 dimensions of the 1024-dimensional features as single-scale features for training my model, but encountered several issues. (1) DSMIL achieved AUCs of 0.77 and 0.81 on multi-scale and single-scale features, respectively, far below the results reported in the paper. (2) Regardless of the model, the AUC on single-scale features is always higher than on multi-scale features. (3) My MIL model's AUC ranges from 0.84 to 0.87 on single-scale features and from 0.81 to 0.84 on multi-scale features. I wonder if anyone else has encountered similar issues, or whether it is a problem on my end.
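For completeness, the slicing described above amounts to something like the following; 'slide_feats.csv' is a placeholder file name, and it is only an assumption that the first 512 of the 1024 dimensions belong to a single magnification.

import pandas as pd

# Sketch: take the first half of the 1024-d multiscale feature vectors as
# single-scale features ('slide_feats.csv' is a placeholder per-slide file).
feats = pd.read_csv('slide_feats.csv', header=None).values   # shape: (num_patches, 1024)
single_scale_feats = feats[:, :512]                           # first 512 dimensions only
multi_scale_feats = feats                                     # full multiscale concatenation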