binli123 / dsmil-wsi

DSMIL: Dual-stream multiple instance learning networks for tumor detection in Whole Slide Image
MIT License

Strange score for TCGA #61

Closed: xiaozhu0816 closed this issue 3 months ago

xiaozhu0816 commented 1 year ago
> Dear Bin, thank you for your great work!
>
> - When I reproduce the results on Camelyon16 and TCGA, I follow the provided README: 1) use the pre-computed features from "Download feature vectors for MIL network" ($ python download.py --dataset=tcga/c16); 2) train the model with all hyperparameters at their defaults ($ python train_tcga.py --dataset=TCGA-lung-default or $ python train_tcga.py --dataset=Camelyon16 --num_classes=1). For Camelyon16 I see only a mild degradation in accuracy, to about 91%, unlike #54 (Problem of reproduce Camelyon16 result), which reported only 60%. However, I did find that every patch produces the same attention score, as in #54 (a quick diagnostic sketch for this symptom follows this quote). For TCGA, the same attention score also appears, yet with quite promising results (e.g., train loss: 0.3307, test loss: 0.3239, average score: 0.9000, AUC: class-0 0.9715089374829871, class-1 0.9658833136738953). On Camelyon16 the identical-attention-score problem can sometimes be fixed by restarting training with init.pth loaded, but on TCGA it is never fixed. How should I deal with it?
> - When I apply the provided pre-trained aggregator (./test/weights/aggregator.pth or ./test-c16/weights/aggregator.pth) to the test set of pre-computed features from "Download feature vectors for MIL network" ($ python download.py --dataset=tcga/c16), I get reasonable results on Camelyon16 (average score: 0.9125, AUC: class-0 0.9546666666666667) but unreasonable ones on TCGA (average score: 0.6857, AUC: class-0 0.8621722166772525, class-1 0.8949278649850286). I wonder whether these pre-trained aggregators only work with the provided embedders (./test/weights/embedder.pth or ./test-c16/weights/embedder.pth) rather than with the pre-computed features? In other words, were the pre-computed features not generated by these pre-trained embedders?
>
> Looking forward to your help! Best, Tiancheng Lin
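As a quick check on the identical-attention-score symptom, here is a minimal diagnostic sketch (not from the repo); A stands in for the per-bag attention tensor returned by the DSMIL aggregator:

```python
import torch

# Hedged diagnostic sketch: if training has collapsed so that every patch in a
# bag receives the same attention score, the per-bag standard deviation of the
# scores is ~0. Replace `A` with the real attention tensor for one bag.
A = torch.rand(1000, 1)  # stand-in for (num_patches, 1) attention scores
print(f"attention std: {A.std().item():.6f}")  # ~0.0 indicates collapse
```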

Hi @HHHedo and @binli123, I have the same question as @HHHedo. I am focusing on the TCGA part now and followed the instructions: 1) use the pre-computed features from "Download feature vectors for MIL network" ($ python download.py --dataset=tcga); 2) train the model with all hyperparameters at their defaults ($ python train_tcga.py --dataset=TCGA-lung-default). For TCGA I get the same attention score as @HHHedo, and I don't understand why the score is already so high at the first epoch. You can see my screenshots.

[Screenshot, 2022-11-03 8:40 PM: training log of the first epochs]

... and after the 3rd epoch, no better model is ever saved. That confuses me a lot.

[Screenshot, 2022-11-03 8:40 PM: training log of the later epochs]

Could you tell me why and how to fix it? Thank you very much.

Originally posted by @xiaozhu0816 in https://github.com/binli123/dsmil-wsi/issues/59#issuecomment-1302047914

HHHedo commented 1 year ago

You may try a smaller weight decay.
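For context, a minimal sketch of what a smaller weight decay looks like when the optimizer is built; the model and the exact values here are placeholders, and the real settings in train_tcga.py may differ:

```python
import torch
import torch.nn as nn

# Hedged sketch: `milnet` is a stand-in for the DSMIL aggregator built in
# train_tcga.py. A smaller weight_decay (e.g., 1e-4 instead of something on
# the order of 5e-3) weakens the L2 pull toward zero weights that can flatten
# the attention scores.
milnet = nn.Linear(512, 2)  # placeholder model
optimizer = torch.optim.Adam(milnet.parameters(), lr=2e-4, weight_decay=1e-4)
```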

xiaozhu0816 commented 1 year ago

> You may try a smaller weight decay.

Thank you for your advice. But I'm not sure why the best results are obtained within the first few epochs. For TCGA, may I ask what your results were after you reduced the weight decay? What were the results in the first few epochs, and what was the best result after final convergence? I believe the average score here corresponds to the Accuracy in the paper, so they should be comparable?

Another question is how you tested. I remember you mentioned that you got test results. I followed the instructions: $ python download.py --dataset=tcga-test, then $ python test_crop_single.py --dataset=tcga and $ python testing_tcga.py. But in ./test/input there are only 6 WSIs for me to test, and testing_tcga.py only prints text output, not numerical results such as the average score.

Thank you very much!

binli123 commented 1 year ago

> But I'm not sure why the best results are obtained within the first few epochs. […] Another question is how you tested. […] In ./test/input there are only 6 WSIs for me to test, and testing_tcga.py only prints text output, not numerical results such as the average score.

For the TCGA dataset, the model converges very quickly because a large portion of the regions in the slides are positive. Regarding testing, you can copy the testing slides into the ./test/input folder. The main purpose of the testing script is to produce attention maps.
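A minimal sketch of that staging step, assuming the held-out slides sit in a hypothetical my_test_slides folder as .svs files:

```python
import shutil
from pathlib import Path

# Hedged sketch: copy held-out slides into ./test/input so the testing script
# can generate attention maps for them. `my_test_slides` is a hypothetical
# source folder; adjust the glob pattern to your slide format.
src, dst = Path("my_test_slides"), Path("test/input")
dst.mkdir(parents=True, exist_ok=True)
for slide in src.glob("*.svs"):
    shutil.copy(slide, dst / slide.name)
```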

xiaozhu0816 commented 1 year ago

> For the TCGA dataset, the model converges very quickly because a large portion of the regions in the slides are positive. Regarding testing, you can copy the testing slides into the ./test/input folder. The main purpose of the testing script is to produce attention maps.

Thank you for your answer. I also tried training on Camelyon16 with the same procedure I used for TCGA. It also converges very quickly, within the first 10 epochs, and avg_score reaches over 90%. Could you give me some advice?

What's more, I notice that in train_tcga.py you put all WSIs (for both Camelyon16 and TCGA) into the train/validation dataset, with no separate "TEST" dataset. That seems different from the paper, where you split the data into a training set and a test set. Am I right? Should I split the training set (270 Camelyon16 WSIs, 840 TCGA WSIs) into train and validation, and use the remaining (unseen) WSIs for testing?

binli123 commented 1 year ago

> Should I split the training set (270 Camelyon16 WSIs, 840 TCGA WSIs) into train and validation, and use the remaining (unseen) WSIs for testing?

Yes, you should hold out the testing set and evaluate on it after training. (Make sure you also exclude the testing data from the self-supervised training phase if you are using self-supervised pretraining; alternatively, you can use an ImageNet-pretrained CNN.)
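A minimal sketch of such a slide-level hold-out, with placeholder file names and labels standing in for the per-slide feature files used by train_tcga.py:

```python
from sklearn.model_selection import train_test_split

# Hedged sketch: reserve 20% of slides as an unseen test set before any
# training (including self-supervised pretraining). `bags` and `labels` are
# placeholders for the per-slide feature files and slide-level labels.
bags = [f"datasets/tcga/slide_{i}.csv" for i in range(840)]
labels = [i % 2 for i in range(840)]  # placeholder labels
train_val_bags, test_bags, train_val_y, test_y = train_test_split(
    bags, labels, test_size=0.2, stratify=labels, random_state=0
)
```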

binli123 commented 3 months ago

I incorporated training and testing into the same pipeline in the latest commit. This change allows you to read the evaluation results on a reserved test set. I also added a simple weight-initialization method that helps stabilize training. You can set --eval_scheme=5-fold-cv-standalone-test, which performs a train/valid/test evaluation like this:

A standalone test set consisting of 20% of the samples is reserved; the remaining 80% are used to build a 5-fold cross-validation. For each fold, the best model and its corresponding threshold are saved. After the cross-validation, the 5 best models, together with their optimal thresholds, are used to run inference on the reserved test set; the final prediction for a test sample is the majority vote of the 5 models. For binary classification, accuracy and balanced accuracy are computed. For multi-label classification, Hamming loss (smaller is better) and subset accuracy are computed.
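For illustration, a minimal sketch of the majority-vote step under that scheme, with random numbers standing in for the five fold-best models' scores and saved thresholds:

```python
import numpy as np

# Hedged sketch: each of the 5 fold-best models scores the reserved test set
# and applies its own saved threshold; the final label per sample is the
# majority vote of the 5 binary predictions. All values below are stand-ins.
rng = np.random.default_rng(0)
scores = rng.random((5, 200))                       # 5 models x 200 test samples
thresholds = np.array([0.50, 0.45, 0.55, 0.50, 0.60])
votes = (scores >= thresholds[:, None]).astype(int)
final_pred = (votes.sum(axis=0) >= 3).astype(int)   # at least 3 of 5 agree
```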

You can also simply run a 5-fold cross-validation with --eval_scheme=5-fold-cv.

There were some issues with the testing script when loading pretrained weights (i.e., sometimes the weights are not fully loaded, or some weights are missing; setting strict=False can reveal the problem). The purpose of the testing script is to generate the heatmap; you should now read the performance directly from the training script. I will fix the issues in a couple of days.
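A minimal sketch of how strict=False surfaces such problems; the placeholder model and checkpoint path are assumptions:

```python
import torch
import torch.nn as nn

# Hedged sketch: with strict=False, load_state_dict returns the missing and
# unexpected keys instead of raising on a key mismatch, which exposes
# partially loaded checkpoints. `milnet` is a placeholder for the real model.
milnet = nn.Linear(512, 2)
state_dict = torch.load("test/weights/aggregator.pth", map_location="cpu")
result = milnet.load_state_dict(state_dict, strict=False)
print("missing keys:", result.missing_keys)
print("unexpected keys:", result.unexpected_keys)
```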