Boomwwe / SOTA_MSI_prediction

GNU General Public License v3.0

Method to split train/validation/test sets #3

Open 1803170327 opened 4 months ago

1803170327 commented 4 months ago

Hello! In training.py, a method named cv_pic_list() is used to split the datasets, but it is not provided. So I implemented it myself, referring to the method with the same name in Pretrain.py. I mixed all the tiles together and randomly split them into three parts, one for each dataset. I trained my model successfully and it performed well.

But this way of splitting the datasets seems odd, because I usually split at the patient level. For example, if I have 500 slides, I randomly select 60% (300 slides) for training, 20% (100 slides) for validation, and the rest for testing. Here, however, I split at the tile level: all tiles are mixed together, so different tiles from one patient may appear in the training, validation, and test sets at the same time. I suspect this amounts to cheating, since information from test patients leaks into training.
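To make the difference concrete, here is a minimal sketch of a patient-level split. It is not the repository's cv_pic_list(); the function name `split_by_patient` and the `tile_to_patient` mapping (tile path to patient ID) are assumptions made for illustration. The idea is to shuffle patients, not tiles, and then send every tile to the subset its patient was assigned to.

```python
import random


def split_by_patient(tile_to_patient, ratios=(0.6, 0.2, 0.2), seed=0):
    """Split tiles into train/val/test so that no patient spans two subsets.

    tile_to_patient: dict mapping a tile path to its patient ID (hypothetical layout).
    ratios: fractions of patients for train and val; the remainder goes to test.
    """
    # Shuffle the unique patient IDs, not the tiles themselves.
    patients = sorted(set(tile_to_patient.values()))
    random.Random(seed).shuffle(patients)

    n_train = round(len(patients) * ratios[0])
    n_val = round(len(patients) * ratios[1])

    # Record which subset each patient belongs to.
    subset_of = {}
    for i, patient in enumerate(patients):
        if i < n_train:
            subset_of[patient] = "train"
        elif i < n_train + n_val:
            subset_of[patient] = "val"
        else:
            subset_of[patient] = "test"

    # Every tile follows its patient.
    splits = {"train": [], "val": [], "test": []}
    for tile, patient in tile_to_patient.items():
        splits[subset_of[patient]].append(tile)
    return splits
```

With a tile-level shuffle, by contrast, the same patient can contribute tiles to all three lists, which is the leakage described above.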

From the paper, "Predicting microsatellite instability and key biomarkers in colorectal cancer from H&E-stained images: achieving state-of-the-art predictive performance with fewer data using Swin Transformer", I could not identify any other way of splitting the dataset. Maybe I just misunderstood the article.

Please tell me whether my method is correct. If not, please explain why this method is OK and is not cheating. My code is attached below: [image] [image]

Boomwwe commented 4 months ago

Splitting datasets at the patient level is right; the other method is wrong, and we don't use it. It works in Pretrain.py because that dataset had already been split and already consisted of tiles. Here is what we did: in training.py, I first randomly generate a dict holding the split of patients and save it, then use cv_pic_list to read it. It seems that some code is missing. I'll try to fix it. Thanks for the reminder.
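Until the missing code is restored, the workflow described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual cv_pic_list: the helper names (`make_patient_split`, `save_split`, `load_tiles_for_subset`), the JSON file format, and the `tile_root/<patient_id>/*.png` directory layout are all hypothetical.

```python
import json
import random
from pathlib import Path


def make_patient_split(patient_ids, ratios=(0.6, 0.2, 0.2), seed=0):
    """Randomly assign patient IDs to train/val/test (remainder goes to test)."""
    ids = sorted(patient_ids)
    random.Random(seed).shuffle(ids)
    n_train = round(len(ids) * ratios[0])
    n_val = round(len(ids) * ratios[1])
    return {
        "train": ids[:n_train],
        "val": ids[n_train:n_train + n_val],
        "test": ids[n_train + n_val:],
    }


def save_split(split, path):
    """Persist the split so every later run reuses the same patients."""
    Path(path).write_text(json.dumps(split, indent=2))


def load_tiles_for_subset(split_path, tile_root, subset):
    """Read the saved split and list the tiles of one subset.

    Assumes tiles are stored as tile_root/<patient_id>/*.png (hypothetical layout).
    """
    split = json.loads(Path(split_path).read_text())
    tiles = []
    for patient_id in split[subset]:
        patient_dir = Path(tile_root) / patient_id
        tiles.extend(sorted(str(p) for p in patient_dir.glob("*.png")))
    return tiles
```

Saving the split once and reading it back keeps the patient assignment fixed across pretraining, training, and evaluation, so a test patient never ends up in the training set on a later run.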

1803170327 commented 4 months ago

Thanks!!!