Hi, I have the same issue here. It seems that HistomicsTK is not working at the moment, and oddly, it has been down for at least three days now. @kheffah, can you please confirm whether this is the issue? Thanks.
Dear @player1321 and @mostafajahanifar, Thank you for raising this issue. It seems the Kitware server has been down a few times lately. To avoid disruptions, the dataset can now also be downloaded directly at 0.25 MPP using this direct link.
That one is also color normalized. Wonderful! Thanks for the prompt reply, Mohamed.
Sorry @kheffah, I understand you have closed this issue, but could you kindly provide information on the train/validation/test splits as well?
@mostafajahanifar You are more than welcome. I like to separate the training and testing sets by hospital to better reflect the model's external generalization. The train/test split used for the model in our paper is discussed here. Recently, I've switched to internal-external cross-validation, where the hospitals that constitute the testing set are rotated to provide some variance around the accuracy metric -- for example, see how we split the train/test sets for the NuCLS paper here. Note that some hospitals have more slides than others; in my recent projects, I make sure each fold has at least one "big" hospital, i.e., one with at least 9 slides. Also note that the slide name encodes the hospital, so the slide TCGA-E2-A14X-DX1, for example, comes from hospital E2.
I hope this answers your question. Let me know if you need any clarifications.
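For anyone wanting to reproduce this kind of split, here is a minimal sketch (my own illustration, not the code used in the paper) that extracts the hospital code from the TCGA slide name and holds out whole hospitals for testing. The slide names in the example and the `split_by_hospital` helper are hypothetical.

```python
from collections import defaultdict

def hospital_of(slide_name):
    # TCGA-E2-A14X-DX1 -> "E2": the second barcode field identifies the
    # tissue source site, i.e. the contributing hospital.
    return slide_name.split("-")[1]

def split_by_hospital(slide_names, test_hospitals):
    """Hospital-level split: every slide from a test hospital goes to the
    test set, so train and test never share a hospital."""
    slides_per_hospital = defaultdict(list)
    for name in slide_names:
        slides_per_hospital[hospital_of(name)].append(name)

    train, test = [], []
    for hospital, slides in slides_per_hospital.items():
        (test if hospital in test_hospitals else train).extend(slides)
    return train, test

# Example with made-up slide names: hold out hospital E2 as the test set.
slides = ["TCGA-E2-A14X-DX1", "TCGA-E2-A15A-DX1", "TCGA-A2-A0T2-DX1"]
train, test = split_by_hospital(slides, test_hospitals={"E2"})
print(train)  # ['TCGA-A2-A0T2-DX1']
print(test)   # ['TCGA-E2-A14X-DX1', 'TCGA-E2-A15A-DX1']
```

For internal-external cross-validation as described above, you would call this once per fold, each time passing a different set of hospitals as `test_hospitals` and making sure each fold contains at least one "big" hospital.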
Thank you very much @kheffah for your detailed explanations. I could not agree more with the new internal-external cross-validation scheme you have adopted. However, for this particular purpose I want to compare my model's performance with your baseline, for which you directed me to the relevant information. Again, I appreciate your help.
@mostafajahanifar You're more than welcome. Let me know if you need anything else.
@player1321 and @kheffah, just to let you know, I have contacted the Kitware people and they have fixed the problem with the website. So it should now be fine to use the code for dataset extraction.
@mostafajahanifar Thank you! That was very nice of you.
Hello,
I ran
python download_crowdsource_dataset.py
but got a 502 Bad Gateway error. Is the data not available now?
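In case it helps while the server is flaky, here is a rough sketch of a retry-with-backoff workaround for transient 502s. The URL and helper below are placeholders, not the actual endpoint or logic used by download_crowdsource_dataset.py.

```python
# Sketch: retry a download a few times with exponential backoff when the
# server answers with a transient error such as 502 Bad Gateway.
import time
import requests  # third-party: pip install requests

DOWNLOAD_URL = "https://example.org/dataset"  # placeholder, not the real endpoint

def download_with_retries(url, attempts=5, backoff=10):
    for i in range(attempts):
        resp = requests.get(url)
        if resp.ok:
            return resp.content
        print(f"Attempt {i + 1} failed with HTTP {resp.status_code}; retrying...")
        time.sleep(backoff * (2 ** i))
    raise RuntimeError(f"Server still unavailable after {attempts} attempts")
```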