HKU-MedAI / WSI-HGNN

[CVPR'23] Histopathology Whole Slide Image Analysis with Heterogeneous Graph Representation Learning

Inquiry about the origin of the datasets #8

Closed cjs6211 closed 7 months ago

cjs6211 commented 9 months ago

I would like to express my sincere appreciation for your remarkable achievement. I am truly impressed with the methods you have employed, and I am eager to replicate your work for application to other types of cancers. However, I have a crucial inquiry regarding the dataset mentioned in your paper.

Upon reviewing the TCGA data portal, I noticed the absence of "normal" cases in BRCA and COAD. Specifically, in the case of BRCA, there are only abnormal cases, such as ductal and lobular neoplasms, cystic, mucinous, and serous neoplasms, complex epithelial neoplasms, among others, totaling 1098 case IDs. Similarly, for COAD, there are only abnormal cases, including adenomas and adenocarcinomas, cystic, mucinous, and serous neoplasms, complex epithelial neoplasms, with a total of 461 case IDs.

The data in "typing_BRCA.txt" lacks information about normal cases, and there is no data related to COAD in the designated data folder. Additionally, the number of whole-slide images (WSI) does not align with the data currently accessible in the TCGA data portal. Therefore, it is essential for you to provide all case IDs and accurate labels used in your work to validate the robustness of your method.

Furthermore, it appears that you utilized all WSIs without distinguishing between diagnosis and tissue slides. I would like to bring to your attention that such an approach is not common in the field of pathology. It would be beneficial for you to clarify and justify this aspect of your methodology. I look forward to your prompt response and appreciate your understanding in addressing these concerns.

howardchanth commented 9 months ago

Hi, thank you for your appreciation of our work.

Since we are dealing with cancer subtyping, we focus on the cancer slides only, so the normal cases are excluded from this label file. We identify the normal cases as those with slides_tissue_normal in the sample_type column of the bipspeciman_sample file in the TCGA biomedical data sub-repository. For example, in TCGA-A6-2675-11A, the 11A suffix denotes a normal tissue slide, whereas 01A denotes a primary tumor (we label these as tumour slides). Please see #5 for more discussion.
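As a rough illustration of the barcode convention described above, the sketch below reads the sample-type code from a TCGA barcode and classifies it. It assumes the standard TCGA barcode layout, in which codes 01 to 09 denote tumor samples and 10 to 19 denote normal samples; the function names are hypothetical and are not taken from this repository.

```python
import re

# Assumed layout: TCGA-<site>-<participant>-<2-digit sample type><vial letter>...
# Longer slide barcodes (with portion/slide suffixes) also match, since we only
# anchor at the start of the string.
BARCODE_RE = re.compile(r"^TCGA-[A-Z0-9]{2}-[A-Z0-9]{4}-(\d{2})[A-Z]")


def sample_type_code(barcode: str) -> str:
    """Return the two-digit sample-type code, e.g. '11' for TCGA-A6-2675-11A."""
    match = BARCODE_RE.match(barcode)
    if match is None:
        raise ValueError(f"Not a sample-level TCGA barcode: {barcode!r}")
    return match.group(1)


def is_tumor(barcode: str) -> bool:
    """Tumor sample types use codes 01-09 (01 is primary solid tumor)."""
    return 1 <= int(sample_type_code(barcode)) <= 9


def is_normal(barcode: str) -> bool:
    """Normal sample types use codes 10-19 (11 is solid tissue normal)."""
    return 10 <= int(sample_type_code(barcode)) <= 19


if __name__ == "__main__":
    for barcode in ("TCGA-A6-2675-11A", "TCGA-A6-2675-01A"):
        label = "normal" if is_normal(barcode) else "tumor"
        print(barcode, "->", label)
```

A filter like `is_tumor` could be applied to slide barcodes to keep only tumour slides for subtyping; whether the label files here were built this way is not stated in the thread.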

We excluded some slides with a very small number of non-background patches (i.e., trivial cases). Hence the number of slides may be smaller than in the TCGA repository. However, all slides used in training are indexed by their respective case IDs.
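The exclusion step can be sketched as a simple count-based filter. This is only an illustration: the threshold value, the function name, and the slide IDs below are hypothetical, and the thread does not state the cutoff actually used.

```python
from typing import Dict, List

# Hypothetical cutoff; the actual value used in the paper is not given here.
MIN_PATCHES = 10


def keep_nontrivial_slides(
    patch_counts: Dict[str, int], min_patches: int = MIN_PATCHES
) -> List[str]:
    """Return slide IDs that have at least `min_patches` non-background patches."""
    return sorted(
        slide_id
        for slide_id, count in patch_counts.items()
        if count >= min_patches
    )


if __name__ == "__main__":
    # Made-up counts purely for demonstration.
    counts = {"slide_a": 250, "slide_b": 3, "slide_c": 10}
    print(keep_nontrivial_slides(counts))
```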

Thank you also for pointing out the difference. From a computer vision perspective, the diagnosis slides and tissue slides are similar. Hence, we anticipate that performance would not differ much if we split the slides into diagnosis and tissue slides and trained a separate model for each. Besides, all baseline methods are compared under the same setting, so the comparison is apples-to-apples.

I hope this addresses your concerns. Let me know if you have further questions, and thank you again for your interest in our work.