Closed Qing1Zhong closed 1 year ago
hi @Qing1Zhong,
I tried to reproduce the hs2p results on slide TCGA-2F-A9KO-01Z-00-DX1.195576CF-B739-4BD9-B15B-4A70AE287D3E
with the parameters that you shared. I indeed get 398 patches.
A good place to start debugging results is to have a look at the generated visualisation.
I couldn't see anything going really wrong based on the visualisation, so I manually checked some properties of the slide.
Unfortunately, the slide only has the following spacings: [0.228, 0.911, 3.643, 14.573]
Basically, the 0.5 spacing is missing.
The patches are extracted at the spacing which is the closest to the spacing value specified as parameter.
When specifying 0.50 as parameter, it will actually extract patches at 0.228, which is the closest slide spacing.
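To make the selection rule concrete, here is a minimal sketch of picking the closest spacing. The function name `closest_spacing` is hypothetical and this is not the actual hs2p code; it only assumes "closest" means the smallest absolute difference in mpp.

```python
def closest_spacing(level_spacings, target_spacing):
    """Return the available spacing (in mpp) nearest to the requested one.

    Assumption: "closest" means smallest absolute difference in mpp.
    """
    return min(level_spacings, key=lambda s: abs(s - target_spacing))


# Spacings reported above for the slide in question
spacings = [0.228, 0.911, 3.643, 14.573]
print(closest_spacing(spacings, 0.50))  # 0.228
```

With these values, 0.228 is 0.272 mpp away from the target while 0.911 is 0.411 mpp away, so 0.228 wins.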
As we work at 0.228 mpp, a (4096, 4096) pixel region covers an area of approximately 872 147 square microns.
If we had worked at 0.50 mpp instead, the same region would have covered an area of 4 194 304 square microns. That's approx. 5 times bigger (the exact factor is (0.50 / 0.228)^2 ≈ 4.8): one (4096, 4096) region extracted at 0.50 mpp is equivalent to about five (4096, 4096) regions extracted at 0.228 mpp. That's why we end up with (much) more regions than the HIPT authors.
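The arithmetic above can be checked with a few lines of Python. The helper name `region_area_um2` is made up for illustration and assumes square pixels.

```python
def region_area_um2(region_size_px, spacing_mpp):
    """Physical area (square microns) of a square region at a given spacing."""
    side_um = region_size_px * spacing_mpp  # side length in microns
    return side_um ** 2


area_native = region_area_um2(4096, 0.228)  # what was actually extracted
area_target = region_area_um2(4096, 0.50)   # what was requested

print(round(area_native))                   # 872147
print(round(area_target))                   # 4194304
print(round(area_target / area_native, 2))  # 4.81
```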
In theory, the following warning should have popped up when running hs2p:
WARNING! The closest natural spacing to the target spacing was more than 20.0% appart.
However, because of changes I introduced recently, the warning may not have popped up (I'll try to fix it).
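Until the warning is back, a check along these lines can be run by hand. This is only a sketch: the function names are hypothetical, and I am assuming the 20% is measured relative to the target spacing, which may not match exactly how hs2p computes it.

```python
def spacing_mismatch(target_spacing, natural_spacing):
    """Relative gap between the requested and the available spacing.

    Assumption: the gap is expressed as a fraction of the target spacing.
    """
    return abs(natural_spacing - target_spacing) / target_spacing


def is_within_tolerance(target_spacing, natural_spacing, tolerance=0.20):
    """True when the available spacing is close enough to the target."""
    return spacing_mismatch(target_spacing, natural_spacing) <= tolerance


# 0.228 mpp is about 54% away from 0.50 mpp, well above the 20% threshold
print(round(spacing_mismatch(0.50, 0.228), 3))  # 0.544
print(is_within_tolerance(0.50, 0.228))         # False
```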
What could be a solution? The easiest thing that comes to my mind is the following:
1- identify all slides that are missing the 0.5 spacing
2- extract (8192, 8192) regions at 0.228 mpp for these slides
3- resize them to (4096, 4096) to mimic the expected downsampling
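To get a feel for the numbers behind steps 2 and 3, here is a small sketch (hypothetical helper names, pure arithmetic, no slide I/O). It shows that 8192 is a convenient power-of-two approximation: after resizing to 4096 the effective spacing is about 0.456 mpp, whereas covering exactly the same area as a 0.50 mpp region would take roughly 8982 pixels at 0.228 mpp.

```python
def read_size_for_target(region_size_px, target_spacing, native_spacing):
    """Pixels to read at the native spacing so that, once resized to
    region_size_px, the region covers the same physical area as a
    region_size_px region at the target spacing."""
    return round(region_size_px * target_spacing / native_spacing)


def effective_spacing(read_size_px, native_spacing, output_size_px):
    """Spacing (mpp) obtained after resizing read_size_px to output_size_px."""
    return read_size_px * native_spacing / output_size_px


print(read_size_for_target(4096, 0.50, 0.228))       # 8982
print(round(effective_spacing(8192, 0.228, 4096), 3))  # 0.456
```

The resize itself can be done with any image library; which read size to use depends on how closely you want to match the 0.50 mpp regions.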
I didn't face this problem as I only tried to reproduce the TCGA-BRCA results (i.e. using breast slides).
It seems all TCGA-BRCA slides had a spacing close to 0.50 mpp.
Regarding your additional question: it's not clear to me either what happened to some slides. Someone raised a similar question on the official repo (https://github.com/mahmoodlab/HIPT/issues/6#issuecomment-1175787362). Here are some further possible explanations:
- as stated in the answer to the issue linked above, this could be due to patching irregularities: slides with insufficient tissue content for patching get excluded
- I don't think any slides were excluded because they were used for pretraining (and hence, in theory, could not be used for downstream training/tuning), as the authors state in the conclusion of their paper: "ViT256-16 pretraining performed on almost all of TCGA and evaluation lacking independent test cohorts"
Let me know if this answers your questions!
Thank you very much for your prompt and detailed response! It has greatly helped in clarifying my concerns and resolving the issues I was facing. I appreciate the time and effort you put into addressing my queries. 🙂🙂🙂
Hello @clemsgrs, I'm currently trying to replicate the work done in Richard's HIPT project. One critical step involves splitting raw histopathological slides at 20x magnification into a series of regions with dimensions [4096, 4096]. I chose to use hs2p to accomplish this task.
Steps to Reproduce:
Output:
torch.Size([30, 192])
Expected Result: I expected to get 30 patches of size [4096, 4096] based on the .pt file.
Actual Result: The tiles.csv file generated has 398 coordinates, which is significantly different from what the .pt file from the HIPT repository suggests.
I wonder if there's something wrong with my parameter settings or if there's any other reason for this discrepancy. Any insights would be greatly appreciated. Thank you!
Parameters Used:
Additional Question Regarding HIPT Replication: I noticed that you have successfully replicated the HIPT project. I have a question concerning the selection of histopathological slides for the self-supervised training in HIPT.
According to the paper, a total of 10,678 slides were used for training. It's clear that some slides from the TCGA database were discarded. Taking TCGA-BRCA as an example, the dataset has 1,133 slides, but I only found 1,038 .pt files for TCGA-BRCA in the HIPT repository. This indicates that close to 100 slides were not used.
Could you please shed some light on the criteria used for discarding certain slides? I'm curious to understand the rationale behind this selection process.
Thank you very much for your time and assistance.
Willing to Discuss Further: I'm very interested in your work and would love to discuss it further. If you're open to it, could we perhaps continue this discussion via email? My email address is [1179152040@qq.com]. I look forward to potentially collaborating or at least learning more about your research and the hs2p project.
Thank you once again for your time and your contributions to the community.