Mismatch in the number of extracted patches compared to expected

Qing1Zhong commented 1 year ago

Hello @clemsgrs, I'm currently trying to replicate the work done in Richard's HIPT project. One critical step involves tiling raw histopathological slides at 20x magnification into a series of regions with dimensions [4096, 4096]. I chose hs2p to accomplish this task.

Steps to Reproduce:

I downloaded a raw slide (TCGA-2F-A9KO-01Z-00-DX1.195576CF-B739-4BD9-B15B-4A70AE287D3E.svs) from the TCGA database. Slide details are:

Magnification: 40
Level dimensions: ((135168, 105472), (33792, 26368), (8448, 6592), (2112, 1648))
Level downsamples: (1.0, 4.0, 16.0, 64.0)

Loaded a .pt file containing segmentation details (obtained from HIPT repository):

import torch
wsi_path = 'TCGA-2F-A9KO-01Z-00-DX1.195576CF-B739-4BD9-B15B-4A70AE287D3E.pt'
wsi = torch.load(wsi_path)
print(wsi.shape)

Output: torch.Size([30, 192])

Ran patch_extraction.py from hs2p.

Expected Result: I expected to get 30 patches of size [4096, 4096] based on the .pt file.

Actual Result: The tiles.csv file generated has 398 coordinates, which is significantly different from what the .pt file from the HIPT repository suggests.

I wonder if there's something wrong with my parameter settings or if there's any other reason for this discrepancy. Any insights would be greatly appreciated. Thank you!

Parameters Used:

slide_csv: '/home/xxx/CLAM/hs2p-master/slides.csv'
output_dir: 'output/debug' # folder where to save algorithm output
experiment_name: 'patch_extraction'
resume: False # whether or not to resume existing experiment

#backend: 'asap'
#backend: 'pyvips'
backend: 'openslide'

flags:
  patch: True # whether or not to extract patches from segmented tissue regions
  visu: True # whether or not to generate a .jpg image to visualize patching results
  verbose: False

seg_params:
  seg_level: -1 # downsample level on which to segment the WSI (-1 = uses the downsample level in the WSI closest to the following downsample parameter)
  downsample: 64 # if seg_level = -1, then uses this value to find the closest downsample level in the WSI for tissue segmentation computation
  sthresh: 8 # segmentation threshold (positive integer, using a higher threshold leads to less foreground and more background detection) (not used when use_otsu=True)
  mthresh: 7 # median filter size (positive, odd integer)
  close: 4 # additional morphological closing to apply following initial thresholding (positive integer)
  use_otsu: False # use otsu's method instead of simple binary thresholding
  save_mask: False # save tissue mask to disk as a .tif image
  visualize_mask: True # save a visualization of the tissue mask as a .jpg image
  tissue_pixel_value: 1 # value of tissue pixel in pre-computed segmentation masks

filter_params:
  ref_patch_size: 256 # reference patch size at spacing patch_params.spacing
  a_t: 100 # area filter threshold for tissue (positive integer, the minimum size of detected foreground contours to consider, relative to the reference patch size ref_patch_size, e.g. a value 10 means only detected foreground contours of size greater than 10 [ref_patch_size, ref_patch_size] sized patches at spacing patch_params.spacing will be processed)
  a_h: 16 # area filter threshold for holes (positive integer, the minimum size of detected holes/cavities in foreground contours to avoid, once again relative to the reference patch size ref_patch_size)
  max_n_holes: 10 # maximum of holes to consider per detected foreground contours (positive integer, higher values lead to more accurate patching but increase computational cost ; keeps the biggest holes)

vis_params:
  vis_level: -1 # downsample level to visualize the segmentation results (-1 = uses the downsample level in the WSI closest to the following downsample parameter)
  downsample: 64 # if vis_level = -1, then uses this value to find the closest downsample level in the WSI for tissue segmentation visualization
  downscale: 64 # downsample to visualize the result of patch extraction
  line_thickness: 200 # line thickness to draw the segmentation results (positive integer)

patch_params:
  spacing: 0.5 # pixel spacing (in micron/pixel) at which patches should be extracted (will find the level with spacing the closest to this value)
  patch_size: 4096 # patch size at previous pixel spacing
  overlap: 0.0 # percentage of overlap between two consecutive patches (float between 0 and 1)
  use_padding: True # whether to pad the border of the slide
  contour_fn: 'pct' # contour checking function to decide whether a patch should be considered foreground or background (choices between 'pct' - checks if the given patch has enough tissue using the following parameter as decision threshold, 'four_pt' - checks if all four points in a small grid around the center of the patch are inside the contour, 'center' - checks if the center of the patch is inside the contour, 'basic' - checks if the top-left corner of the patch is inside the contour)
  tissue_thresh: 0.1 # if contour_fn = 'pct', threshold used to filter out patches that have less tissue than this value (percentage)
  drop_holes: False # whether or not to drop patches whose center pixel falls within an identified hole
  save_patches_to_disk: False # whether or not to save patches as images to disk
  format: 'jpg' # if save_patches_to_disk = True, then saves patches in this file format
  draw_grid: True # whether to draw the patch grid when visualizing patching results
  grid_thickness: 1 # sets the grid thickness (in px) when visualizing patching results (256: 1, 4096: 2)
  bg_color: # which (r,g,b) values should be used to represent background when visualizing patching results
    - 214
    - 233
    - 238

speed:
  multiprocessing: True
  num_workers: 10 # number of process to start in parallel

wandb:
  enable: False
  project: 'hs2p'
  exp_name: '${experiment_name}'
  username: 'clemsg'
  dir: '/home/user'
  group:
  tags: []

# hydra
hydra:
  run:
    dir: /tmp/hydra_output

Additional Question Regarding HIPT Replication: I noticed that you have successfully replicated the HIPT project. I have a question concerning the selection of histopathological slides for the self-supervised training in HIPT.

According to the paper, a total of 10,678 slides were used for training. It's clear that some slides from the TCGA database were discarded. Taking TCGA-BRCA as an example, the dataset has 1,133 slides, but I only found 1,038 .pt files for TCGA-BRCA in the HIPT repository. This indicates that close to 100 slides were not used.
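To see exactly which slides have no matching feature file, a set difference on the file stems is enough. This is only a sketch: the function name is hypothetical, the file names in the example are made up for illustration, and it assumes each .pt file is named after its slide.

```python
from pathlib import Path


def slides_without_features(slide_names, feature_names):
    """Return the slide IDs that have no matching .pt feature file."""
    # Path.stem drops only the last suffix, so the dotted TCGA IDs stay intact
    slide_ids = {Path(name).stem for name in slide_names}
    feature_ids = {Path(name).stem for name in feature_names}
    return sorted(slide_ids - feature_ids)


# Hypothetical file names, for illustration only
slides = ["SLIDE-A.0001.svs", "SLIDE-B.0002.svs", "SLIDE-C.0003.svs"]
features = ["SLIDE-A.0001.pt", "SLIDE-C.0003.pt"]
missing = slides_without_features(slides, features)
print(missing)

# Number of TCGA-BRCA slides unaccounted for, from the counts quoted above
print(1133 - 1038)
```

In practice the two lists would come from something like `Path(slide_dir).glob("*.svs")` and `Path(feature_dir).glob("*.pt")`.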

Could you please shed some light on the criteria used for discarding certain slides? I'm curious to understand the rationale behind this selection process.

Thank you very much for your time and assistance.

Willing to Discuss Further: I'm very interested in your work and would love to discuss it further. If you're open to it, could we perhaps continue this discussion via email? My email address is [1179152040@qq.com]. I look forward to potentially collaborating or at least learning more about your research and the hs2p project.

Thank you once again for your time and your contributions to the community.

clemsgrs commented 1 year ago

hi @Qing1Zhong,

I tried to reproduce the hs2p results on slide TCGA-2F-A9KO-01Z-00-DX1.195576CF-B739-4BD9-B15B-4A70AE287D3E with the parameters that you shared. I indeed get 398 patches.

A good place to start to debug results is to have a look at the generated visualisation:

[Image: patching visualisation for slide TCGA-2F-A9KO-01Z-00-DX1.195576CF-B739-4BD9-B15B-4A70AE287D3E]

I couldn't see anything obviously wrong in the visualisation, so I manually checked some properties of the slide. Unfortunately, the slide only has the following spacings: [0.228, 0.911, 3.643, 14.573]. In other words, the 0.5 spacing is missing.
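For reference, the per-level spacings can be derived from the base spacing and the level downsamples. The snippet below is a rough sketch, not what hs2p does internally: the helper names are hypothetical, `read_spacings` assumes the openslide-python package is installed, and the base value of 0.2277 mpp is back-computed from the rounded spacings listed above.

```python
def level_spacings(base_mpp, level_downsamples):
    """Return the pixel spacing (microns/pixel) of each pyramid level."""
    return [round(base_mpp * d, 3) for d in level_downsamples]


def read_spacings(slide_path):
    """Read the spacings of a slide on disk (requires openslide-python)."""
    import openslide  # imported lazily so level_spacings works without it

    slide = openslide.OpenSlide(slide_path)
    base_mpp = float(slide.properties[openslide.PROPERTY_NAME_MPP_X])
    return level_spacings(base_mpp, slide.level_downsamples)


# Assumed base spacing (0.2277 mpp) with the downsamples reported for this slide
spacings = level_spacings(0.2277, (1.0, 4.0, 16.0, 64.0))
print(spacings)
```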

Patches are extracted at the slide spacing closest to the value given as parameter. With 0.50 specified, hs2p actually extracts patches at 0.228 mpp, which is the closest spacing available in this slide. At 0.228 mpp, a (4096, 4096) pixel region covers an area of approximately 872 147 square microns. At 0.50 mpp, the same region would cover 4 194 304 square microns. That's approximately 5 times bigger (about 4.8 to be precise): one (4096, 4096) region extracted at 0.50 mpp covers as much tissue as roughly five (4096, 4096) regions extracted at 0.228 mpp. This is why we end up with (many) more regions than the HIPT authors.
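The arithmetic can be checked with a few lines of Python. This is a minimal sketch with hypothetical helper names; it only mirrors the "closest spacing" rule described above and is not taken from the hs2p code.

```python
def closest_spacing(available, target):
    """Pick the available spacing with the smallest absolute gap to the target."""
    return min(available, key=lambda s: abs(s - target))


def region_area_um2(size_px, mpp):
    """Area in square microns covered by a square region of size_px pixels."""
    side_um = size_px * mpp
    return side_um * side_um


spacings = [0.228, 0.911, 3.643, 14.573]
used = closest_spacing(spacings, 0.5)        # spacing actually used
area_used = region_area_um2(4096, used)      # area covered at that spacing
area_target = region_area_um2(4096, 0.5)     # area covered at the requested 0.50 mpp
ratio = area_target / area_used

print(used, round(area_used), round(area_target), round(ratio, 2))
```

The ratio is simply (0.50 / 0.228) squared, which is where the factor of about five comes from.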

In theory, the following warning should have popped up when running hs2p:

WARNING! The closest natural spacing to the target spacing was more than 20.0% appart.

But with changes I introduced recently, the warning may not have popped up (I'll try to fix it).
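Until that is fixed, the same check is easy to run by hand. The sketch below is an assumption about how such a tolerance check could look (hypothetical function name, gap measured relative to the target spacing); it is not the actual hs2p implementation.

```python
def spacing_mismatch(target, natural, tolerance=0.20):
    """Return the relative gap to the target and whether it exceeds the tolerance."""
    relative_gap = abs(natural - target) / target
    return relative_gap, relative_gap > tolerance


gap, should_warn = spacing_mismatch(target=0.5, natural=0.228)
if should_warn:
    print(f"closest natural spacing is {gap:.1%} away from the target spacing")
```

For this slide the gap is well above the 20% threshold, so the warning would be expected.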

What could be a solution? The easiest thing that comes to my mind is the following:

1- identify all slides that are missing the 0.5 spacing
2- extract (8192, 8192) regions at 0.228 mpp for these slides
3- resize them to (4096, 4096) to mimic the expected downsampling
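The bookkeeping behind these three steps can be sketched as follows. All function names are hypothetical and the 20% tolerance is borrowed from the warning above. Note that reading 8192 pixels and halving gives an effective spacing of about 0.456 mpp, close to but not exactly 0.50; matching 0.50 exactly would need a slightly larger read size.

```python
def needs_workaround(spacings, target=0.5, tolerance=0.20):
    """Step 1: True when no level lies within `tolerance` of the target spacing."""
    return all(abs(s - target) / target > tolerance for s in spacings)


def effective_spacing(read_px, out_px, native_mpp):
    """Steps 2-3: spacing obtained after resizing read_px pixels down to out_px."""
    return native_mpp * read_px / out_px


def exact_read_size(out_px, target_mpp, native_mpp):
    """Side length to read at native_mpp to cover out_px pixels at target_mpp."""
    return round(out_px * target_mpp / native_mpp)


flagged = needs_workaround([0.228, 0.911, 3.643, 14.573])
approx_mpp = effective_spacing(8192, 4096, 0.228)
exact_px = exact_read_size(4096, 0.5, 0.228)
print(flagged, approx_mpp, exact_px)
```

The actual read and resize would then be done with whatever backend is in use, for instance OpenSlide's `read_region` followed by a Pillow `resize`.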

I didn't face this problem as I only tried to reproduce the TCGA-BRCA results (i.e. using breast slides). It seems all TCGA-BRCA slides had a spacing close to 0.50 mpp.

Regarding your additional question: it's not clear to me either what happened to some slides. Someone had raised a similar question on the official repo (https://github.com/mahmoodlab/HIPT/issues/6#issuecomment-1175787362). I've listed hereunder some further possible explanations:

- as stated in the answer to the issue linked above, this could be due to patching irregularities: slides with insufficient tissue content for patching get excluded
- I don't think any slides were excluded because they were used for pretraining (and hence, in theory, cannot be used for downstream training/tuning), as the authors state in the conclusion of their paper: "ViT256-16 pretraining performed on almost all of TCGA and evaluation lacking independent test cohorts"

Let me know if this answers your questions!

Qing1Zhong commented 1 year ago

Thank you very much for your prompt and detailed response! It has greatly helped in clarifying my concerns and resolving the issues I was facing. I appreciate the time and effort you put into addressing my queries. 🙂🙂🙂

clemsgrs / hs2p

Mismatch in the number of extracted patches compared to expected #13