choosehappy / HistoQC

HistoQC is an open-source quality control tool for digital pathology slides
BSD 3-Clause Clear License
263 stars 105 forks source link

Add standalone TileExtractionModule and some misc fix #247

Closed CielAl closed 1 year ago

CielAl commented 1 year ago

Adding TileExtractionModule that locates each connected tissue region's tile bounding boxes, given the tile size and stride. Parameters:

Write:

Pytest: adding the corresponding testing function in test_pipeline_cli.py: image

Misc Fix encountered outside of the TileExtractionModule while running pytest: @choosehappy

  1. histoqc._import_openslide and also move the openslide binaries into a new "bin" folder, details see the issue: Change of behavior in DLL path searching in Python 3.8. Basically in python>=3.8, os.add_dll_directory must be used to locate the external libraries rather than just toggling the path variable.
  2. Simple sanitization of "size": https://github.com/choosehappy/HistoQC/blob/master/histoqc/BaseImage.py#L106. Reason: while the size variable is annotated as str, certain existing Modules (legacy) may not sanitize the inputs properly and therefore break the pipeline. For instance, the SaveModule uses a default integer "small_dim" parameter and it will raise the exception at [L122] (https://github.com/choosehappy/HistoQC/blob/master/histoqc/BaseImage.py#L122) which uses str methods.

Therefore, while a better solution is probably to define a protocol for all Modules, a simple fix, for now, is to sanitize and barricade invalid inputs in the BaseImage class.

nanli-emory commented 1 year ago

Hi @CielAl, There are some thoughts:

  1. TileExtractionModule saves all tissue tiles as images under the tiles folder which is overkill since the final tiles folder size could be larger than the slide files. For example, I ran 2 slides (HTT-TILS-001-26B.svs - 316.4 MB and TCGA-EJ-5518-01Z-00-DX1.svs - 480.9 MB) as tests. The size of HTT-TILS-001-26B.svs's tiles folder was 841 MB and The size of HTT-TILS-001-26B.svs's tiles folder was 4.81 GB. Obviously, It wastes time and space to save each of the tiles from the slide file since there is a tiles_coords_nested.json file to describe the position of each tile. I suggest only saving the tile-description JSON files. The users can read the tiles_coords_nested.json file with a slide file to retrieve each tissue tile.

  2. The tissue detection method still needs to improve since I can clearly see some misc-detected tissue tiles HTT-TILS-001-26B svs_tile_bbox

CielAl commented 1 year ago

Appreciate the feedback!!!

Hi @CielAl, There are some thoughts:

  1. TileExtractionModule saves all tissue tiles as images under the tiles folder which is overkill since the final tiles folder size could be larger than the slide files. For example, I ran 2 slides (HTT-TILS-001-26B.svs - 316.4 MB and TCGA-EJ-5518-01Z-00-DX1.svs - 480.9 MB) as tests. The size of HTT-TILS-001-26B.svs's tiles folder was 841 MB and The size of HTT-TILS-001-26B.svs's tiles folder was 4.81 GB. Obviously, It wastes time and space to save each of the tiles from the slide file since there is a tiles_coords_nested.json file to describe the position of each tile. I suggest only saving the tile-description JSON files. The users can read the tiles_coords_nested.json file with a slide file to retrieve each tissue tile.

Indeed! That's a primary reason the export-tile is optional and is controlled by the "save_image" bool flag. Perhaps for smaller slides (with fewer tissue regions) it may be feasible? Nonetheless, I can remove the explicit exportation of image tiles from the module since as you pointed out -- so long the users have the bboxes they can always retrieve the tiles using their own pipelines and save them into their preferred format (e.g., HDF5).

  1. The tissue detection method still needs to improve since I can clearly see some misc-detected tissue tiles HTT-TILS-001-26B svs_tile_bbox

This is more concerning. So far the TileExtractionModule uses the 'img_mask_use' stored in base image for processing. Could you also put the corresponding binary mask here so we can identify whether it's the mask that goes wrong or the extraction code missed certain regions?

nanli-emory commented 1 year ago

Appreciate the feedback!!!

Hi @CielAl, There are some thoughts:

  1. TileExtractionModule saves all tissue tiles as images under the tiles folder which is overkill since the final tiles folder size could be larger than the slide files. For example, I ran 2 slides (HTT-TILS-001-26B.svs - 316.4 MB and TCGA-EJ-5518-01Z-00-DX1.svs - 480.9 MB) as tests. The size of HTT-TILS-001-26B.svs's tiles folder was 841 MB and The size of HTT-TILS-001-26B.svs's tiles folder was 4.81 GB. Obviously, It wastes time and space to save each of the tiles from the slide file since there is a tiles_coords_nested.json file to describe the position of each tile. I suggest only saving the tile-description JSON files. The users can read the tiles_coords_nested.json file with a slide file to retrieve each tissue tile.

Indeed! That's a primary reason the export-tile is optional and is controlled by the "save_image" bool flag. Perhaps for smaller slides (with fewer tissue regions) it may be feasible? Nonetheless, I can remove the explicit exportation of image tiles from the module since as you pointed out -- so long the users have the bboxes they can always retrieve the tiles using their own pipelines and save them into their preferred format (e.g., HDF5).

Should set the default "save_image" bool flag as False?

  1. The tissue detection method still needs to improve since I can clearly see some misc-detected tissue tiles HTT-TILS-001-26B svs_tile_bbox

This is more concerning. So far the TileExtractionModule uses the 'img_mask_use' stored in base image for processing. Could you also put the corresponding binary mask here so we can identify whether it's the mask that goes wrong or the extraction code missed certain regions? Thanks for your clarification. I checked the corresponding binary mask. It is correct.

CielAl commented 1 year ago

Appreciate the feedback!!!

Hi @CielAl, There are some thoughts:

  1. TileExtractionModule saves all tissue tiles as images under the tiles folder which is overkill since the final tiles folder size could be larger than the slide files. For example, I ran 2 slides (HTT-TILS-001-26B.svs - 316.4 MB and TCGA-EJ-5518-01Z-00-DX1.svs - 480.9 MB) as tests. The size of HTT-TILS-001-26B.svs's tiles folder was 841 MB and The size of HTT-TILS-001-26B.svs's tiles folder was 4.81 GB. Obviously, It wastes time and space to save each of the tiles from the slide file since there is a tiles_coords_nested.json file to describe the position of each tile. I suggest only saving the tile-description JSON files. The users can read the tiles_coords_nested.json file with a slide file to retrieve each tissue tile.

Indeed! That's a primary reason the export-tile is optional and is controlled by the "save_image" bool flag. Perhaps for smaller slides (with fewer tissue regions) it may be feasible? Nonetheless, I can remove the explicit exportation of image tiles from the module since as you pointed out -- so long the users have the bboxes they can always retrieve the tiles using their own pipelines and save them into their preferred format (e.g., HDF5).

Should set the default "save_image" bool flag as False?

  1. The tissue detection method still needs to improve since I can clearly see some misc-detected tissue tiles HTT-TILS-001-26B svs_tile_bbox

This is more concerning. So far the TileExtractionModule uses the 'img_mask_use' stored in base image for processing. Could you also put the corresponding binary mask here so we can identify whether it's the mask that goes wrong or the extraction code missed certain regions? Thanks for your clarification. I checked the corresponding binary mask. It is correct.

Thanks for the feedback. I've changed the default value of save_image to False.