Dana-Farber-AIOS / pathml

Tools for computational pathology
https://pathml.org
GNU General Public License v2.0

Error in SlideDataset.run #126

Closed MohamedOmar2020 closed 3 years ago

MohamedOmar2020 commented 3 years ago

I am trying to run a preprocessing pipeline on a dataset of 70 CODEX images, each with shape (1920, 1440, 1, 4, 23), using the following code:

dataset.run(pipe, distributed = True, client = client, tile_size=(1920, 1440), tile_pad=False)

But I am getting this error:

ValueError: could not broadcast input array from shape (4,1920,1440) into shape (1920,1440,4)

The command runs fine when I pass a single number for tile_size.

jacob-rosenthal commented 3 years ago

This wasn't caught by our test suite because we were only testing BioformatsBackend with .tif files. However, Bioformats behaves differently for a .qptiff image (the shape is HWZCT instead of HWC). I added tests using a .qptiff file in a new branch, "issue_126".
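For illustration, here is a minimal NumPy sketch of this kind of axis-order mismatch. It is not the pathml code: the helper `to_hwc` and the small array sizes are hypothetical, chosen only to reproduce the same class of broadcast error as in the report (channel-first data written into a channel-last buffer).

```python
import numpy as np


def to_hwc(tile):
    """Hypothetical helper: move a channel-first (C, H, W) tile to (H, W, C)."""
    return np.moveaxis(tile, 0, -1)


# Small stand-ins for the reported (4, 1920, 1440) -> (1920, 1440, 4) case.
h, w, c = 6, 5, 4
tile_chw = np.zeros((c, h, w))   # channel-first data, as in the error message
out = np.empty((h, w, c))        # destination buffer expects channel-last

try:
    out[...] = tile_chw          # shapes cannot be broadcast together
    raised = False
except ValueError as err:
    raised = True
    print(err)

# Reordering the axes first makes the assignment valid.
out[...] = to_hwc(tile_chw)
```

The point of the sketch is only that the assignment fails until the axes are reordered; where the real reordering belongs in the backend is a separate question.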

@ryanccarelli Once the tests pass we should be good for a PR to fix this bug

MohamedOmar2020 commented 3 years ago

Thanks @jacobrosenthaldfci. I also have another issue with the same command when using distributed = True. I receive the following error message:

TypeError: cannot pickle '_thread.RLock' object

jacob-rosenthal commented 3 years ago

I think that error occurs when the Pipeline contains a Transform that uses a TensorFlow model (e.g. the Mesmer segmentation model). When you use distributed=True, Dask pickles data to send it across the cluster. TF models can't be pickled, which is why this error comes up.

We should add a better check for this so you get a more informative error message. In the meantime, you should probably use distributed=False.
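As a rough sketch of what such a check could look like, using only the standard library: the function name `is_picklable` and the two dictionaries are hypothetical stand-ins, not part of pathml or Dask. The lock mimics an object that holds thread state, which is what the `_thread.RLock` in the error message refers to.

```python
import pickle
import threading


def is_picklable(obj):
    """Hypothetical pre-check: return True if obj survives pickle.dumps."""
    try:
        pickle.dumps(obj)
        return True
    except (TypeError, AttributeError, pickle.PicklingError):
        return False


# Stand-in for a transform whose state holds a lock (assumed to resemble a
# model-backed transform); plain parameters pickle without trouble.
with_lock = {"name": "segmentation", "lock": threading.RLock()}
plain = {"name": "blur", "sigma": 2.0}

for candidate in (with_lock, plain):
    if not is_picklable(candidate):
        print(f"{candidate['name']}: cannot be pickled, use distributed=False")
```

A check along these lines, run before submitting work to the cluster, would let the error message name the offending transform instead of surfacing the raw pickling failure.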

MohamedOmar2020 commented 3 years ago

The tile_size issue is now fixed, thanks!

jacob-rosenthal commented 3 years ago

This is fixed now by #132