czbiohub-sf / iohub

Pythonic and parallelizable I/O for N-dimensional imaging data with OME metadata
https://czbiohub-sf.github.io/iohub/
BSD 3-Clause "New" or "Revised" License

Initial setup #1

Closed: ziw-liu closed this issue 1 year ago

ziw-liu commented 1 year ago

After migrating the io module of waveorder (at https://github.com/mehta-lab/waveorder/commit/5f60f0ad27e05596a6f6cb09ef2310f6bc00f236) to this new repository, the next steps may include:

Each of these can be elaborated on and debated in spin-off issues.

@mattersoflight @JoOkuma @royerloic @talonchandler @Christianfoley please feel free to add to or modify these objectives.

mattersoflight commented 1 year ago

I drafted the README. We should discuss the scope with the goal to reach an MVP and start to use the library in other repos. The library will evolve based on how it is used.

JoOkuma commented 1 year ago

One question about this item from the README

to ship data to deconvolution and DL pipelines.

Would a deep learning dataset class (i.e., a torch Dataset class) be part of this package? While I think it would be great to have shared deep-learning utilities, I'm concerned about bloating this package's dependencies.

mattersoflight commented 1 year ago

to ship data to deconvolution and DL pipelines.

Would a deep learning dataset class (i.e., a torch Dataset class) be part of this package? While I think it would be great to have shared deep-learning utilities, I'm concerned about bloating this package's dependencies.

I agree that this package should not provide a Dataset class. I was thinking of efficient thin wrappers around the data tree written by the library to enable succinct usage like this:

import iohub
import waveorder as wo
import dexp as dx
from microDL import trainer

# initialize reader and writer
LF, LS = iohub.reader(<path to zarr store or TIFF directory>, format='mantis')
reconstructions = iohub.writer(<path to reconstructions>)

# TCZYX order. The LF and LS stores are acquired in different coordinate systems and stored within subfolders.

# reconstruct each time point and write.
for t in range(LF.shape[0]):
    phase, retardance = wo.reconstruct(LF[t, ::])  # 4 channels * XYZ
    nuclei = dx.deconvolve(dx.deskew(LS[t, 0, ::]))  # 1 channel * XYZ
    membrane = dx.deconvolve(dx.deskew(LS[t, 1, ::]))  # 1 channel * XYZ
    reconstructions[t, 0, ::] = phase
    reconstructions[t, 1, ::] = retardance
    reconstructions[t, 2, ::] = nuclei
    reconstructions[t, 3, ::] = membrane

# train a model

trainer(input=reconstructions[:, 0, ::], target=reconstructions[:, 2, ::], <config parameters or file>)
# reconstructions appear as zarr or dask objects to the DL pipeline.
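
To keep deep-learning frameworks out of iohub's dependencies, the Dataset class could live in the downstream project and simply wrap the array handle that iohub returns. A minimal sketch, assuming reconstructions is a TCZYX zarr- or dask-backed array with NumPy-style indexing; the class name and channel assignments are hypothetical:

import numpy as np
import torch
from torch.utils.data import Dataset

class PhaseToNucleiDataset(Dataset):
    # Hypothetical downstream Dataset: yields (phase, nuclei) volume pairs, one per time point.
    def __init__(self, reconstructions, input_channel=0, target_channel=2):
        self.data = reconstructions
        self.input_channel = input_channel
        self.target_channel = target_channel

    def __len__(self):
        return self.data.shape[0]  # number of time points

    def __getitem__(self, t):
        # Reading one time point only materializes the chunks backing that volume.
        x = torch.from_numpy(np.asarray(self.data[t, self.input_channel]))
        y = torch.from_numpy(np.asarray(self.data[t, self.target_channel]))
        return x, y
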
mattersoflight commented 1 year ago

@ziw-liu, @JoOkuma, @talonchandler I recommend reading this preprint. The discussion of the chunked TIFF format is particularly relevant to the problem of passing data to analysis pipelines such as CellPose:

"Some success has been achieved with OME-TIFF, a 2D multi-resolution image format that captures acquisition metadata as OME-XML in the TIFF header 2,7,8. Reference software implementations are available in Java (https://github.com/ome/bioformats/), C++ (https://gitlab.com/codelibre/ome/ome-files-cpp) and Python (e.g., https://github.com/AllenCellModeling/aicsimageio, https://github.com/apeer-micro/apeer-ometiff-library, https://github.com/cgohlke/tifffile). OME-TIFF is supported by several commercial imaging companies (see https://www.openmicroscopy.org/commercial-partners/) and is the recommended format for public data projects like Image Data Resource (IDR) or Allen Institute of Cell Science, making their data available from https://open.quiltdata.com/b/allencell/.

As our and others’ use of existing tools for conversion to OME-TIFF grew, TIFF’s linear binary layout became a bottleneck. Larger files took increasingly long to write. This problem was most obvious in projects that required the conversion of large numbers of whole slide images from PFFs to OME-TIFF for use in data lakes that are used for AI training sets (https://pathlake.org/; https://icaird.com/). The need for a scalable conversion motivated our development of two tools, bioformats2raw (https://github.com/glencoesoftware/bioformats2raw) and raw2ometiff (https://github.com/glencoesoftware/raw2ometiff). Together they provide a parallel pipeline using Bio-Formats to convert any supported PFF into multi-resolution OME-TIFF. This is achieved by breaking images into atomic “chunks”, writing them independently to disk, and generating subresolutions from them when none are available, whereupon a second process can efficiently write these chunks into TIFF (Figure 1b)."

If we need to convert existing data to TIFF, we can write scripts that use some of the above tools and share them via iohub.
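
One possible shape for such a script is a thin wrapper around the two tools cited above. A minimal sketch, assuming the bioformats2raw and raw2ometiff executables are installed and on the PATH, with placeholder file names:

import subprocess

# Convert a proprietary file format into a chunked, multi-resolution Zarr store.
subprocess.run(["bioformats2raw", "acquisition.nd2", "intermediate.zarr"], check=True)

# Assemble the chunks into a multi-resolution OME-TIFF.
subprocess.run(["raw2ometiff", "intermediate.zarr", "acquisition.ome.tiff"], check=True)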

When the user wants to write data into TIFF, we can rely on tifffile.
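
A minimal sketch of that path, assuming a recent tifffile release that supports the ome and metadata keywords; the array and file name are placeholders:

import numpy as np
import tifffile

data = np.zeros((2, 4, 8, 256, 256), dtype=np.uint16)  # placeholder TCZYX stack
tifffile.imwrite(
    "reconstructions.ome.tif",
    data,
    ome=True,
    metadata={"axes": "TCZYX"},
)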

ziw-liu commented 1 year ago

I feel that we now have a clear path towards these goals. Closing in favor of specific issues.