czbiohub-sf / shrimPy

shrimPy: Smart High-throughput Robust Imaging & Measurement in Python

Convert to zarr as an initial analysis step #42

Closed. talonchandler closed this issue 1 year ago.

talonchandler commented 1 year ago

All of the current analysis steps (deskew, estimate-deskew, estimate-bleaching) read the raw .ndtiff as input, which was a pragmatic choice when we started.

We're now reaching a point where this is not scaling as well as we'd like. We don't have (or I don't know about) good tools for parallelizing an operation over positions/timepoints in an NDTiff, so I suggest we make our first analysis step an iohub convert to zarr (for both label-free and light-sheet data) so that we can more easily run parallel jobs over the raw data.

To close this issue, I will change our analysis steps (deskew, estimate-deskew, estimate-bleaching) to read single-position zarrs by default. I think this will give us the most flexibility to parallelize over positions on the HPC or with multiprocessing.
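For concreteness, a minimal sketch of what the multiprocessing route could look like, assuming iohub's open_ome_zarr reader and a converted HCS store; deskew_one and the store path are illustrative stand-ins, not the final API:

```python
# Sketch: parallelize deskew over the positions of a converted zarr store.
# Assumes iohub's open_ome_zarr; deskew_one is a hypothetical placeholder
# for the real deskew transform.
from multiprocessing import Pool

from iohub import open_ome_zarr


def deskew_one(args):
    """Read one position's (T, C, Z, Y, X) array and deskew it (placeholder)."""
    store_path, pos_name = args
    with open_ome_zarr(f"{store_path}/{pos_name}", mode="r") as pos:
        raw = pos.data[:]  # load only this position into memory
    # ... apply the actual deskew transform to `raw` here ...
    return pos_name, raw.shape


if __name__ == "__main__":
    store_path = "2022_04_28.zarr"  # illustrative path
    with open_ome_zarr(store_path, mode="r") as plate:
        names = [name for name, _ in plate.positions()]
    with Pool(processes=8) as pool:
        for name, shape in pool.imap_unordered(
            deskew_one, [(store_path, n) for n in names]
        ):
            print(f"deskewed {name}: {shape}")
```

The same per-position function could be submitted as independent HPC jobs instead of pool workers, since each position is a self-contained zarr group.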

@edyoshikun @ziw-liu @ieivanov please comment if you have any concerns or better approaches. My minor concern is duplicating the data on disk, but I think it's worth it: e.g., the 750 GB 2022-04-28 dataset will take several days to deskew without a better parallelization strategy.

ziw-liu commented 1 year ago

Related: #27, #33

mattersoflight commented 1 year ago

@talonchandler I also think that converting to zarr is the right first step to parallelize and standardize downstream analysis and visualization. At this point, we can be confident that we will capture all of the pixel data correctly.

It is worth thinking through how to capture all relevant metadata. One solution is to scrape all of the metadata provided by Micro-Manager and save it with the zarr store.
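Since zarr group attributes are stored as JSON, the raw Micro-Manager metadata could be attached directly to the converted store. A minimal sketch with plain zarr; the attribute key and the placeholder dict are illustrative:

```python
# Sketch: attach raw Micro-Manager metadata to the converted store as
# zarr group attributes. Key name and placeholder dict are illustrative.
import zarr

# Stand-in for whatever summary-metadata dict the NDTiff reader exposes.
mm_summary_metadata = {"PixelSizeUm": 0.325, "Slices": 96}

root = zarr.open_group("converted.zarr", mode="a")
# zarr attrs are serialized to JSON, which MM metadata already is.
root.attrs["micromanager_summary_metadata"] = mm_summary_metadata
```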

ieivanov commented 1 year ago

I think this generally makes sense. Does iohub convert provide a way to convert only a portion of the data to zarr, say P=2, T=0? My quick reading says no. It would be good for us to have a way to get quick results, and having to convert 750 GB of data before we see any reconstructions would be too much. The alternative is to load a portion of the data in a script and operate on numpy arrays rather than going through the CLI.

I still think iohub convert should provide a way to convert only a portion of the data. Sometimes acquisitions fail, so it's useful to convert only the first few positions or time points. In the meantime, the script route could look roughly like the sketch below.
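A minimal sketch of that script route, assuming the ndtiff package's Dataset reader; the path and the axis keyword names ("position", "time") are illustrative and depend on how the acquisition was set up:

```python
# Sketch: read just P=2, T=0 from the raw NDTiff without converting the
# whole acquisition first. Assumes the ndtiff package's Dataset API.
import numpy as np
from ndtiff import Dataset

dataset = Dataset("/path/to/raw_ndtiff")  # illustrative path
# as_array returns a lazy dask array restricted to the requested axes.
subset = dataset.as_array(position=2, time=0)  # roughly (C, Z, Y, X)
volume = np.asarray(subset)  # materialize only this slice in memory
print(volume.shape, volume.dtype)
```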

talonchandler commented 1 year ago

I 100% agree that an in-progress or partial conversion to zarr would be very useful. @ziw-liu can you comment on how feasible this is, especially for ndtiff datasets?

ziw-liu commented 1 year ago

> I 100% agree that an in-progress or partial conversion to zarr would be very useful. @ziw-liu can you comment on how feasible this is, especially for ndtiff datasets?

Let's move this discussion to an iohub issue.