AllenNeuralDynamics / aind-smartspim-stitch

Stitching and fusion pipeline in the cloud

Channel organization and OME-Zarr #29

Open · miketaormina opened 1 year ago

miketaormina commented 1 year ago

I wanted to start a discussion about how we organize data and how it conforms to the OME-Zarr specification. Namely, I wanted to ask "should our spectral channels continue to be independent zarr files, or should they be on the c axis of a single NGFF spec file?"
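For concreteness, here is a minimal sketch of what the consolidated layout could look like when written with ome-zarr-py (the shape, dtype, and output path are made up for illustration, not taken from the pipeline):

```python
import numpy as np
import zarr
from ome_zarr.io import parse_url
from ome_zarr.writer import write_image

# Hypothetical example: two spectral channels stacked on the c axis of a
# single OME-Zarr group, instead of one zarr store per channel.
data = np.random.randint(0, 2**16, size=(2, 64, 256, 256), dtype=np.uint16)  # (c, z, y, x)

store = parse_url("multichannel.zarr", mode="w").store
root = zarr.group(store=store)
# write_image builds the multiscale pyramid and the NGFF metadata,
# including the axes entry that declares the c dimension.
write_image(image=data, group=root, axes="czyx")
```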

This, of course, doesn't need to be a priority, but I think the discussion is worthwhile. I can foresee the following pros/cons of consolidating the volume in this way:

Pros:

Cons:

I think that @dyf may have an opinion about the data organization/packaging aspects and @camilolaiton would have insight into the complexity of implementation.

camilolaiton commented 1 year ago

Hello @miketaormina, thanks for starting this conversation. In terms of implementation and the mentioned cons, I can share the following:

Additionally, I would like to say that there are plans to further optimize the fusion step to decrease computation times using the Code Ocean Pipeline feature. This would be feasible only with a zarr dataset per channel, since the idea is to use the registration transforms from the alignment step, copy them to N instances (N = len(channels)), and fuse the channels in parallel.
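A rough sketch of that per-channel fan-out (the paths, channel names, and the fuse_channel helper are hypothetical; in the real pipeline each call would run on its own instance):

```python
import dask.array as da
import numpy as np

def fuse_channel(channel: str, transforms: dict, out_root: str) -> None:
    """Fuse one channel into its own zarr store (hypothetical helper)."""
    # The real step would warp and blend the channel's tiles using
    # `transforms`; a random volume stands in for the fused result here.
    fused = da.random.random((128, 512, 512), chunks=(64, 256, 256))
    # One store per channel, so instances never write to the same zarr.
    fused.to_zarr(f"{out_root}/{channel}.zarr", overwrite=True)

transforms = {"tile_000": np.eye(4)}  # placeholder registration transforms
for channel in ["Ex_488_Em_525", "Ex_561_Em_600"]:  # one instance per channel
    fuse_channel(channel, transforms, "fused")
```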

Finally, I would love to move forward with the multichannel OME-Zarr dataset, since this seems to be the right thing to do to match the NGFF spec. Nevertheless, this would take some weeks.

@sharmishtaa @dyf

dyf commented 1 year ago

Yes, we should do this in the next revision of the pipeline.

miketaormina commented 1 year ago

Thanks @camilolaiton, this is good background info and makes sense.

For what it's worth, I don't think you'll have a problem with multiple instances writing to the same zarr file, since it's actually a directory and not a file. That is, unless by "instance" you mean capsules, or unless the AWS angle changes things. The format is pretty explicitly capable of this, as long as you're not writing to the same chunk. This is based on my understanding of zarr (here and here), not on any understanding of the implemented pipeline, though. I know you're using dask arrays, which might complicate it if you call the to_zarr() method.

Edit: of course I somehow glossed over this in a link above, so my bad: "Zarr arrays have not been designed for situations where multiple readers and writers are concurrently operating on the same array."

Edit 2: not to ramble too much, but the documentation is a little confusing on this point: does the above apply to everything under the .zarray file, or are they referring to a chunk when they say "array"? The sentence immediately before that one implies concurrent writes are possible: "By data sink we mean that multiple concurrent write operations may occur, with each writer updating a different region of the array."
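To make the chunk caveat concrete, here is a toy sketch of two processes writing disjoint, chunk-aligned regions of the same zarr array (no locking needed precisely because no two writers ever touch the same chunk; the path and shapes are made up):

```python
import zarr
from concurrent.futures import ProcessPoolExecutor

def write_block(path: str, start: int, stop: int, value: int) -> None:
    # Each worker reopens the store and writes a region aligned to chunk
    # boundaries, so no two workers ever modify the same chunk file.
    arr = zarr.open(path, mode="r+")
    arr[start:stop] = value

if __name__ == "__main__":
    path = "demo.zarr"
    zarr.open(path, mode="w", shape=(100,), chunks=(50,), dtype="i4")
    with ProcessPoolExecutor() as pool:
        pool.submit(write_block, path, 0, 50, 1).result()   # chunk 0 only
        pool.submit(write_block, path, 50, 100, 2).result()  # chunk 1 only
```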

camilolaiton commented 6 months ago

Do we have new opinions about this? @dyf @miketaormina

The new SmartSPIM pipeline follows what I mentioned here:

Additionally, I would like to say that there are plans to further optimize the fusion step to decrease computation times using the Code Ocean Pipeline feature. This would be feasible only with a zarr dataset per channel, since the idea is to use the registration transforms from the alignment step, copy them to N instances (N = len(channels)), and fuse the channels in parallel.

At the moment, multiple computation instances are not able to communicate with each other in Code Ocean, but this could be something to look into in the future. However, it does not seem to be a priority right now.