AllenNeuralDynamics / aind-smartspim-stitch

Stitching and fusion pipeline in the cloud

Channel organization and OME-Zarr #29

Open · miketaormina opened 1 year ago

miketaormina commented 1 year ago

I wanted to start a discussion about how we organize data and how it conforms to the OME-Zarr specification. Namely, I wanted to ask "should our spectral channels continue to be independent zarr files, or should they be on the c axis of a single NGFF spec file?"
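For concreteness, here is a minimal sketch of what the consolidated layout could look like when written with ome-zarr-py (the shape, dtype, and output path are made up for illustration, not taken from the pipeline):

```python
import numpy as np
import zarr
from ome_zarr.io import parse_url
from ome_zarr.writer import write_image

# Hypothetical example: two spectral channels stacked on the c axis of a
# single OME-Zarr group, instead of one zarr store per channel.
data = np.random.randint(0, 2**16, size=(2, 64, 256, 256), dtype=np.uint16)  # (c, z, y, x)

store = parse_url("multichannel.zarr", mode="w").store
root = zarr.group(store=store)
# write_image builds the multiscale pyramid and the NGFF metadata,
# including the axes entry that declares the c dimension.
write_image(image=data, group=root, axes="czyx")
```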

This, of course, doesn't need to be a priority, but I think the discussion is worthwhile. I can foresee the following pros/cons of consolidating the volume in this way:

Pros:

Cons:

I think that @dyf may have an opinion about the data organization/packaging aspects and @camilolaiton would have insight into the complexity of implementation.

camilolaiton commented 1 year ago

Hello @miketaormina, thanks for starting this conversation. In terms of implementation and the mentioned cons, I can share the following:

Additionally, I would like to say that there are plans to further optimize the fusion step to decrease computation times using the Code Ocean Pipeline feature. This would be feasible only with a zarr dataset per channel, since the idea is to use the registration transforms from the alignment step, copy them to N instances (N = len(channels)), and fuse the channels in parallel.
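A rough sketch of that per-channel fan-out (the paths, channel names, and the fuse_channel helper are hypothetical; in the real pipeline each call would run on its own instance):

```python
import dask.array as da
import numpy as np

def fuse_channel(channel: str, transforms: dict, out_root: str) -> None:
    """Fuse one channel into its own zarr store (hypothetical helper)."""
    # The real step would warp and blend the channel's tiles using
    # `transforms`; a random volume stands in for the fused result here.
    fused = da.random.random((128, 512, 512), chunks=(64, 256, 256))
    # One store per channel, so instances never write to the same zarr.
    fused.to_zarr(f"{out_root}/{channel}.zarr", overwrite=True)

transforms = {"tile_000": np.eye(4)}  # placeholder registration transforms
for channel in ["Ex_488_Em_525", "Ex_561_Em_600"]:  # one instance per channel
    fuse_channel(channel, transforms, "fused")
```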

Finally, I would love to move forward with the multichannel OME-Zarr dataset, since this seems to be the right thing to do to match the NGFF spec. Nevertheless, this would take some weeks.

@sharmishtaa @dyf

dyf commented 1 year ago

Yes, we should do this in the next revision of the pipeline.

miketaormina commented 1 year ago

Thanks @camilolaiton, this is good background info and makes sense.

For what it's worth, I don't think you'll have a problem with multiple instances writing to the same zarr file, since it's actually a directory and not a file. That is, unless by "instance" you mean capsules, or unless the AWS angle changes things. The format is pretty explicitly capable of this, as long as you're not writing to the same chunk. This is based on my understanding of zarr (here and here), not on any understanding of the implemented pipeline, though. I know you're using dask arrays, which might complicate it if you call the to_zarr() method.

Edit: of course I somehow glossed over this in a link above, so my bad: "Zarr arrays have not been designed for situations where multiple readers and writers are concurrently operating on the same array."

Edit 2: not to ramble too much, but the documentation is a little confusing on this point: does the above apply to everything under the .zarray file, or are they referring to a chunk when they say "array"? The sentence immediately before that one implies concurrent writes are possible: "By data sink we mean that multiple concurrent write operations may occur, with each writer updating a different region of the array."
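To make the chunk caveat concrete, here is a toy sketch of two processes writing disjoint, chunk-aligned regions of the same zarr array (no locking needed precisely because no two writers ever touch the same chunk; the path and shapes are made up):

```python
import zarr
from concurrent.futures import ProcessPoolExecutor

def write_block(path: str, start: int, stop: int, value: int) -> None:
    # Each worker reopens the store and writes a region aligned to chunk
    # boundaries, so no two workers ever modify the same chunk file.
    arr = zarr.open(path, mode="r+")
    arr[start:stop] = value

if __name__ == "__main__":
    path = "demo.zarr"
    zarr.open(path, mode="w", shape=(100,), chunks=(50,), dtype="i4")
    with ProcessPoolExecutor() as pool:
        pool.submit(write_block, path, 0, 50, 1).result()   # chunk 0 only
        pool.submit(write_block, path, 50, 100, 2).result()  # chunk 1 only
```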

camilolaiton commented 6 months ago

Do we have new opinions about this? @dyf @miketaormina

The new SmartSPIM pipeline follows what I mentioned here:

Additionally, I would like to say that there are plans to further optimize the fusion step to decrease computation times using the Code Ocean Pipeline feature. This would be feasible only with a zarr dataset per channel, since the idea is to use the registration transforms from the alignment step, copy them to N instances (N = len(channels)), and fuse the channels in parallel.

At the moment, multiple computation instances are not able to communicate with each other in Code Ocean, but this could be something to look into in the future. However, it does not seem to be a priority right now.