Open miketaormina opened 1 year ago
Hello @miketaormina, thanks for starting this conversation. In terms of implementation and the mentioned cons, I can share the following:
Additionally, I would like to say that there are plans to further optimize the fusion step to decrease computation times using the Code Ocean Pipeline feature. This would be feasible only with a zarr dataset per channel, since the idea is to take the registration transforms from the alignment step, copy them to N instances (`N = len(channels)`), and fuse the channels in parallel.
Finally, I would love to move forward with the multichannel OME-Zarr dataset, since this seems to be the right thing to do to match the NGFF specs. Nevertheless, this would take some weeks.
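For reference, a multichannel NGFF dataset would declare the channel dimension as a `c` axis in its `multiscales` metadata. A minimal sketch of what that could look like, assuming NGFF version 0.4 (the dataset path and scale values below are illustrative, not our actual pipeline output):

```python
# Hedged sketch of NGFF 0.4 "multiscales" metadata with channels consolidated
# onto the "c" axis. Axis names/types follow the spec; the dataset path "0"
# and the scale values are placeholders.
multiscales = [{
    "version": "0.4",
    "axes": [
        {"name": "t", "type": "time"},
        {"name": "c", "type": "channel"},
        {"name": "z", "type": "space", "unit": "micrometer"},
        {"name": "y", "type": "space", "unit": "micrometer"},
        {"name": "x", "type": "space", "unit": "micrometer"},
    ],
    "datasets": [{
        "path": "0",
        "coordinateTransformations": [
            {"type": "scale", "scale": [1.0, 1.0, 2.0, 1.8, 1.8]},
        ],
    }],
}]

# All channels would then live in one array under "0", indexed on axis 1 ("c").
axis_names = [ax["name"] for ax in multiscales[0]["axes"]]
print(axis_names)  # -> ['t', 'c', 'z', 'y', 'x']
```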
@sharmishtaa @dyf
Yes, we should do this in the next revision of the pipeline.
Thanks @camilolaiton , this is good background info and makes sense.
For what it's worth, I don't think you'll have a problem with multiple instances writing to the same zarr file, since it's actually a directory and not a file. That is, unless by instance you are referring to capsules, or if the AWS angle changes things. The format is pretty explicitly capable of this, as long as you're not writing to the same chunk. This is based on my understanding of zarr (here and here) and not any understanding of the implemented pipeline, though. I know you're using dask arrays, which might complicate it if you call the `to_zarr()` method.
Edit: of course I somehow glazed over this from a link above, so my bad:

> Zarr arrays have not been designed for situations where multiple readers and writers are concurrently operating on the same array.
Edit 2: not to ramble too much, but the documentation is a little confusing on this point: does the above apply to everything under the `.zarray` file, or are they referring to a chunk when they say array? The sentence immediately before this one implies it's possible:

> By data sink we mean that multiple concurrent write operations may occur, with each writer updating a different region of the array.
Do we have new opinions about this? @dyf @miketaormina
The new SmartSPIM pipeline follows what I mentioned here:
> Additionally, I would like to say that there are plans to further optimize the fusion step to decrease computation times using the Code Ocean Pipeline feature. This would be feasible only with a zarr dataset per channel, since the idea is to take the registration transforms from the alignment step, copy them to N instances (`N = len(channels)`), and fuse the channels in parallel.
At the moment, multiple computation instances are not able to communicate with each other in Code Ocean, but this could be something to look into in the future. However, it does not seem to be a priority right now.
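The fan-out itself is simple precisely because nothing has to be shared after alignment. A toy sketch of the pattern, where the channel names and transforms are placeholders and each worker stands in for a Code Ocean instance:

```python
from concurrent.futures import ThreadPoolExecutor

# Placeholder channel names and tile transforms; in the pipeline the
# transforms come from the alignment step and are identical for every channel.
channels = ["Ex_488_Em_525", "Ex_561_Em_593", "Ex_639_Em_667"]
transforms = {"tile_0": (0, 0), "tile_1": (0, 1024)}

def fuse_channel(channel):
    # Stand-in for fusing one channel's zarr dataset with the shared transforms.
    return f"{channel}: fused {len(transforms)} tiles"

# N = len(channels) workers, one per channel, with no communication
# between them during fusion.
with ThreadPoolExecutor(max_workers=len(channels)) as pool:
    results = list(pool.map(fuse_channel, channels))

print(results[0])  # -> Ex_488_Em_525: fused 2 tiles
```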
I wanted to start a discussion about how we organize data and how it conforms to the OME-Zarr specification. Namely, I wanted to ask: "should our spectral channels continue to be independent zarr files, or should they be on the `c` axis of a single NGFF spec file?" This, of course, doesn't need to be a priority, but I think the discussion is worthwhile. I can foresee the following pros/cons of consolidating the volume in this way:

Pros:

- Channel identity would come from the file's metadata, as opposed to doing a `glob('*Ex*.zarr')` type of search within a parent folder and parsing the file name for channel identity.

Cons:

- The data would live in `0, 1, 2, ...` folders instead of `Ex_488_Em_525.zarr, ...` folders.

I think that @dyf may have an opinion about the data organization/packaging aspects, and @camilolaiton would have insight into the complexity of implementation.
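To make the discovery pro concrete, here is a hedged sketch contrasting the two styles. The directory layout, file names, and the `omero` metadata dict below are illustrative, not our actual dataset structure:

```python
import glob
import os

# Layout A (today): one zarr per channel; identity parsed from the file name.
# Create dummy directories just for this illustration.
for name in ("Ex_488_Em_525", "Ex_561_Em_593"):
    os.makedirs(f"dataset/{name}.zarr", exist_ok=True)

channels_a = sorted(
    os.path.basename(p)[: -len(".zarr")]
    for p in glob.glob("dataset/*Ex*.zarr")
)

# Layout B (single NGFF file): identity read from the "omero" channel
# metadata (the labels here are made up for the example).
zattrs = {"omero": {"channels": [
    {"label": "Ex_488_Em_525"},
    {"label": "Ex_561_Em_593"},
]}}
channels_b = [ch["label"] for ch in zattrs["omero"]["channels"]]

print(channels_a == channels_b)  # -> True
```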