AhmetCanSolak opened 1 year ago
We do need a unified reader module for the diverse datasets we have, which may or may not be a universal entry point. We have accumulated significant technical debt in the form of a diversity of data formats. One way to prevent further accumulation of technical debt is to read the legacy formats (which typically vary in terms of the hierarchy of groups, order of dimensions in ND arrays, and metadata format), but write only with the new writer implemented in iohub.
@ziw-liu after #31, let's discuss how to integrate readers implemented in waveOrder and dexp in iohub.
We will need to divide and conquer the work of adding support for the different formats.
@AhmetCanSolak Thanks for moving this discussion over.
I also mentioned the possibility of something like `skimage.io.imread` during an offline meeting. And I think this functional entry point may alleviate some of the OOP overhead/weirdness we currently have in the code base. With limited refactoring it can be a function that does the format guessing/dispatching and returns a format-specific reader object (whose type is a child of the abstract `ReaderBase` that provides a mostly uniform interface) as the `WaveorderReader` does, but without the extra layer of API delegation.
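A minimal sketch of that guessing/dispatching idea, assuming a registry of format-specific readers; the names here (`detect_format`, `register_reader`, the reader classes) are hypothetical illustrations, not the actual iohub API:

```python
from pathlib import Path

# Hypothetical registry mapping a detected format name to a reader class.
# In iohub the concrete readers would be children of ReaderBase.
_READERS = {}

def register_reader(fmt):
    """Class decorator registering a reader class for a format name."""
    def wrap(cls):
        _READERS[fmt] = cls
        return cls
    return wrap

def detect_format(path):
    """Very rough format guessing based on the file/directory name."""
    name = Path(path).name
    if name.endswith((".ome.tif", ".ome.tiff")):
        return "ome-tiff"
    if name.endswith(".zarr"):
        return "ome-zarr"
    raise ValueError(f"Cannot guess format of {path!r}")

def imread(path):
    """Functional entry point: guess the format and return a
    format-specific reader object with a mostly uniform interface."""
    return _READERS[detect_format(path)](path)

@register_reader("ome-tiff")
class OMETiffReader:
    def __init__(self, path):
        self.path = path

@register_reader("ome-zarr")
class OMEZarrReader:
    def __init__(self, path):
        self.path = path
```

The point of the registry is that adding support for another legacy format only requires registering one more reader class, without touching the `imread` entry point itself.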
For the 'universal dataset class' I feel that it cannot and probably should not be feature-complete for the heterogeneous format space we have. And there is a lot to learn from aicsimageio, which attempts to provide a universal interface to many more biomedical imaging data formats.
I've started working on the `imread` idea in #71. Still WIP, but early-stage input is very welcome!
@ziw-liu and I just met to discuss our plan for `imread`, especially wrt #90. Should `imread` provide a universal entry point that returns different objects depending on the underlying file?
We're both concerned that this will lead to an API that is destined to change. If `imread` is a thin wrapper on the old `reader.__init__` and the new `open_ome_zarr`, then `iohub`'s users will need to handle both of these cases, which removes most of the value of having a universal API.
We discussed the pathway to a universal API, and we agreed that there's a lot of engineering work remaining. @ziw-liu expects the `ReaderBase` class will need to be rewritten, and all of the child classes will need to be reworked before we can have an API that is anywhere close to universal.
Given these constraints and our release timeline, we're now leaning towards having `iohub info` be our human-readable "universal entry point". We can quite easily make `iohub info` work for every dataset, and in the `iohub info` output we can print two lines of code that show the user how to open the dataset in question. For example, `iohub info my.ome.tif` would print:

```
***shape info here***
Open this data set with the following lines:
$ python
>>> from iohub.reader import Reader
>>> dataset = Reader('./my.ome.tif')
>>> tczyx_array = dataset.get_position(0)  # or possibly an iteration through positions
```
while running `iohub info my.ome.zarr` would print:

```
***shape info here***
Open this data set with:
$ python
>>> from iohub.ngff import open_ome_zarr
>>> dataset = open_ome_zarr('./my.ome.zarr')
>>> tczyx_array = dataset["0"]  # or possibly an iteration through positions
```
In many ways this solution is a bandaid. But I think it provides the shortest and simplest path for users to (1) see if `iohub` will work for their dataset, and (2) find the relevant usage API that lets them access their data in a script/package. I also think that it is an appropriate solution for a 0.1 release that is attempting to unify a large existing codebase that already supports many use cases.
This and #92 would be great to discuss at the next iohub meeting; I will add them to the agenda.
Hi @ziw-liu, How about we divide this issue in two - one to unify the API for zarr formats and the other to unify API for tiff formats?
This specific issue can focus on unifying the array-like read/write API for zarr formats, i.e., ome-zarr v0.1 and v0.4. Once this closes, we can move forward with the first pip-installable `iohub`.
All of the tiff-related issues can be addressed in the release to follow.
PS: for custom zarr formats (e.g. PTI), I think it is sufficient to add converters as needed.
> This specific issue can focus on unifying the array-like read/write API for zarr formats, i.e., ome-zarr v0.1 and v0.4.
My understanding was that we intentionally do not support writing NGFF v0.1 to promote transition to the new version. Is that still the plan?
> Hi @ziw-liu, How about we divide this issue in two - one to unify the API for zarr formats and the other to unify API for tiff formats?
Yeah that breaks down things a bit. I suggest we use #53 for TIFF. I already posted a prototype example there, more discussion is welcome!
> My understanding was that we intentionally do not support writing NGFF v0.1 to promote transition to the new version. Is that still the plan?
Yes! I am thinking that `open_ome_zarr(format="0.1")` supports only the read-only mode.
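A hedged sketch of what that version gate could look like; the `open_ome_zarr` below is a stand-in with an assumed `version` keyword, not the real iohub signature:

```python
def open_ome_zarr(store_path, mode="r", version="0.4"):
    """Stand-in illustrating a read-only gate for legacy NGFF v0.1 stores.

    Assumption (from the discussion above): only the new version is
    writable, while v0.1 stays readable so existing data remains usable
    and users are nudged toward v0.4.
    """
    if version == "0.1" and mode != "r":
        raise ValueError(
            "NGFF v0.1 stores are read-only; convert to v0.4 to write."
        )
    # ... open the underlying zarr store here ...
    return {"path": store_path, "mode": mode, "version": version}
```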
Alternatively, does it make more sense to convert the v0.1 format to the v0.4 format?
I added my input re: TIFF format to #53. I think we should release iohub 0.1.0 once we have unified the API for ome-zarr format and have converters in place.
> Alternatively, does it make more sense to convert the v0.1 format to the v0.4 format?
This can be implemented and the conversion will be mostly lossless (unless NGFF removed any fields along the way that I don't know of).
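To make the conversion concrete: the main metadata differences are that v0.4 `multiscales` entries carry an `axes` list and per-dataset `coordinateTransformations`, which v0.1 lacks. A sketch of upgrading the `.zattrs` dict, assuming TCZYX axes; the field names follow the NGFF spec, but the function itself is illustrative, not iohub code:

```python
def upgrade_multiscales_v01_to_v04(attrs, scales):
    """Upgrade an NGFF v0.1 'multiscales' .zattrs dict to v0.4 in place.

    `scales` maps each dataset path (e.g. "0") to its TCZYX scale
    factors. Axes are assumed to be TCZYX for this sketch.
    """
    for ms in attrs["multiscales"]:
        ms["version"] = "0.4"
        # v0.4 requires explicit axes; v0.1 had none.
        ms["axes"] = [
            {"name": "t", "type": "time"},
            {"name": "c", "type": "channel"},
            {"name": "z", "type": "space"},
            {"name": "y", "type": "space"},
            {"name": "x", "type": "space"},
        ]
        # v0.4 requires coordinateTransformations per dataset.
        for ds in ms["datasets"]:
            ds["coordinateTransformations"] = [
                {"type": "scale", "scale": scales[ds["path"]]}
            ]
    return attrs
```

The array chunks themselves would not need to be rewritten, which is why the conversion can be mostly lossless.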
We reached consensus during today's meeting about the UAPI design.
This will likely only involve minimal change on the NGFF side.
My understanding of the consensus we reached today is that multi-position datasets are fundamentally different from single-position datasets (at least in the eyes of the ome-zarr spec), so these two cases should have separate APIs. @ziw-liu can you expand on what this means for the planned user-facing APIs and the classes in iohub?
Are you still planning an `imread` function that returns different things depending on the input? Or are you imagining two functions like `read_fov` and `read_collection`?
@JoOkuma @edyoshikun and I discussed a possible solution where `read_fov` returns a 5-dimensional array, while `read_collection` returns a 6-dimensional array, even if you pass it a path to a single FOV. Would this work?
> `read_collection` returns a 6-dimensional array, even if you pass it a path to a single FOV.
As we discussed in #31 and during today's meeting, 'position' is not an intrinsic spatial dimension of an image tensor, but a level where arbitrary metadata that may or may not be ordered is associated with an image tensor based on the specific experiment. A TCZYX 5D tensor already has all 3 spatial dimensions. Therefore we lack a clear definition of the sequence to map image tensors in a collection to a predictable ordered list. IMO it is better to be verbose and deterministic rather than relying on implementation details or even randomness to access 'the 42nd FOV' in a dataset.
An alternative way of implementing `read_collection` is to have it always return a mapping of FOVs. So one can always do:
```python
for name, fov in read_collection("some.zarr").items():
    assert isinstance(name, str)
    assert isinstance(fov, FOVBase)
```
But the user would know explicitly that the sequence in which they get the items is not guaranteed to be meaningful.
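A minimal sketch of such a mapping-style collection, assuming a `Mapping` subclass; `FOVBase` here is a placeholder for iohub's abstract FOV type, and the in-memory store is purely illustrative:

```python
from collections.abc import Mapping

class FOVBase:
    """Placeholder for iohub's abstract FOV type."""
    def __init__(self, name):
        self.name = name

class FOVCollection(Mapping):
    """A collection that is explicitly a mapping of named FOVs:
    iteration order carries no guaranteed experimental meaning."""

    def __init__(self, names):
        self._fovs = {name: FOVBase(name) for name in names}

    def __getitem__(self, key):
        return self._fovs[key]

    def __iter__(self):
        return iter(self._fovs)

    def __len__(self):
        return len(self._fovs)

# Usage: iterate over named FOVs instead of assuming a positional order.
collection = FOVCollection(["A/1/0", "A/2/0", "B/1/0"])
for name, fov in collection.items():
    assert isinstance(name, str)
    assert isinstance(fov, FOVBase)
```

Subclassing `collections.abc.Mapping` gives `.items()`, `.keys()`, `.get()`, and `in` for free from the three required methods, so the user-facing API stays small.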
@ziw-liu I'm happy with being able to iterate as in your code above. @talonchandler does your PTCZYX interface need P as a numpy-like axis, or is an iterable fine?
For the FOV case, `name` could always be `fov`, just to make it consistent.
Question: are the positions always saved in arbitrary order? They could follow lexicographic or any other ordering we pick, but I'm guessing some microscopes might not write in that order.
> `read_collection` returns a 6-dimensional array, even if you pass it a path to a single FOV.
I think "6-dimensional array" was the wrong word choice...what I had in mind is what @ziw-liu wrote:
"An alternative way of implementing read_collection is to have it always return a mapping of FOVs"
so that you can always get an iterable over positions, even if you pass in a single FOV.
> @talonchandler does your PTCZYX interface needs P as a numpy-like axis or being an iterable is fine?
An iterable like what @ziw-liu is suggesting works fine for me.
I like @ziw-liu's suggestion to have it as an iterable. Just to make sure: when we say 'pass a single FOV', does this mean the zarr contains only one level, like the tiled zarrs, and is it equivalent to passing the path to a single FOV, i.e. `/datastore.zarr/0/0/0`?
Interface specification happening in #132!
@AhmetCanSolak Can you append a task list to OP? Something like:

TASKLIST:

done @ziw-liu
This deserves a dedicated discussion IMHO, that's why I'm creating a separate issue here.
Moving forward I think we should find ways to engineer a solution that provides a more unified API to the end user. Particularly for adoption of iohub across the Biohub, I think this is crucial. Providing the list of aforementioned operations (by @talonchandler) on each dataset is a great start. But I think there is more we can do. Like:

- Can we provide a functional entry point similar to `skimage.io` and help scripting users to adopt iohub quicker?

This is merely a starting point for universal entry point discussions, looking forward to reading everyone's input.