czbiohub-sf / iohub

Pythonic and parallelizable I/O for N-dimensional imaging data with OME metadata
https://czbiohub-sf.github.io/iohub/
BSD 3-Clause "New" or "Revised" License

Universal entry points #40

Open · AhmetCanSolak opened 1 year ago

AhmetCanSolak commented 1 year ago

TASKLIST:



> At this point I don't think a universal entry point for all the formats is strictly necessary.

I think that a limited version of a "universal entry point" for internal Biohub formats is one of the core value propositions of iohub. If we require a user to know their data type and its corresponding API to access data/metadata, then I think adoption of iohub will be limited.

The current version of WaveorderReader in waveorder has been valuable to me because it serves as a universal entry point for compmicro's work. For example, when we started reading pycromanager datasets, @ieivanov added the PycromanagerReader class and adapted WaveorderReader to read these datasets with the existing API. I could continue to run WaveorderReader.shape and WaveorderReader.get_array with predictable results, and any unpredictable results were considered bugs and fixed.

I understand that there may be difficulty in the details. For example, PycromanagerReader.get_zarr might be incorrectly/poorly named, and some dataset types might support different lazy-loading operations (maybe .get_zarr is not one of the universal entry points?). But I still think that all of our datasets share some common properties and operations that users will want to perform regularly. I think this list of operations consists of

  • what is the shape of the data in this dataset? (.shape)
  • what is its basic metadata, if available? (.channel_names, .z_step_size, .dtype, etc.)
  • what are the actual values in this dataset? (.get_array)

Can we have a universal entry point that supports these operations?

Originally posted by @talonchandler in https://github.com/czbiohub/iohub/issues/31#issuecomment-1419560365

This deserves a dedicated discussion IMHO, which is why I'm creating a separate issue here.

Moving forward, I think we should find ways to engineer a solution that provides a more unified API to the end user. Particularly for the adoption of iohub across the Biohub, I think this is crucial. Providing the list of operations mentioned above (by @talonchandler) on each dataset is a great start, but I think there is more we can do, such as:

This is merely a starting point for the universal entry point discussion; looking forward to reading everyone's input.
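For concreteness, the operations @talonchandler listed above could be captured by an interface like the following sketch. UniversalReader and its exact signatures are hypothetical illustrations, not iohub's actual API:

from abc import ABC, abstractmethod

import numpy as np


class UniversalReader(ABC):
    # Hypothetical minimal surface that every format-specific reader
    # could share, regardless of the on-disk layout.

    @property
    @abstractmethod
    def shape(self) -> tuple:
        """Shape of the data, e.g. (T, C, Z, Y, X)."""

    @property
    @abstractmethod
    def channel_names(self) -> list:
        """Basic metadata, where available."""

    @abstractmethod
    def get_array(self, position: int) -> np.ndarray:
        """The actual pixel values for one position."""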

mattersoflight commented 1 year ago

We do need a unified reader module for the diverse datasets we have, which may or may not be a universal entry point. We have accumulated significant technical debt in the form of a diversity of data formats. One way to prevent further accumulation of technical debt is to read the legacy formats (which typically vary in terms of the hierarchy of groups, order of dimensions in ND arrays, and metadata format), but write only with the new writer implemented in iohub.

@ziw-liu after #31, let's discuss how to integrate the readers implemented in waveOrder and dexp into iohub.

We will need to divide and conquer the work of adding support for the different formats.

ziw-liu commented 1 year ago

@AhmetCanSolak Thanks for moving this discussion over.

I also mentioned the possibility of something like skimage.io.imread during an offline meeting. And I think this functional entry point may alleviate some of the OOP overhead/weirdness we currently have in the code base. With limited refactoring it can be a function that does the format guessing/dispatching and returns a format-specific reader object (whose type is a child of the abstract ReaderBase that provides a mostly uniform interface) as the WaveorderReader does, but without the extra layer of API delegation.
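As a rough illustration of that functional entry point, a dispatcher might look like the sketch below. The placeholder classes and suffix-based format guessing are assumptions for illustration, not the actual iohub implementation:

from pathlib import Path


class _ZarrReader:
    # Placeholder standing in for a format-specific subclass of ReaderBase.
    def __init__(self, path: Path):
        self.path = path


class _OmeTiffReader:
    # Placeholder standing in for a format-specific subclass of ReaderBase.
    def __init__(self, path: Path):
        self.path = path


def imread(path: str):
    # Guess the on-disk format and return a format-specific reader
    # directly, without an extra layer of API delegation.
    p = Path(path)
    if p.suffix == ".zarr":
        return _ZarrReader(p)
    if p.suffix in (".tif", ".tiff"):
        return _OmeTiffReader(p)
    raise ValueError(f"Cannot determine the format of {p}")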

For the 'universal dataset class' I feel that it cannot and probably should not be feature-complete for the heterogeneous format space we have. And there is a lot to learn from aicsimageio, which attempts to provide a universal interface to many more biomedical imaging data formats.

ziw-liu commented 1 year ago

I've started working on the imread idea in #71. Still WIP but early-stage input is very welcome!

talonchandler commented 1 year ago

@ziw-liu and I just met to discuss our plan for imread, especially wrt #90. Should imread provide a universal entry point that returns different objects depending on the underlying file?

We're both concerned that this will lead to an API that is destined to change. If imread is a thin wrapper on the old reader.__init__ and the new open_ome_zarr, then iohub's users will need to handle both of these cases which removes most of the value of having a universal API.

We discussed the pathway to a universal API, and we agreed that there's a lot of engineering work remaining. @ziw-liu expects the ReaderBase class will need to be rewritten, and all of the child classes will need to be reworked before we can have an API that is anywhere close to universal.

Given these constraints and our release timeline, we're now leaning towards having iohub info be our human-readable "universal entry point". We can quite easily make iohub info work for every dataset, and in the iohub info output we can print two lines of code that show the user how to open the dataset in question. For example,

iohub info my.ome.tif

would print

***shape info here***

Open this data set with the following lines:
$ python
>>> from iohub.reader import Reader
>>> dataset = Reader('./my.ome.tif')
>>> tczyx_array = dataset.get_position(0) # or possibly an iteration through positions

while running

iohub info my.ome.zarr

would print

***shape info here***

Open this data set with:
$ python
>>> from iohub.ngff import open_ome_zarr
>>> dataset = open_ome_zarr('./my.ome.zarr')
>>> tczyx_array = dataset["0"] # or possibly an iteration through positions

In many ways this solution is a band-aid. But I think it provides the shortest and simplest path for users to (1) see if iohub will work for their dataset, and (2) find the relevant usage API that lets them access their data in a script/package. I also think it is an appropriate solution for a 0.1 release that is attempting to unify a large existing codebase already supporting many use cases.
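For illustration, the format-specific hints could be wired up with a simple lookup table. The names here (_USAGE_HINTS, print_open_hint, the format keys) are hypothetical, not iohub's CLI internals:

# Hypothetical mapping from detected format to the usage snippet
# that `iohub info` would print after the shape summary.
_USAGE_HINTS = {
    "ome-tiff": (
        ">>> from iohub.reader import Reader\n"
        ">>> dataset = Reader('./my.ome.tif')"
    ),
    "ome-zarr": (
        ">>> from iohub.ngff import open_ome_zarr\n"
        ">>> dataset = open_ome_zarr('./my.ome.zarr')"
    ),
}


def print_open_hint(detected_format: str) -> None:
    # Print the two-line snippet showing how to open this dataset.
    print("Open this data set with:")
    print(_USAGE_HINTS[detected_format])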

AhmetCanSolak commented 1 year ago

This and #92 would be great to discuss at the next iohub meeting; I will add them to the agenda.

mattersoflight commented 1 year ago

Hi @ziw-liu, how about we divide this issue into two: one to unify the API for zarr formats and the other to unify the API for tiff formats?

This specific issue can focus on unifying the array-like read/write API for zarr formats, i.e., ome-zarr v0.1 and v0.4. Once this closes, we can move forward with the first pip installable iohub.

All of the tiff-related issues can be addressed in the release to follow.

PS: for custom zarr formats (e.g. PTI), I think it is sufficient to add converters as needed.

ziw-liu commented 1 year ago

> This specific issue can focus on unifying the array-like read/write API for zarr formats, i.e., ome-zarr v0.1 and v0.4.

My understanding was that we intentionally do not support writing NGFF v0.1 to promote transition to the new version. Is that still the plan?

ziw-liu commented 1 year ago

> Hi @ziw-liu, how about we divide this issue into two: one to unify the API for zarr formats and the other to unify the API for tiff formats?

Yeah, that breaks things down a bit. I suggest we use #53 for TIFF. I already posted a prototype example there; more discussion is welcome!

mattersoflight commented 1 year ago

> My understanding was that we intentionally do not support writing NGFF v0.1 to promote transition to the new version. Is that still the plan?

Yes! I am thinking that open_ome_zarr(format=v0.1) supports only read-only mode.
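A minimal sketch of how such a guard could look, assuming a hypothetical format keyword; this is not the shipped open_ome_zarr signature:

def open_ome_zarr(store_path: str, mode: str = "r", format: str = "0.4"):
    # Hypothetical guard: NGFF v0.1 stores may only be opened read-only,
    # to promote transition to v0.4 for all new writes.
    if format == "0.1" and mode != "r":
        raise ValueError(
            "Writing NGFF v0.1 is not supported; "
            "open with mode='r' or convert the store to v0.4."
        )
    # ... dispatch to the version-specific reader/writer here ...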

Alternatively, does it make more sense to convert the v0.1 format to the v0.4 format?

mattersoflight commented 1 year ago

I added my input re: TIFF format to #53. I think we should release iohub 0.1.0 once we have unified the API for ome-zarr format and have converters in place.

ziw-liu commented 1 year ago

> Alternatively, does it make more sense to convert the v0.1 format to the v0.4 format?

This can be implemented and the conversion will be mostly lossless (unless NGFF removed any fields along the way that I don't know of).
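As a sketch of what that conversion could look like with the zarr-python API; the metadata handling here is simplified, and the axis assumption (implicit TCZYX in v0.1) is an illustration:

import zarr


def convert_v01_to_v04(src_path: str, dst_path: str) -> None:
    # Copy every array from an NGFF v0.1 store into a fresh store and
    # write v0.4-style multiscales metadata. A real converter would also
    # translate omero and coordinate-transformation metadata faithfully.
    src = zarr.open_group(src_path, mode="r")
    dst = zarr.open_group(dst_path, mode="w")
    for name, array in src.arrays():
        dst.create_dataset(name, data=array[:], chunks=array.chunks)
    dst.attrs["multiscales"] = [
        {
            "version": "0.4",
            # v0.4 requires named axes; v0.1 assumed TCZYX implicitly.
            "axes": [{"name": ax} for ax in "tczyx"],
            "datasets": [{"path": name} for name, _ in src.arrays()],
        }
    ]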

ziw-liu commented 1 year ago

We reached consensus during today's meeting about the UAPI design.

This will likely only involve minimal change on the NGFF side.

talonchandler commented 1 year ago

My understanding of the consensus we reached today is that multi-position datasets are fundamentally different from single-position datasets (at least in the eyes of the ome-zarr spec), so these two cases should have separate APIs. @ziw-liu can you expand on what this means for the planned user-facing APIs and the classes in iohub?

Are you still planning an imread function that returns different things depending on the input? Or are you imagining two functions like read_fov and read_collection?

@JoOkuma @edyoshikun and I discussed a possible solution where read_fov returns a 5-dimensional array, while read_collection returns a 6-dimensional array, even if you pass it a path to a single FOV. Would this work?

ziw-liu commented 1 year ago

> read_collection returns a 6-dimensional array, even if you pass it a path to a single FOV.

As we discussed in #31 and during today's meeting, 'position' is not an intrinsic spatial dimension of an image tensor, but a level at which arbitrary, possibly unordered metadata is associated with an image tensor, depending on the specific experiment. A TCZYX 5D tensor already has all 3 spatial dimensions, so we lack a clear definition of the sequence that would map the image tensors in a collection to a predictable ordered list. IMO it is better to be verbose and deterministic rather than relying on implementation details or even randomness to access 'the 42nd FOV' in a dataset.

An alternative way of implementing read_collection is to have it always return a mapping of FOVs. So one can always do:

for name, fov in read_collection("some.zarr").items():
    assert isinstance(name, str)
    assert isinstance(fov, FOVBase)

The user would know explicitly, however, that the order in which they get the items is not guaranteed to be meaningful.
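If a caller does need a reproducible order, that choice can be made explicitly on their side, for example by sorting the mapping's keys (building on the hypothetical read_collection above):

# Deterministic iteration is the caller's choice, not a format guarantee:
collection = read_collection("some.zarr")
for name in sorted(collection):
    fov = collection[name]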

JoOkuma commented 1 year ago

@ziw-liu I'm happy with being able to iterate as in your code above. @talonchandler does your PTCZYX interface need P as a numpy-like axis, or is an iterable fine?

For the single-FOV case, name could always be "fov", just to make it consistent.

Question: Are the positions always saved in arbitrary order? It could follow a lexicographic or any other ordering we pick, but I'm guessing some microscopes might not write according to this.

talonchandler commented 1 year ago

> read_collection returns a 6-dimensional array, even if you pass it a path to a single FOV.

I think "6-dimensional array" was the wrong word choice...what I had in mind is what @ziw-liu wrote:

"An alternative way of implementing read_collection is to have it always return a mapping of FOVs"

so that you can always get an iterable over positions, even if you pass in a single FOV.

> @talonchandler does your PTCZYX interface need P as a numpy-like axis, or is an iterable fine?

An iterable like what @ziw-liu is suggesting works fine for me.

edyoshikun commented 1 year ago

I like @ziw-liu's suggestion to have it as an iterable. Just to make sure: when we say "pass a single FOV", does that mean the zarr contains only one level (like the tiled zarrs), and is it equivalent to passing the path to a single FOV, i.e. /datastore.zarr/0/0/0?

ziw-liu commented 1 year ago

Interface specification happening in #132!

@AhmetCanSolak Can you append a task list to the OP? Something like:

AhmetCanSolak commented 1 year ago

done @ziw-liu