Expand TiffGlobReader API to work for all file formats

AllenCellModeling / aicsimageio

Image Reading, Metadata Conversion, and Image Writing for Microscopy Images in Python

https://allencellmodeling.github.io/aicsimageio

Other

204 stars 51 forks source link

Expand TiffGlobReader API to work for all file formats #166

Closed evamaxfield closed 1 year ago

evamaxfield commented 3 years ago

Use Case

Please provide a use case to help us understand your request in context

A lot of imaging datasets are stored as a single acquisition split into many individual files. For example: a bunch of TIFF image images representing a single acquisiton but each TIFF is a multi-channel YX only and each TIFF in the glob is a different Z plane.

Solution

Please describe your ideal solution

Create a GlobReader class that allows for reading /some/dir/*.tiff or /some/dir/*.czi or similar and add a parameter such as: glob_dimension to set which dimension to act as the file.

from aicsimageio.readers import GlobReader

r = GlobReader("my-file.*.tiff", glob_dimension="Z")
r  # repr is <GlobReader (TiffReader) [in-memory: False, glob_dimension: Z]>

Alternatives

Please describe any alternatives you've considered, even if you've dismissed them

These datasets will largely (hopefully) be replaced by NGFF (i.e. Zarr) but would be great to handle them in general too.

jrussell25 commented 3 years ago

Hi there, I see that this has been open for a while but if there is still interest in supporting this in aicsimageio I have some code that I have been using to do something similar. Mine is maybe a little more general than the original issue suggests in these sense that it can handle globs for arbitrary dimensional data. The function takes a dataframe (which i generate with something like pd.Dataframe({'filename':sorted(glob.glob(...))})) and what I call idx_mapper which provides the position in the final array that each single frame file will end up occupying. This can be a callable (usually a small wrapper around re.split) or just another dataframe. It then uses pandas groupby to chunk by any of the dimensions in the dataset.

https://github.com/Hekstra-Lab/microutil/blob/main/microutil/leica/leica.py#L274

The original version I made is highly similar but specifically handles old style micromanager images.

https://github.com/Hekstra-Lab/microutil/blob/main/microutil/loading.py#L125

Let me know if you are interested in using this approach and I can un-hardcode stuff and make an initial PR.

evamaxfield commented 3 years ago

Hey @jrussell25!

I would be thrilled to get a version like yours in!

Two clarifying questions:

Do you have quick example of usage of your current functions, just the like three lines of generate dataframe -> function -> print out dask shape + chunk shape?
I would assume that a small example would explain this for me but can you specifically clarifying this:

It then uses pandas groupby to chunk by any of the dimensions in the dataset?

But yes, if you would like to make a PR we would glady take this in.

cc @toloudis

toloudis commented 3 years ago

A couple of miscellaneous thoughts... (and yes a PR would be great.)

I think GlobReader sounds about like the right way to do it. An alternative would be to let the individual readers accept glob strings and have some base class support functions.

There are some implementation details I would want to capture, to make sure the API is sane and behaves in an easy to understand way.

All the files must be of the same type, and that type is determined either by passing an arg in, or by the first file that matches the pattern
Suppose glob_dimension is Z but then we have a series of 3d tiffs. The internals need to negotiate all that and figure out the proper dims to come back.
Outside users don't need to know whether it's one file or many (after initialization)

jrussell25 commented 3 years ago

Cool! Thanks for the interest. Here is a snippet that generates some fake data and then loads it with this function.

import re
import os
import pandas as pd
import numpy as np
import tifffile as tiff
import microutil as mu #current home of this function

# Generate fake data
try:
    os.mkdir("path_to_data")
except FileExistsError: pass

indices = np.indices((2,3,4,5)).reshape(4,-1).T
for s,t,c,z in indices:
    tiff.imsave(f"path_to_data/data_pos{s}_t{t}_ch{c}_z{z}.tif", np.random.randint(0,256, size=(64,64), dtype='uint8'))

#generate the dataframe with filenames from a glob
df = pd.DataFrame({'filename':sorted(glob.glob('path_to_data/*.tif'))})
# function to parse the filenames into indices
indexer = lambda x: pd.Series(re.findall(r"\d+",x.filename.split('/')[-1]), index=list("STCZ")).astype(int)

fake_data = mu.leica.load_leica_frames(df, indexer, chunkby_dims='CZ')

print(f"{fake_data.shape=}") #Should be (2,3,4,5,64,64)
print(f"{fake_data.data.chunksize=}") #Should be (1,1,4,5,64,64)

jrussell25 commented 3 years ago

@JacksonMaxfield this builds off the last snippet to address your question 2. Its a lightly improved version of the main loop

df = df.join(df.apply(indexer, axis=1, result_type='expand'))
chunkby_dims='CZ'

group_dims = [x for x in df.columns[1:] if x not in chunkby_dims]
cutoffs = df.groupby(group_dims).nunique().min().drop('filename')
df = df.loc[(df.loc[:, list(chunkby_dims)] < cutoffs).all('columns')]
chunks = np.zeros(df[group_dims].nunique().values, dtype='object')

for idx, val in df.groupby(group_dims):
    # by default each file will be in its own chunk so rechunk them together here
    darr = da.from_zarr(tiff.imread(val.filename.tolist(), aszarr=True)).rechunk(-1) 
    shape = tuple(x for i, x in cutoffs.iteritems() if i in chunkby_dims) + darr.shape[-2:]
    darr = darr.reshape(shape)
    chunks[idx] = darr

chunks = np.expand_dims(chunks, tuple(range(-1, -len(chunkby_dims) - 3, -1)))
d_data = da.block(chunks.tolist())

So the idea is that by using a groupby on everything except the chunkby_dims those dimensions remain in separate chunks after the call to da.block. As I tried to comment in there, the default behavior of da.from_zarr(tiff.imread(...)) is to put each separate file in its own chunk so to force something else you rechunk inside the loop. Let me know if this still isnt clear.

evamaxfield commented 3 years ago

This is great @jrussell25!

I think for the sake of defaults I would love to see basically this design but just with some nicer API defaults.

For example, if the user gives an iterable of string or path then the function should just dump that iterable of string to the dataframe for the user. Same with the indexer. We should just use the one you use in your example as the default indexer and document as "Default 'GlobReader' will parse indexes with a S_{s_index}_T_{t_index}_C_{c_index}_Z_{z_index} matcher."

Finally some minor differences between this and how AICSImage Readers work is that our scenes are stateful. So I think storing the dataframe of the file list is fine but then to get "stateful" scenes just group by the S_index.

My only other question is how to handle metadata. Do we read all the metadatas? Do we read just one? Do we "average" the metadata?

Other than that, this looks great!

jrussell25 commented 3 years ago

Hi, i just opened a preliminary PR for this, sorry it took a while. as I mentioned there i have some questions Im not sure about for getting GlobReader fully ready to go. But I thought it made sense to get some more eyes on it before doing more.

Questions

How should image specific metadata be handled (especially in the case that it differs among the files)?
In what case (if any) should AICSImage guess that the user wants use glob reader?
What is the most complex sort of single file that needs to be able to be read and assembled by GlobReader? this may be mostly for testing/documentation purposes since I have really only used this idea for sets of 2D imagefiles.
Is it worth creating an example of how to construct an indexer? (Unsurprisingly) it makes sense to me but I find it a little tricky to explain in words (i.e. the docstring) without an example.

evamaxfield commented 3 years ago

These are great questions here are my takes but I am sure @toloudis will have opinions as well:

How should image specific metadata be handled (especially in the case that it differs among the files)?

There is a part of me that says "metadata should be returned as a List[Any] or depending on if this is a list of pure tiff images then List[ElementTree] or if it is a list of OME-TIFFs then List[OME].... But I could also see this working as "just choose the first one... Maybe we can ask the community what they want? I can easily ping the napari + OME teams for input.

In what case (if any) should AICSImage guess that the user wants use glob reader?

Now this is a great question I hadn't considered. I would guess by types or by content of type. I see in the PR that it accepts Union[str, List[str]] so I would assume AICSImage should use GlobReader if the string contains a * (wildcard) character or if it is a List[str]... I don't think we have any other readers that accept a List[str]

What is the most complex sort of single file that needs to be able to be read and assembled by GlobReader? this may be mostly for testing/documentation purposes since I have really only used this idea for sets of 2D imagefiles.

Again we should probably ask the community? But off the top off my head I would say it would be something like "Glob dimensions are Scene and Z? So each file contains all TCYX? I just think a scene glob dimension will be the weird one for us. Idk, we may just want to test all permutations of non YX dims. Or just a subset anyway.

Is it worth creating an example of how to construct an indexer? (Unsurprisingly) it makes sense to me but I find it a little tricky to explain in words (i.e. the docstring) without an example.

Absolutely. This would be a wonderful addition(s).

I also really can't understate how excited I am about this addition. It's both a huge feature and a major contribution so thank you so much.

toloudis commented 3 years ago

How should image specific metadata be handled (especially in the case that it differs among the files)?

There is a part of me that says "metadata should be returned as a List[Any] or depending on if this is a list of pure tiff images then List[ElementTree] or if it is a list of OME-TIFFs then List[OME].... But I could also see this working as "just choose the first one... Maybe we can ask the community what they want? I can easily ping the napari + OME teams for input.

This is tricky. We are loading into a single AICSImage object and from that API it will look like a single image. It would be amazing if we could generate one single OME from the metadata but it does seem unrealistic. I guess List[native metadata object] is a good way to go but then the list really is one per file in the glob. Users who have an AICSImage will have to know that they are working with a glob. Dims, channel names etc will have to be assembled so that they are "Correct" from the perspective of the AICSImage api, treating the aggregate as a single image.

In what case (if any) should AICSImage guess that the user wants use glob reader?

Now this is a great question I hadn't considered. I would guess by types or by content of type. I see in the PR that it accepts Union[str, List[str]] so I would assume AICSImage should use GlobReader if the string contains a * (wildcard) character or if it is a List[str]... I don't think we have any other readers that accept a List[str]

I agree that the presence of wildcard characters or list of files, is enough to go to GlobReader.

What is the most complex sort of single file that needs to be able to be read and assembled by GlobReader? this may be mostly for testing/documentation purposes since I have really only used this idea for sets of 2D imagefiles.

Again we should probably ask the community? But off the top off my head I would say it would be something like "Glob dimensions are Scene and Z? So each file contains all TCYX? I just think a scene glob dimension will be the weird one for us. Idk, we may just want to test all permutations of non YX dims. Or just a subset anyway.

Covering all permutations of Scene, Z, C, T is a good idea for generality. There might be some combinations that make less sense. If there is an easier thing that would enable 80% of users, then maybe it makes sense to do that (e.g. support only single glob dim or permutations of 2 glob dims)

evamaxfield commented 3 years ago

This is tricky. We are loading into a single AICSImage object and from that API it will look like a single image. It would be amazing if we could generate one single OME from the metadata but it does seem unrealistic. I guess List[native metadata object] is a good way to go but then the list really is one per file in the glob. Users who have an AICSImage will have to know that they are working with a glob. Dims, channel names etc will have to be assembled so that they are "Correct" from the perspective of the AICSImage api, treating the aggregate as a single image.

@toloudis I think the way to do it would be to do List[native object] and then the ome_metadata property does the single construction. Since we have the API already to do "some sort of metadata processing / translation" that makes the most sense to me.

jrussell25 commented 3 years ago

Sorry ive been off this for so long again. the multidimensional case took me a bit longer than i thought. Im not totally sure what native_object would be here. But also im doing all of this with my fake tiffs from tifffile that dont have much metadata. Anyway i was looking at how TiffReader handled one of my files and it just has something like '{"shape": [6, 7]}' so is the thinking just make a list of that for each file? Or do we want all the tifftags?

I also currently have not implemented ome_metadata so it might take some more work to make that happen.

evamaxfield commented 3 years ago

Reopening as "GlobReader for all file types"

TiffGlobReader was added by @jrussell25. Thanks again!

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.