Closed evamaxfield closed 1 year ago
Hi there, I see that this has been open for a while but if there is still interest in supporting this in aicsimageio
I have some code that I have been using to do something similar. Mine is maybe a little more general than the original issue suggests in these sense that it can handle globs for arbitrary dimensional data. The function takes a dataframe (which i generate with something like pd.Dataframe({'filename':sorted(glob.glob(...))})
) and what I call idx_mapper
which provides the position in the final array that each single frame file will end up occupying. This can be a callable (usually a small wrapper around re.split
) or just another dataframe. It then uses pandas groupby to chunk by any of the dimensions in the dataset.
https://github.com/Hekstra-Lab/microutil/blob/main/microutil/leica/leica.py#L274
The original version I made is highly similar but specifically handles old style micromanager images.
https://github.com/Hekstra-Lab/microutil/blob/main/microutil/loading.py#L125
Let me know if you are interested in using this approach and I can un-hardcode stuff and make an initial PR.
Hey @jrussell25!
I would be thrilled to get a version like yours in!
Two clarifying questions:
It then uses pandas groupby to chunk by any of the dimensions in the dataset?
But yes, if you would like to make a PR we would glady take this in.
cc @toloudis
A couple of miscellaneous thoughts... (and yes a PR would be great.)
I think GlobReader sounds about like the right way to do it. An alternative would be to let the individual readers accept glob strings and have some base class support functions.
There are some implementation details I would want to capture, to make sure the API is sane and behaves in an easy to understand way.
glob_dimension
is Z but then we have a series of 3d tiffs. The internals need to negotiate all that and figure out the proper dims to come back. Cool! Thanks for the interest. Here is a snippet that generates some fake data and then loads it with this function.
import re
import os
import pandas as pd
import numpy as np
import tifffile as tiff
import microutil as mu #current home of this function
# Generate fake data
try:
os.mkdir("path_to_data")
except FileExistsError: pass
indices = np.indices((2,3,4,5)).reshape(4,-1).T
for s,t,c,z in indices:
tiff.imsave(f"path_to_data/data_pos{s}_t{t}_ch{c}_z{z}.tif", np.random.randint(0,256, size=(64,64), dtype='uint8'))
#generate the dataframe with filenames from a glob
df = pd.DataFrame({'filename':sorted(glob.glob('path_to_data/*.tif'))})
# function to parse the filenames into indices
indexer = lambda x: pd.Series(re.findall(r"\d+",x.filename.split('/')[-1]), index=list("STCZ")).astype(int)
fake_data = mu.leica.load_leica_frames(df, indexer, chunkby_dims='CZ')
print(f"{fake_data.shape=}") #Should be (2,3,4,5,64,64)
print(f"{fake_data.data.chunksize=}") #Should be (1,1,4,5,64,64)
@JacksonMaxfield this builds off the last snippet to address your question 2. Its a lightly improved version of the main loop
df = df.join(df.apply(indexer, axis=1, result_type='expand'))
chunkby_dims='CZ'
group_dims = [x for x in df.columns[1:] if x not in chunkby_dims]
cutoffs = df.groupby(group_dims).nunique().min().drop('filename')
df = df.loc[(df.loc[:, list(chunkby_dims)] < cutoffs).all('columns')]
chunks = np.zeros(df[group_dims].nunique().values, dtype='object')
for idx, val in df.groupby(group_dims):
# by default each file will be in its own chunk so rechunk them together here
darr = da.from_zarr(tiff.imread(val.filename.tolist(), aszarr=True)).rechunk(-1)
shape = tuple(x for i, x in cutoffs.iteritems() if i in chunkby_dims) + darr.shape[-2:]
darr = darr.reshape(shape)
chunks[idx] = darr
chunks = np.expand_dims(chunks, tuple(range(-1, -len(chunkby_dims) - 3, -1)))
d_data = da.block(chunks.tolist())
So the idea is that by using a groupby on everything except the chunkby_dims
those dimensions remain in separate chunks after the call to da.block
. As I tried to comment in there, the default behavior of da.from_zarr(tiff.imread(...))
is to put each separate file in its own chunk so to force something else you rechunk inside the loop. Let me know if this still isnt clear.
This is great @jrussell25!
I think for the sake of defaults I would love to see basically this design but just with some nicer API defaults.
For example, if the user gives an iterable of string or path then the function should just dump that iterable of string to the dataframe for the user. Same with the indexer. We should just use the one you use in your example as the default indexer and document as "Default 'GlobReader' will parse indexes with a S_{s_index}_T_{t_index}_C_{c_index}_Z_{z_index}
matcher."
Finally some minor differences between this and how AICSImage Readers work is that our scenes are stateful. So I think storing the dataframe of the file list is fine but then to get "stateful" scenes just group by the S_index.
My only other question is how to handle metadata. Do we read all the metadatas? Do we read just one? Do we "average" the metadata?
Other than that, this looks great!
Hi, i just opened a preliminary PR for this, sorry it took a while. as I mentioned there i have some questions Im not sure about for getting GlobReader
fully ready to go. But I thought it made sense to get some more eyes on it before doing more.
Questions
AICSImage
guess that the user wants use glob reader?GlobReader
? this may be mostly for testing/documentation purposes since I have really only used this idea for sets of 2D imagefiles.These are great questions here are my takes but I am sure @toloudis will have opinions as well:
- How should image specific metadata be handled (especially in the case that it differs among the files)?
There is a part of me that says "metadata should be returned as a List[Any]
or depending on if this is a list of pure tiff
images then List[ElementTree]
or if it is a list of OME-TIFFs then List[OME]
.... But I could also see this working as "just choose the first one... Maybe we can ask the community what they want? I can easily ping the napari + OME teams for input.
- In what case (if any) should AICSImage guess that the user wants use glob reader?
Now this is a great question I hadn't considered. I would guess by types or by content of type. I see in the PR that it accepts Union[str, List[str]]
so I would assume AICSImage
should use GlobReader
if the string contains a *
(wildcard) character or if it is a List[str]
... I don't think we have any other readers that accept a List[str]
- What is the most complex sort of single file that needs to be able to be read and assembled by GlobReader? this may be mostly for testing/documentation purposes since I have really only used this idea for sets of 2D imagefiles.
Again we should probably ask the community? But off the top off my head I would say it would be something like "Glob dimensions are Scene and Z? So each file contains all TCYX? I just think a scene glob dimension will be the weird one for us. Idk, we may just want to test all permutations of non YX dims. Or just a subset anyway.
- Is it worth creating an example of how to construct an indexer? (Unsurprisingly) it makes sense to me but I find it a little tricky to explain in words (i.e. the docstring) without an example.
Absolutely. This would be a wonderful addition(s).
I also really can't understate how excited I am about this addition. It's both a huge feature and a major contribution so thank you so much.
- How should image specific metadata be handled (especially in the case that it differs among the files)?
There is a part of me that says "metadata should be returned as a
List[Any]
or depending on if this is a list of puretiff
images thenList[ElementTree]
or if it is a list of OME-TIFFs thenList[OME]
.... But I could also see this working as "just choose the first one... Maybe we can ask the community what they want? I can easily ping the napari + OME teams for input.
This is tricky. We are loading into a single AICSImage object and from that API it will look like a single image. It would be amazing if we could generate one single OME from the metadata but it does seem unrealistic. I guess List[native metadata object] is a good way to go but then the list really is one per file in the glob. Users who have an AICSImage will have to know that they are working with a glob. Dims, channel names etc will have to be assembled so that they are "Correct" from the perspective of the AICSImage api, treating the aggregate as a single image.
- In what case (if any) should AICSImage guess that the user wants use glob reader?
Now this is a great question I hadn't considered. I would guess by types or by content of type. I see in the PR that it accepts
Union[str, List[str]]
so I would assumeAICSImage
should useGlobReader
if the string contains a*
(wildcard) character or if it is aList[str]
... I don't think we have any other readers that accept aList[str]
I agree that the presence of wildcard characters or list of files, is enough to go to GlobReader.
- What is the most complex sort of single file that needs to be able to be read and assembled by GlobReader? this may be mostly for testing/documentation purposes since I have really only used this idea for sets of 2D imagefiles.
Again we should probably ask the community? But off the top off my head I would say it would be something like "Glob dimensions are Scene and Z? So each file contains all TCYX? I just think a scene glob dimension will be the weird one for us. Idk, we may just want to test all permutations of non YX dims. Or just a subset anyway.
Covering all permutations of Scene, Z, C, T is a good idea for generality. There might be some combinations that make less sense. If there is an easier thing that would enable 80% of users, then maybe it makes sense to do that (e.g. support only single glob dim or permutations of 2 glob dims)
This is tricky. We are loading into a single AICSImage object and from that API it will look like a single image. It would be amazing if we could generate one single OME from the metadata but it does seem unrealistic. I guess List[native metadata object] is a good way to go but then the list really is one per file in the glob. Users who have an AICSImage will have to know that they are working with a glob. Dims, channel names etc will have to be assembled so that they are "Correct" from the perspective of the AICSImage api, treating the aggregate as a single image.
@toloudis I think the way to do it would be to do List[native object]
and then the ome_metadata
property does the single construction. Since we have the API already to do "some sort of metadata processing / translation" that makes the most sense to me.
Sorry ive been off this for so long again. the multidimensional case took me a bit longer than i thought. Im not totally sure what native_object
would be here. But also im doing all of this with my fake tiffs from tifffile
that dont have much metadata. Anyway i was looking at how TiffReader
handled one of my files and it just has something like '{"shape": [6, 7]}'
so is the thinking just make a list of that for each file? Or do we want all the tifftags?
I also currently have not implemented ome_metadata
so it might take some more work to make that happen.
Reopening as "GlobReader for all file types"
TiffGlobReader
was added by @jrussell25. Thanks again!
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Use Case
Please provide a use case to help us understand your request in context
A lot of imaging datasets are stored as a single acquisition split into many individual files. For example: a bunch of TIFF image images representing a single acquisiton but each TIFF is a multi-channel YX only and each TIFF in the glob is a different Z plane.
Solution
Please describe your ideal solution
Create a
GlobReader
class that allows for reading/some/dir/*.tiff
or/some/dir/*.czi
or similar and add a parameter such as:glob_dimension
to set which dimension to act as the file.Alternatives
Please describe any alternatives you've considered, even if you've dismissed them
These datasets will largely (hopefully) be replaced by NGFF (i.e. Zarr) but would be great to handle them in general too.