Tyelab / bruker_pipeline

Python implementation of data processing pipeline for data from the Bruker Ultima microscope in conjunction with bruker_control.

OME Tiff Information: Do we want to keep it? Also H5 Writer defaults balloon file size, see last comment #9

Open: jmdelahanty opened this issue 2 years ago

jmdelahanty commented 2 years ago

Bruker outputs OME.tiff files after the ripper has completed, but only the first image that's written is the OME.tif master file. The master file describes any additional channels that are recorded at the time. So if two channels were recorded at once, the only master file generated will be the one for Ch1, and it contains the information for any subsequent channels recorded during the session.

I'm asking questions in a Gitter chatroom for zarr developers, who have been helpful so far, but I haven't heard back about a couple of other new things yet.

I'm currently able to parse the OME xml properly but am still learning about the different image tag numbers that are used. There doesn't appear to be a clear explanation for what these different tags are or what other tag information is available in the OME-tiff master file. Here's what I've found out so far about the structure:

root: OME
    Image:
    - AcquisitionDate: Empty dictionary
    - Pixels: Dictionary of dimension order, physical pixel size in micrometers (I think...), number of channels, number of images per channel, dims of x/y/z, and bit depth of images
        Channel: IDs of channels encoded as Channel:1:0 being channel 1, Channel:1:1 being channel 2
        - TiffData: Channel, number of images gathered before this point, number of planes, mystery IFD value
            Filename: Bottom level of each channel ID tree, elements for all images that are generated
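
For reference, here's the kind of quick-and-dirty walk I've been using to see these tags (the path is a placeholder):

import xml.etree.ElementTree as ET

import tifffile

# Placeholder path to the first (master) OME tif of a recording
with tifffile.TiffFile("/path/to/master.ome.tif") as tif:
    ome_xml = tif.ome_metadata  # the OME XML string embedded in the master file

root = ET.fromstring(ome_xml)

# Print every element's tag (minus the OME namespace) and its attributes
# to see what the Image/Pixels/Channel/TiffData hierarchy contains
for element in root.iter():
    print(element.tag.split("}")[-1], element.attrib)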

As of 12/13/21, I haven't found a method for parsing an OME-tiff master file structured like this one that preserves the tiffs' metadata when writing chunked HDF5/zarr files.

Good news is that Deryn's MATLAB code using the h5write function does appear to know how to create chunks of data from the many tiffs! It will take some additional testing to make sure it's doing what we expect, but MATLAB may have thankfully solved this issue for us. Another piece of good news is that we can likely run multiple copies of MATLAB at once, because it looks like each conversion to H5 only uses 1 CPU (probably because it's I/O bound).

Unfortunately, it looks like MATLAB's h5write doesn't know how to keep the OME metadata. Not having the metadata doesn't seem to get in the way of any processing or sharing, and I'm currently unaware of whether NWB uses OME data or can store it somehow. I've opened a discussion here to find out more.

However the tiffs are processed, each channel should have its own chunked dataset with chunks of around 300MB, which works out to about 600 images each. This will let processing operate on smaller chunks without using so much memory for single planes/recordings and, especially important, make it easy to display data on people's desktops in the office. If one day multiple channels are recorded consistently (I've seen up to 6 at once in some forum posts!) and we do decide to keep the OME information for a given session, we'll probably want to get that data out of the xml and build chunked datasets automatically.
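
For a rough sense of the numbers, assuming (hypothetically) 512 x 512 uint16 frames: one frame is 512 * 512 * 2 bytes, about 0.5 MB, so 600 frames per chunk lands right around 300MB. A minimal sketch of what that layout could look like with h5py, with placeholder paths and dimensions:

import h5py

# Placeholder dimensions: one 512 x 512 uint16 frame is ~0.5 MB,
# so 600 frames per chunk is ~300 MB
n_frames, height, width = 30000, 512, 512
frames_per_chunk = 600

with h5py.File("/path/to/output_ch1.h5", "w") as f:
    dataset = f.create_dataset(
        "data",
        shape=(n_frames, height, width),
        dtype="uint16",  # match the original tiffs
        chunks=(frames_per_chunk, height, width),
        compression="gzip",  # optional Deflate-style compression
    )
    # Frames would then be written in chunk-sized blocks, e.g.:
    # dataset[start:start + frames_per_chunk] = block_of_frames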

I've also commented on a Suite2p issue here to see if the developers/users believe that having chunked datasets makes sense/would be helpful. Other people have told me that it makes sense to do so, but the user in this question is somehow recording over 1 million images in a recording which is just bonkers. I don't know how in the world they're processing that... Regardless, Carsen Stringer, the lead developer from Janelia, basically said that making stacks is annoying (which it certainly is turning out to be...) and so maybe doing this is unnecessary. Will update this upon hearing answers...

For now, here's what I propose be done in the near future:

jmdelahanty commented 2 years ago

A few things discovered yesterday/yesterday evening:

  1. On tako, converting data from the individual tiffs into chunked datasets using MATLAB takes about 25 minutes and yields an h5 file that is about 95GB. This makes me think that something is off about how I used the h5create functionality in MATLAB because, if anything, it should be about the same size as the original (about 23GB for one channel). This is something that requires more investigation...
  2. Using the Deflate compression argument of MATLAB's h5create appears to substantially slow down converting the OME tiffs into h5 files for some reason. I'm thinking I'm using Deflate incorrectly or something...
  3. Using dask and zarr together is UNBELIEVABLY FAST. MIND BLOWINGLY FAST. At least it is on a local file system with an NVMe drive available on a powerful machine like tako. It needs to be benchmarked elsewhere, and time must be measured across multiple platforms/machines. Ideally the data can be stored on the server for file creation timing and then displayed in a graph for people to check out later or something... Converting one channel from many tiffs into chunked, compressed zarrs takes approximately one minute and yields zarr arrays totaling approximately 12.5GB. Given that such fast performance is possible using zarr for file creation, it could be well worth seeing whether we can run suite2p on zarrs for parallel processing (a rough sketch of the conversion follows this list)...
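
For reference, the conversion looks roughly like this (the file pattern, chunk size, and output path are placeholders, not our actual ones):

from glob import glob

import dask.array as da
import tifffile
import zarr

# Placeholder pattern for one channel's single-page OME tiffs
tiff_files = sorted(glob("/path/to/recording/*_Ch1_*.ome.tif"))

# tifffile can expose the whole sequence as a zarr store without reading it all into memory
store = tifffile.imread(tiff_files, aszarr=True)
stack = da.from_array(zarr.open(store, mode="r"), chunks=(600, -1, -1))

# Write a chunked, compressed (zarr's default Blosc) array, one store per channel
stack.to_zarr("/path/to/output_ch1.zarr", overwrite=True)
store.close()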

EDIT:

Link to discussion on Image.sc describing speeds of using H5/zarr for posterity.

jmdelahanty commented 2 years ago

Tiff stacks will not be generated, instead we will be writing to either H5 or zarr. It would be best to retain the OME information if possible, but how to best do that is not yet clear.

jmdelahanty commented 2 years ago

For posterity, including the Image.sc post describing this fast stuff and also showing that MATLAB is ballooning file sizes 4x due to its storage of datasets as float64. You could run the following as a script if you wanted, or just enter each of the statements below into a Python command line.

If you inspect an h5 file that was generated by MATLAB's h5 writer with default settings (I could never seem to figure out how to change the datatype it writes...) in Python, you can use something like this:


import numpy as np
import h5py
from pathlib import Path

h5_path = Path("/path/to/h5/file")

h5_file = h5py.File(h5_path, "r")  # open read-only

# I don't know if the data fields used in all of these are the same, so do this to find out what they are:

print(h5_file.keys())

# This prints what's contained in the dataset keys, e.g. <KeysViewHDF5 ['key_name']>.
# Only run that print if you don't already know the keys; otherwise comment it out.
# Substitute the actual key name for 'key_name' below (single or double quotes both work fine).

# This will show you the "shape" of the array, meaning its dimensions. These 2p recordings are in 3D: Time, X, Y
print(h5_file['key_name'].shape)

# This will show you the datatype of the data in the file. Note that the original tiffs are in `uint16`.
print(h5_file['key_name'].dtype)

# The result here will say float64, which means about a 4x increase in data size (uint16 is 2 bytes per pixel, float64 is 8).
# It also means there is a mismatch between the original datatype's precision and the output data.
# This probably doesn't mean anything for the analysis part, but it is not an efficient use of storage.
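
If you wanted to claw that space back, a rough sketch (key name and paths are placeholders, and this assumes the values really are the original integer pixel counts) would be to cast back to uint16 and rewrite the data in chunk-sized blocks:

import h5py

with h5py.File("/path/to/h5/file", "r") as src, h5py.File("/path/to/uint16_copy.h5", "w") as dst:
    source = src["key_name"]
    block = 600  # ~300 MB of uint16 frames at 512 x 512
    output = dst.create_dataset(
        "key_name",
        shape=source.shape,
        dtype="uint16",  # match the original tiffs
        chunks=(min(block, source.shape[0]),) + source.shape[1:],
        compression="gzip",
    )
    # Copy block by block so the whole recording never sits in memory at once
    for start in range(0, source.shape[0], block):
        output[start:start + block] = source[start:start + block].astype("uint16")
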
weiglszonja commented 1 year ago

@jmdelahanty did you also work with multichannel ome tiff datasets? It's interesting what you mentioned about the OME metadata in the master OME.tif file:

Bruker outputs OME.tiff files after the ripper has completed, but only the first image that's written is the OME.tif master file. The master file describes any additional channels that are recorded at the time. So if two channels were recorded at once, the only master file generated will be the one for Ch1, and it contains the information for any subsequent channels recorded during the session.

I received a dual color example recently, and I was expecting to see an indication of the presence of two channels in the XML file, but as I iterated over the File tags in the file, I did not find any indication of a second channel. I'm not sure how to tell the difference from a single-channel recording. I was wondering whether we are missing these files, or whether something else is going on here.

jmdelahanty commented 1 year ago

I spent a little time today looking for the source of this line:

So if two channels were recorded at once, the only master file generated will be the one for Ch1, and it contains the information for any subsequent channels recorded during the session.

But I haven't been able to locate it. In a couple minutes I'll take a super small recording with 2 channels and can share what I find here.

Here's an example .xml file from a fake recording I took today (attached at the bottom). Parsing it properly is surprisingly difficult for me! The tl;dr is that it appears the very first tif still has both channels encoded in it. Here's what I've done so far.

import tifffile
import xml.etree.ElementTree as ET

# Grab the OME XML string stored in the first page's ImageDescription tag
with tifffile.TiffFile("/path/to/tiff") as tif:
    ome_xml = tif.pages[0].tags["ImageDescription"].value

root = ET.fromstring(ome_xml)

# I dumped this root into an xml file which I renamed to a .txt file, hopefully it's readable for you!
# I did that with this:
# tree = ET.ElementTree(root)
# tree.write("out.xml")

# Can't for the life of me parse this dang thing properly!
# If you know how, I'd be happy to see how to parse it without using xpaths

print(root[0][1][0].items())
# [('ID', 'Channel:1:0'), ('Name', 'Ch1'), ('SamplesPerPixel', '1')]
print(root[0][1][1].items())
# [('ID', 'Channel:1:1'), ('Name', 'Ch2'), ('SamplesPerPixel', '1')]

Is this helpful?

out.txt

To be clear, this is the first tiff of the recording which has tags inside it formatted as an xml string. I couldn't figure out how to access that directly, so I dumped it to an xml file that I've shared here.

weiglszonja commented 1 year ago

Thank you @jmdelahanty for looking into this! I think the contents of the "ImageDescription" tag are identical to tif.ome_metadata.

This is how I was trying to parse it, but in our dual color example "SizeC" is 1, whereas I expected it to be 2.

import re
import tifffile
from xml.etree import ElementTree

# tif is a TiffFile opened on the first ome.tif of the recording (placeholder path)
tif = tifffile.TiffFile("/path/to/first.ome.tif")

ome_xml_metadata = tif.ome_metadata
ome_metadata_root = ElementTree.fromstring(ome_xml_metadata)
schema_name = re.findall(r"\{(.*)\}", ome_metadata_root.tag)[0]
pixels_element = ome_metadata_root.find(
        f"{{{schema_name}}}Image/{{{schema_name}}}Pixels")
channel_elements = ome_metadata_root.findall(
        f"{{{schema_name}}}Image/{{{schema_name}}}Pixels/{{{schema_name}}}Channel")
print("Dual color single plane")
print("num frames: ", int(pixels_element.attrib["SizeT"]))
print("X (pixels): ", int(pixels_element.attrib["SizeX"]))
print("X (microns): ", float(pixels_element.attrib["PhysicalSizeX"]))
print("Y (pixels): ", int(pixels_element.attrib["SizeY"]))
print("Y (microns): ", float(pixels_element.attrib["PhysicalSizeY"]))
print("number of planes: ", int(pixels_element.attrib["SizeZ"]))
print("Z (microns): ", float(pixels_element.attrib["PhysicalSizeZ"]))
print("number of channels: ", int(pixels_element.attrib["SizeC"]))
print("number of Channel elements: ", len(channel_elements))
print("dtype of frames: ", pixels_element.attrib["Type"])

Now I wonder how the XML file comes into the picture. So far, the Bruker TIFF format that we've seen with @CodyCBakerPhD consists of the numerous .ome.tif files (one file per frame) and an accompanying XML file. When you created this recording, did it also create an XML file similar to our case?

jmdelahanty commented 1 year ago

Thanks for the snippet looking into this! I've been struggling to access that information properly, and it looks like that is indeed the better way to do it.

It did! All Bruker recordings have a master .env and .xml file associated with them. Their "ripper" currently produces individual .ome.tif files, although as of version 5.8.64.100 there is a secondary converter that can take those tiffs and produce multipage tiffs instead. It's definitely better to use formats like h5, since processing workflows are slow when reading so many little files, not to mention having to actually manage them in your storage infrastructure. Christoph Gohlke, the author of tifffile, showed me how to get all those tiffs into a nice H5 file super fast with Dask here, which might be interesting to you!
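
Paraphrasing from memory (the paths and chunk size here are placeholders), that approach looks something like this:

from glob import glob

import dask.array as da
import tifffile
import zarr

# Placeholder pattern for one channel's single-page ome.tif files
files = sorted(glob("/path/to/recording/*_Ch1_*.ome.tif"))

store = tifffile.imread(files, aszarr=True)  # lazy view over the whole sequence
stack = da.from_array(zarr.open(store, mode="r"), chunks=(600, -1, -1))
stack.to_hdf5("/path/to/recording_ch1.h5", "/data")  # writes a chunked HDF5 dataset
store.close()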

Here's that xml file as a .txt: master.txt

And here's the env file as a .txt: env.txt

weiglszonja commented 1 year ago

Thank you @jmdelahanty this is incredibly useful! I think we will keep the format as close to the "raw" output as possible, but it's definitely good to know.

weiglszonja commented 1 year ago

@jmdelahanty I was also wondering, based on your experience with Bruker, is it possible to define disjoint z planes (i.e. a gap between the z planes), or do they always correspond to a volume?

jmdelahanty commented 1 year ago

In our lab's use of one of our Bruker scopes thus far, we image one plane at a time and do a set of trials at each plane. Our new Bruker system, the 2P+, has an electrically tunable lens. Deryn (@doleduke), one of our lab's graduate students, is going to be using it in the future! I'll also tag one of Bruker's helpful service engineers Kevin Mann (@mannk2) who would know. I don't know how often they check their GitHub notifications, but I can ask Deryn if she knows yet and also email Kevin from Bruker about it. ~Could I have your email and I can try putting you in touch with everyone?~ EDIT: I used the email I found on your GitHub profile. Let me know if there's a different one I should try! They're both super busy so I can't promise they'll get back to you, but it's worth a try!

weiglszonja commented 1 year ago

Hey, thanks so much, could you add my work email too?

jmdelahanty commented 1 year ago

Done!

weiglszonja commented 1 year ago

@jmdelahanty Could you confirm for us that with version >= 5.8 we can expect multipage tiffs only, or is it still possible for the user to write individual files? I'm just wondering whether we can trust the version number to infer that we have multipage tiffs, or whether we should just read the first .ome.tif and check the number of pages within.

It did! All Bruker recordings have a master .env and .xml file associated with them. Their "ripper" currently produces individual .ome.tif files, although as of version 5.8.64.100 there is a secondary converter that can take those tiffs and produce multipage tiffs instead.
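
For what it's worth, the page-count check I had in mind is just something like this (the path is a placeholder):

import tifffile

with tifffile.TiffFile("/path/to/first_frame.ome.tif") as tif:
    n_pages = len(tif.pages)

# 1 page per file is the classic one-frame-per-file ripper output;
# more than 1 suggests the multipage converter was used
print(n_pages)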

jmdelahanty commented 1 year ago

Hello!

I literally just tested this a couple hours ago through this repo/docker!

The default "ripper" still produces individual ome.tif files. There's a second executable that can take those tiffs and place them into multi-page tiffs. I would imagine most users probably don't even know that multi-page tif writing is possible with the newer updates, since I don't know how closely people read the changelogs Bruker publishes for their systems. Carsen Stringer, the author of Suite2p (a super common processing pipeline), doesn't suggest you use single-page tifs unless you have to, as in the Bruker case (see here).

At some point in the future, Bruker is planning on doing online, multi-page tif writing by default unless a legacy machine can't keep up with that. So eventually you'll probably have to handle those outputs. I could link my docs describing that message from their lead dev, Michael Fox.