jmdelahanty opened this issue 2 years ago
A few things discovered yesterday/yesterday evening:
- The MATLAB conversion produced an `h5` file that is about 95GB. This makes me think that something is off about how I used the `h5create` functionality in MATLAB because, if anything, it should be about the same size as the original (about 23GB for one channel). This is something that requires more investigation...
- The compression option for `h5create` called `Deflate` appears to substantially slow down the speed of converting the OME tiffs into h5 files for some reason. I'm thinking I'm using the `Deflate` command incorrectly or something...
- Using `dask` and `zarr` together is UNBELIEVABLY FAST. MIND BLOWINGLY FAST. At least it is on a local file system with an NVMe drive available on a powerful machine like tako. It needs to be benchmarked elsewhere and time must be measured across multiple platforms/machines. Ideally data can be stored on the server for file creation times and then displayed in a graph for people to check out later or something... Converting one channel from many tiffs into chunked, compressed zarrs takes approximately one minute and yields zarr arrays totaling approximately 12.5GB. Given that such fast performance is possible using zarr for file creation, it could be very worth it to see if we can run `suite2p` using zarrs for parallel processing...

EDIT: Link to discussion on Image.sc describing speeds of using H5/zarr for posterity.
Tiff stacks will not be generated; instead we will be writing to either H5 or zarr. It would be best to retain the OME information if possible, but how to best do that is not yet clear.
For posterity, including the Image.sc post describing this fast stuff and also that MATLAB is ballooning file sizes 4x due to its storage of datasets as `float64`. You could run this as a script if you wanted, or just enter each of these statements below into a Python command line.
If you inspect an h5 file that was generated by MATLAB's h5 writer with default settings (I could never seem to figure out how to change the datatype it writes to...) in Python using something like this:
import numpy as np
import h5py
from pathlib import Path

h5_path = Path("/path/to/h5/file")
h5_file = h5py.File(h5_path, "r")  # open read-only so nothing gets modified
# I don't know if the data fields used in all of these are the same, so do this to find out what they are:
print(h5_file.keys())
# Gives the result of what's contained in the dataset keys.
# You should only do that above print if you don't know what the keys of the data are.
# If you know already, then comment out that line.
# The result will look something like: <KeysViewHDF5 ['key_name']>
# Substitute the actual key name for 'key_name' in the brackets.
# Be sure to have the key name in quotes (single or double works fine).
# This will show you the "shape" of the array, meaning its dimensions. These 2p recordings are in 3D: Time, X, Y
print(h5_file['key_name'].shape)
# This will show you the datatype of the data in the file. Note that the original tiffs are in `uint16`.
print(h5_file['key_name'].dtype)
# The result here will say float64. This means there's about a 4x increase in data size.
# It also means that there is a mismatch between the original datatype's precision and the output data.
# This probably doesn't mean anything for the analysis part, but it is not efficiently using storage.
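If it turns out we want to shrink one of these ballooned files without re-running the conversion, here's a minimal sketch (untested, with 'key_name' and both paths as placeholders like above) of casting the dataset back to uint16 block by block so the ~95GB source never has to sit in memory:

import h5py
import numpy as np
from pathlib import Path

src_path = Path("/path/to/matlab/file.h5")   # placeholder path to the float64 file
dst_path = Path("/path/to/uint16/file.h5")   # placeholder path for the recompressed copy
block = 600  # frames per write, chosen only so each block stays a few hundred MB

with h5py.File(src_path, "r") as src, h5py.File(dst_path, "w") as dst:
    data = src["key_name"]                    # float64 dataset written by MATLAB
    n_frames, height, width = data.shape      # recordings are 3D: Time, X, Y
    out = dst.create_dataset(
        "key_name",
        shape=data.shape,
        dtype=np.uint16,                      # match the original tiff datatype
        chunks=(block, height, width),
        compression="gzip",
    )
    for start in range(0, n_frames, block):
        stop = min(start + block, n_frames)
        out[start:stop] = data[start:stop].astype(np.uint16)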
@jmdelahanty did you also work with multichannel ome tiff datasets? It's interesting what you mentioned about the OME metadata in the master OME.tif file:
Bruker outputs OME.tiff files after the ripper has completed but only the first image that's written is the OME.tif master file. This contains any additional channels that are recorded at the time. So if two channels were recorded at once, the OME.tifs that are generated will ONLY have the master-file generated for Ch1 which contains all the data for any subsequent channels inside recorded during a session.
I received a dual color example recently, and I was expecting to see an indication of the presence of two channels in the XML file, but as I iterated over the File
tags in the file, I did not find any indication of a second channel.
I'm not sure how to tell the difference from a single-channel recording.
I was wondering whether we are missing these files, or there is something else going on here.
I spent a little time today looking for the source of this line:
So if two channels were recorded at once, the OME.tifs that are generated will ONLY have the master-file generated for Ch1 which contains all the data for any subsequent channels inside recorded during a session.
But I haven't been able to locate it. In a couple minutes I'll take a super small recording with 2 channels and can share what I find here.
Here's an example .xml file with a fake recording I took today (at the bottom). Parsing it properly is surprisingly difficult for me! The TL;DR is that it appears the very first tif still has both channels encoded in it. Here's what I've done so far.
import tifffile
import xml.etree.ElementTree as ET

with tifffile.TiffFile("/path/to/tiff") as tif:
    metadata = tif.pages[0].tags["ImageDescription"]
    root = ET.fromstring(metadata.value)
    # I dumped this root into an xml file which I renamed to a .txt file, hopefully it's readable for you!
    # I did that with this:
    # tree = ET.ElementTree(ET.fromstring(metadata.value))
    # tree.write("out.xml")
    # Can't for the life of me parse this dang thing properly!
    # If you know how, I'd be happy to see how to parse it without using xpaths
    print(root[0][1][0].items())
    # [('ID', 'Channel:1:0'), ('Name', 'Ch1'), ('SamplesPerPixel', '1')]
    print(root[0][1][1].items())
    # [('ID', 'Channel:1:1'), ('Name', 'Ch2'), ('SamplesPerPixel', '1')]
Is this helpful?
To be clear, this is the first tiff of the recording which has tags inside it formatted as an xml string. I couldn't figure out how to access that directly, so I dumped it to an xml file that I've shared here.
Thank you @jmdelahanty for looking into this!
I think the contents of the "ImageDescription" tag are identical to `tif.ome_metadata`.
This is how I was trying to parse it, but in our dual color example "SizeC" is 1, where I expected it to be 2.
import re
import xml.etree.ElementTree as ElementTree

# Assumes `tif` is an open tifffile.TiffFile for the first .ome.tif, as above.
ome_xml_metadata = tif.ome_metadata
ome_metadata_element = ElementTree.fromstring(ome_xml_metadata)
tree = ElementTree.ElementTree(ome_metadata_element)
ome_metadata_root = tree.getroot()
schema_name = re.findall(r"\{(.*)\}", ome_metadata_root.tag)[0]
pixels_element = ome_metadata_root.find(
    f"{{{schema_name}}}Image/{{{schema_name}}}Pixels")
channel_elements = ome_metadata_root.findall(
    f"{{{schema_name}}}Image/{{{schema_name}}}Pixels/{{{schema_name}}}Channel")
print("Dual color single plane")
print("num frames: ", int(pixels_element.attrib["SizeT"]))
print("X (pixels): ", int(pixels_element.attrib["SizeX"]))
print("X (microns): ", float(pixels_element.attrib["PhysicalSizeX"]))
print("Y (pixels): ", int(pixels_element.attrib["SizeY"]))
print("Y (microns): ", float(pixels_element.attrib["PhysicalSizeY"]))
print("number of planes: ", int(pixels_element.attrib["SizeZ"]))
print("Z (microns): ", float(pixels_element.attrib["PhysicalSizeZ"]))
print("number of channels: ", int(pixels_element.attrib["SizeC"]))
print("dtype of frames: ", pixels_element.attrib["Type"])
Now I wonder how the XML file comes into the picture; so far the Bruker TIFF format that we've seen with @CodyCBakerPhD consists of the numerous .ome.tif files (one file per frame) and an accompanying XML file.
When you created this recording, did it also create an XML file similar to our case?
Thanks for the snippet looking into this! I've been struggling to access that information properly, and it looks like that is indeed a better way to do it.
It did! All Bruker recordings have a master .env and .xml file associated with them. Their "ripper" currently produces individual .ome.tif files, although as of version 5.8.64.100 they have a secondary converter that can take those tiffs and produce multipage tiffs instead. It's definitely better to use formats like h5 since processing workflows are slow when reading so many little files, not to mention having to actually manage them in your storage infrastructure. Christoph Gohlke, the author of tifffile, showed me how to get all those tiffs into a nice H5 file super fast with Dask here, which might be interesting to you!
Here's that xml file as a .txt: master.txt
And here's the env file as a .txt: env.txt
Thank you @jmdelahanty this is incredibly useful! I think we will keep the format as close to the "raw" output as possible, but it's definitely good to know.
@jmdelahanty I was also wondering, based on your experience with Bruker, is it possible to define disjoint z planes (with a gap between the z planes), or do they always correspond to a volume?
In our lab's use of one of our Bruker scopes thus far, we image one plane at a time and do a set of trials at each plane. Our new Bruker system, the 2P+, has an electrically tunable lens. Deryn (@doleduke), one of our lab's graduate students, is going to be using it in the future! I'll also tag one of Bruker's helpful service engineers Kevin Mann (@mannk2) who would know. I don't know how often they check their GitHub notifications, but I can ask Deryn if she knows yet and also email Kevin from Bruker about it. ~Could I have your email and I can try putting you in touch with everyone?~ EDIT: I used the email I found on your GitHub profile. Let me know if there's a different one I should try! They're both super busy so I can't promise they'll get back to you, but it's worth a try!
Hey, thanks so much, could you add my work email too?
Done!
@jmdelahanty Could you confirm for us that with version >= 5.8 we can expect multipage tiffs only, or is it still possible for the user to write individual files? I'm just thinking about whether we can trust the version number to infer whether we have multipage tiffs, or whether we should just read the first .ome.tif and check the number of pages within.
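A minimal sketch of that page-count check with tifffile (the path is a placeholder):

import tifffile

# Peek at the first .ome.tif: the legacy ripper's single-page output should
# report 1 page here, while the newer converter's multipage tiffs should report many.
with tifffile.TiffFile("/path/to/first.ome.tif") as tif:
    print("pages in first tiff:", len(tif.pages))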
It did! All Bruker recordings have an associated master .env and .xml associated with them. Their "ripper" produces individual .ome.tif files currently although as of version 5.8.64.100 they have a secondary converter that can take those tiffs and produce multipage tiffs instead.
Hello!
I literally just tested this a couple hours ago through this repo/docker!
The default "ripper" still produces individual ome.tif files. There's a second executable that can take those tiffs and place them into multi-page tiffs. I would imagine most users probably don't even know that multi-page tif writing is even possible with the newer updates since I don't know how closely people read changelogs from Bruker for their systems. Carsen Stringer, the author of Suite2p (a super common processing pipeline), doesn't suggest you use single page tifs unless you have to, as with the Bruker case (see here).
At some point in the future, Bruker is planning on doing online, multi-page tif writing by default unless a legacy machine can't keep up with that. So eventually you'll probably have to handle those outputs. I could link my docs describing that message from their lead dev Michael Fox.
Bruker outputs OME.tiff files after the ripper has completed, but only the first image that's written is the OME.tif master file. This contains any additional channels that are recorded at the time. So if two channels were recorded at once, the OME.tifs that are generated will ONLY have the master-file generated for Ch1, which contains all the data for any subsequent channels recorded during a session.

I'm asking questions in a gitter chatroom for zarr developers who have been helpful so far, but I haven't heard back about a couple other new things yet.
I'm currently able to parse the OME xml properly but am still learning about the different image tag numbers that are used. There doesn't appear to be a clear explanation for what these different tags are or what other tag information is available in the OME-tiff master file. Here's what I've found out so far about the structure:

As of 12/13/21, a method for parsing an OME-tiff master file structured like this one that will preserve the metadata of the tiffs for chunked HDF5/zarr files has not been found.
Good news is that it appears that Deryn's code in MATLAB using the h5write function does know how to create chunks of data from the many tiffs! It will take some additional testing to make sure it's doing what we expect, but MATLAB may have solved this issue for us thankfully. Another piece of good news is that we can likely run multiple copies of MATLAB at once because it looks like each conversion to H5 only uses 1 CPU (probably because it's I/O dependent).

Unfortunately, it looks like MATLAB's h5write doesn't seem to know how to keep the OME metadata. It doesn't seem like not having the metadata gets in the way of any processing or sharing, and I'm currently unaware of whether NWB uses OME data/can store it somehow. I've opened a discussion here to find out more.

However the tiffs are processed, each channel should have its own chunked dataset with chunks of around 300MB, which is about 600 images each. This will make it so processing can be done on smaller chunks on machines without using so much memory for single planes/recordings and, especially important, make it easy to display data on people's desktops in the office. If one day multiple channels are to be recorded consistently (I've seen up to 6 at once in some forum posts!) and we do decide to keep the OME information for a given session, we'll probably want to get that data out of the xml and build chunked datasets automatically.
I've also commented on a Suite2p issue here to see if the developers/users believe that having chunked datasets makes sense/would be helpful. Other people have told me that it makes sense to do so, but the user in this question is somehow recording over 1 million images in a recording which is just bonkers. I don't know how in the world they're processing that... Regardless, Carsen Stringer, the lead developer from Janelia, basically said that making stacks is annoying (which it certainly is turning out to be...) and so maybe doing this is unnecessary. Will update this upon hearing answers...
For now, here's what I propose be done in the near future:
dask