acardona opened 2 years ago
I agree with all of those issues, but I don't think we should consider modifying the DAT format except as a last resort. I believe the very existence of the DAT format is itself the problem. The features and flexibility you want already exist in a variety of array storage formats (nrrd, hdf5, zarr, even tif), so the FIBSEM devices should just save data in one of those formats. Unfortunately this is not a programming problem but rather a social / organizational one -- the various stakeholders (i.e., lab heads) need to convince those people with power over the FIBSEM acquisition software to use a standard format.
enable acquisition software to store only one channel (to save 50% of space: we never use the second channel).
This would be a huge win for storage. We have also never used the second channel.
We are not likely to be able to change the format saved at acquisition.
That leaves several remaining questions:
I have some ideas for 1 and 2, but we are only really considering 3 and 4 at the moment.
A few months ago @d-v-b's pipeline did involve rolling the axes so that the data in the two channels would be contiguous and not interleaved. However, the current code no longer does this: https://github.com/janelia-cosem/fibsem-tools/commit/d62c93d8cd0a58ae9766eff2fbb2f2bba2e3dbb6#diff-617ebadd3f040b05f0aff77a6fb09ba4f673b25df064e960061c33a4b40ccabaL619
Perhaps we should consider a metadata attribute that describes the axis order so that if we do decide to roll the axes again, we have a way of indicating that?
If we can't change the format saved at acquisition then IMO the next best thing would be a post-processor that takes data from the current format to a single maximally useful format, reliable enough that we could immediately delete the raw image. Otherwise we're 1.5x-ing our storage costs, not reducing them. Of course, if we're going to have a good, robust reference implementation of a reader, we may as well make it an importable library as well as a binary, although running with minimal dependencies (or in an apptainer or something) would also be a goal. Certainly in a FOSS runtime.
To me, maximally useful means an HDF5 (stable, widely supported, self-documenting) with a dataset for each channel, where all of the metadata is present (and channel-specific metadata goes on the channel dataset) with explicit, canonical names. This should have as few options as possible to avoid the problem of Thousands of Incompatible File Formats: keep the output consistent and generic. It should be able to drop channels, and maybe add compression as well, but as I understand it the raw images compress very poorly and leaving the data raw may allow us to verify it's all allocated, at least, by checking the size. And raw means it can be memory mapped by other readers.
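As a rough illustration of that target (not a reference implementation), here is a minimal Python sketch that splits the two channels into separate datasets and carries the raw header along. The dataset and attribute names are placeholders, and the file name and shape are taken from examples later in this thread:

import h5py
import numpy as np

DAT = "Merlin-6257_21-05-20_125527_0-0-0.dat"  # example file used later in this thread
shape = (3500, 7437, 2)                        # YRes, XRes, ChanNum for that file

with open(DAT, "rb") as f:
    header_bytes = f.read(1024)                # fixed-size DAT header
    data = np.fromfile(f, dtype=">i2", count=int(np.prod(shape))).reshape(shape)

with h5py.File(DAT.replace(".dat", ".h5"), "w") as out:
    # Keep the raw header verbatim; parsed, canonically named attributes would go here too.
    out.attrs["dat_header"] = np.frombuffer(header_bytes, dtype="u1")
    for i, name in enumerate(("detector_a", "detector_b")):   # placeholder dataset names
        ds = out.create_dataset(name, data=data[..., i])      # uncompressed, so still mappable
        ds.attrs["source_channel"] = i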
Immediately post-acquisition would probably be best, so long as the microscope isn't close to saturating the write speed of the kind of disks it's generally used on. If we wait until it's on the cluster, it's too late - getting it there is error-prone and high-speed cluster storage is (in our case) not sufficiently backed up, so we'd need to keep the originals. It's also very light on resources other than IO so there's no real need to do it on HPC, not to mention highly parallelisable.
Either way, a very minor change to the acquisition software which would be very useful would be to write the raw images to a file with some extension indicating that it's a partial file, then rename it as soon as it's done. Then it's much easier to track which files need to be post-processed, backed up, or moved - this is a problem we have already.
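For illustration, that pattern is tiny in any language; a Python sketch, with the .part suffix and the write step as placeholders:

import os

def write_acquisition(f):
    # Placeholder for the real header/image/recipe writes done by the microscope software.
    f.write(b"\x00" * 1024)

final_path = "Merlin-6257_21-05-20_125527_0-0-0.dat"   # example name from this thread
partial_path = final_path + ".part"                    # any agreed "in progress" suffix works

with open(partial_path, "wb") as f:
    write_acquisition(f)

# The rename is atomic on the same filesystem, so downstream tools (backup, transfer,
# post-processing) only ever see complete .dat files.
os.rename(partial_path, final_path)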
Thanks for the comments @clbarnes .
Regarding HDF5, this is the interface that @d-v-b is currently building: https://github.com/janelia-cosem/fibsem-tools/blob/master/src/fibsem_tools/io/h5.py
Further enhancements to create a context manager are here. If you would like feel free to review this pull request: https://github.com/janelia-cosem/fibsem-tools/pull/24
The approach involves reading in the DAT file using fibsem.py and then writing out a HDF5 file using the above code. https://github.com/janelia-cosem/fibsem-tools/blob/master/src/fibsem_tools/io/fibsem.py
A lightweight method that we could pursue involves using an external dataset. In this case, we could create a small HDF5 file containing the attribute metadata and point to the data in the existing DAT file. This avoids duplication of the image data.
https://docs.h5py.org/en/stable/high/dataset.html#h5py.Dataset.external https://portal.hdfgroup.org/display/HDF5/H5P_SET_EXTERNAL
I recently added support for H5P_SET_EFILE_PREFIX to h5py. This makes it easier to locate the DAT file, including making it relative to the HDF5 file by using $ORIGIN.
https://github.com/h5py/h5py/pull/2092
When imaging tens of terabytes one wants to minimize friction in the system. Not changing the LabView code and instead relying on separate handler software to transform the data seems like unnecessary overhead.
We have the source code and the ability to change it: I can't see why not add a switch to not store the second channel, which immediately solves both the excess storage and the interleaved problem, and then another switch to save as unsigned 16-bit data type as opposed to signed like it does now. That alone would address all my concerns while maintaining backwards (and forward) compatibility with further additions to the DAT writer software.
We have the source code and the ability to change it: I can't see why not add a switch to not store the second channel, which immediately solves both the excess storage and the interleaved problem, and then another switch to save as unsigned 16-bit data type as opposed to signed like it does now. That alone would address all my concerns while maintaining backwards (and forward) compatibility with further additions to the DAT writer software.
If you can safely make these changes to the software (perhaps you have someone who knows labview well enough to do it, and there are no concerns about potential compatibility issues if Shan makes upstream changes to the acquisition software), then there is no reason to use DAT anymore -- just add an option to save as NRRD or hdf5 or zarr, with the "skip the second channel" and "save as uint16" features included. This would be the ideal outcome, and if this works for you perhaps we could convince the operators of our own FIB-SEM devices to use these changes.
@acardona, sorry if there is a misunderstanding, but we do not have the ability to modify the code that runs on our microscopes at the moment, despite having the source code. We are currently limited to post-processing only.
Perhaps @avweigel or @StephanPreibisch can explain the current situation in more detail?
For the moment, we can try to formulate what our desired outcome is. From both of your comments, this is perhaps an HDF5 file where we have split the channels into two datasets, or perhaps even dropped one of the channels. This then gives us access to tools such as HDF5-JSON in order to produce easy-to-access files.
We may hold off on dropping the channel for the moment because we would also like to prove that we can do a round trip from DAT to HDF5 and back to DAT again. After demonstrating this we will be able to erase the DAT files.
perhaps we could convince the operators of our own FIB-SEM devices to use these changes.
In the past I have found Shan Xu to be responsive to user feedback. Will try to ask him directly now.
I'm not clear if Shan remains in the operative decision hierarchy although I would still appreciate his comments.
Before trying to convince others, we should fully determine what is the minimum acceptable engineering target for the file format and specification.
Once we can agree on a spec, we can then start with post-acquisition software. From there we can see if we can optimize the process and/or move some of the process up the data stream.
A major hurdle in making changes to the acquisition software is that it is unclear what, if anything, needs to be done. The benefits to the Hess lab are unclear. What makes it necessary to do this during or near acquisition?
As an aside, I have been playing around today and made some progress, on this branch https://github.com/clbarnes/jfibsem_dat/pull/10:
Using @d-v-b's code, I produced a spec for each header version as a simple TSV, which looks like this
name dtype offset shape
FileMagicNum >u4 0 0
FileVersion >u2 4 0
FileType >u2 6 0
SWdate >S10 8 0
TimeStep >f8 24 0
ChanNum >u1 32 0
EightBit >u1 33 0
This is A) a good reference for anyone of any language and can easily be checked by non-developers, B) very easy to diff without language-specific stuff getting in the way, C) means that all versions can be supported in <100 lines (of python), with new versions simply a case of adding a new TSV to a folder. Using this, I can roundtrip the metadata to a python dict, and with another 20 lines, roundtrip through HDF5 attributes for a byte-identical output, theoretically for any current or future header version (I only have v8 to test on). Small, simple, maintainable; and non-developers could even update the spec themselves in excel or whatever. My vision would be that those TSVs live in a repo which is used as a git submodule by implementations using them.
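For concreteness, here's a sketch of how such a TSV could drive a reader with numpy. Treating shape 0 as "scalar", the 1024-byte header size, and the file paths in the usage comment are my assumptions, not part of the spec itself:

import csv
import numpy as np

def load_spec(tsv_path):
    # Each row of the per-version TSV gives: name, dtype, offset, shape.
    with open(tsv_path, newline="") as f:
        rows = list(csv.DictReader(f, delimiter="\t"))
    fields = []
    for row in rows:
        shape = int(row["shape"])
        count = shape if shape > 0 else 1   # assume shape 0 means a scalar field
        fields.append((row["name"], np.dtype(row["dtype"]), int(row["offset"]), count))
    return fields

def parse_header(dat_path, fields, header_nbytes=1024):
    # Read the fixed-size header once, then decode each field at its byte offset.
    with open(dat_path, "rb") as f:
        buf = f.read(header_nbytes)
    header = {}
    for name, dtype, offset, count in fields:
        values = np.frombuffer(buf, dtype=dtype, count=count, offset=offset)
        header[name] = values[0] if count == 1 else values
    return header

# Hypothetical usage:
# fields = load_spec("specs/v8.tsv")
# header = parse_header("Merlin-6257_21-05-20_125527_0-0-0.dat", fields)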
This low-level representation could also easily be used to generate code for a higher-level representation (with type checking etc.) or for less dynamic languages.
I can also successfully round-trip the data array through HDF5, although it requires what I can only assume are a bunch of copies due to h5py normalising byte and/or axis orders.
Byte-identity on the whole file may not be possible because some dat files seem to have a post-array footer which I haven't seen any documentation for, although I suppose that could just be dumped into the HDF5 metadata as an array of uint8 and treated specially. But the metadata+data portion of it is going fine.
The benefits to the Hess lab are unclear.
The very nature of collaborative research means that everyone benefits from people using open standards and tools. Sometimes formats don't exist to represent what you need. That is clearly not the case here - HDF5 does exactly what we need and more, and has well-supported implementations for just about every language anyone might want to use, including for HPC access.
If the concept of open science is not enough to sway them, then perhaps "you would have saved yourself the time of every conversation you've ever had about this file format, past present and future" would?
@clbarnes I did prototype a Julia struct for fun based on the v9 header: https://github.com/mkitti/FIBSEMHeaders.jl/blob/main/src/FIBSEMHeaders.jl#L5
One difference is that I also added spacer fields to ingest any spare bytes between the existing fields. Because of the spacer fields, I do not have to directly encode the offsets; they can be derived from the sizes of the field types.
A complication that I found is that the in-memory representation of the struct may be different from byte offsets due to data structure alignment issues.
In the past I have found Shan Xu to be responsive to user feedback. Will try to ask him directly now.
That's a great idea, thank you. He is in the best position to add hdf5 writing support, and hopefully if he hears from enough people then it will be clear how valuable the effort is to people processing the data downstream.
Before trying to convince others, we should fully determine what is the minimum acceptable engineering target for the file format and specification.
Hard disagree. We are not grappling with a computer science / programming problem. We know that any number of standard formats for array data + metadata will work here, and the details can be hashed out once any one of those formats is picked. Instead, this is a social engineering problem ("how can we get someone who knows labview to implement hdf5 writing in the FIBSEM software") and perhaps it is impossible, or at least beyond our power.
If our efforts to change the labview code fail, then I second the effort to specify a language-agnostic definition of the dat header (for each version).
how can we get someone who knows labview to implement hdf5 writing in the FIBSEM software
@d-v-b , I already prototyped this and talked this through with Dan. I have a LabView module that can write a HDF5 file with the HDF5 library. This is not the problem.
Writing labview code is not the same as implementing it in the FIBSEM software
This is integrated in the FIBSEM software. See that "SW Date" with the globe. That's pulling a global variable out of the FIBSEM software.
Since we're talking about contacting Shan (who was also invited to this repo), here's my last correspondence with him.
From: Xu, C. Shan [xuc@janelia.hhmi.org](mailto:xuc@janelia.hhmi.org) Sent: Friday, April 22, 2022 21:58 To: Kittisopikul, Mark Andrew [kittisopikulm@janelia.hhmi.org](mailto:kittisopikulm@janelia.hhmi.org) Subject: Re: FileLength vs array data offset
Hi Mark,
I think it is up to the person who wants to maintain the software going forward. 😊 Just make sure to communicate with others who will process the data.
Thanks, Shan
From: Kittisopikul, Mark Andrew [kittisopikulm@janelia.hhmi.org](mailto:kittisopikulm@janelia.hhmi.org) Date: Friday, April 22, 2022 at 8:35 PM To: Xu, C. Shan [xuc@janelia.hhmi.org](mailto:xuc@janelia.hhmi.org) Subject: FileLength vs array data offset Hi Shan,
Thank you for the tour today through the LabView code. I can see the code encompasses many years of experience and engineering.
In the dat files, I noticed that the array data is always assumed to start at byte 1024 and end at FileLength (provided at byte 1000), where recipe information may be stored. In other contexts, I found it useful to encode the start of the array data explicitly which allows for the expansion of the header if needed. For example, I could foresee new microscope systems needing a distinct set of attributes separate from the ones currently used.
One approach might be to calculate the start of the array data from “FileLength” rather than assume the array data starts at byte 1024. Another approach would be to encode the array data start location within the header, perhaps at byte 992. Of course, one could do both and verify the two numbers are consistent with the array size.
Do you prefer one approach over the other?
Thanks again and have a nice weekend, -Mark
Here is the comment within the LabView source code that describes what all these attributes are.
From "Zeiss File Header Tclk.vi":
Tracking File Headers:
BLN: boolean (1 byte)
U8: unsigned 8-bit (1 byte)
U16: unsigned short (2 bytes)
U32: unsigned long (4 bytes)
DBL: double (8 bytes)
I16: signed short (2 bytes)
I32: signed long (4 bytes)
Byte Type Significance
0 U32 Magic Number, should be 0xd3edf5f2 (3555587570)
4 U16 Version number (9)
6 U16 File type. 1: SEM image
8 STR SW date
24 DBL AI sampling time increment
32 U8 Number of channels
33 BLN 8-bit data
36 SGL AI device scaling coefficients or 8-bit conversion coefficients (Io and k)
68 U8 Restart flag
69 U8 Stage move flag
70 I32 X coordinate of the first pixel
74 I32 Y coordinate of the first pixel
100 U32 X resolution
104 U32 Y resolution
108 U16 AI oversampling
111 U8 Scan speed (Zeiss #)
112 SGL Actual AO (scanning) rate
116 SGL Frame line rampdown ratio
120 SGL X min (V)
124 SGL X max (V)
128 SGL Det min (V)
132 SGL Det max (V)
136 U16 Decimating factor
151 BLN AI Ch1
152 BLN AI Ch2
153 BLN AI Ch3
154 BLN AI Ch4
155 STR Sample ID
180 STR Notes
SEM parameters
380 STR Detector A
390 STR Detector B
410 STR Detector C
430 STR Detector D
460 SGL Magnification
464 SGL Pixel size (nm)
468 SGL Working distance (mm)
472 SGL EHT (kV)
480 U8 Aperture number
481 U8 High current mode
490 SGL Probe current (A)
494 SGL Scan rotation (deg)
498 SGL Chamber vacuum (Torr)
502 SGL Gun vacuum (Torr)
510 SGL Beam shift X
514 SGL Beam shift Y
518 SGL Stigmation X
522 SGL Stigmation Y
526 SGL Aperture aligment X
530 SGL Aperture aligment Y
534 SGL Stage X (mm)
538 SGL Stage Y (mm)
542 SGL Stage Z (mm)
546 SGL Stage T (deg)
550 SGL Stage R (deg)
554 SGL Stage M (mm)
560 SGL Brightness (%), detector A
564 SGL Contrast (%), detector A
568 SGL Brightness (%), detector B
572 SGL Contrast (%), detector B
FIB parameters
600 U8 Mode
604 SGL Focus (V)
608 U8 Probe current number
620 SGL Emission current (A)
624 SGL rotation (deg)
628 SGL Aperture alignment X
632 SGL Aperture alignment Y
636 SGL Stigmation X
640 SGL Stigmation Y
644 SGL Beam shift X (um)
648 SGL Beam shift Y (um)
Milling parameters
652 U32 X resolution
656 U32 Y resolution
660 SGL X size (um)
664 SGL Y size (um)
668 SGL UL-Ang (deg)
672 SGL UR-Ang (deg)
676 SGL Line milling time (s)
680 SGL FIB FOV (um)
684 U16 Lines auto pause
686 BLN FIB PID on
689 U8 FIB PID measured (0: specimen, 1: beam dump)
690 SGL FIB PID target
694 SGL FIB PID target slope
698 SGL FIB PID P
702 SGL FIB PID I
706 SGL FIB PID D
800 STR Machine ID
850 SGL Temperature (F)
854 SGL Faraday cup current (nA)
858 SGL Specimen current during FIB (nA)
862 SGL Beam dump 1 current (nA)
866 SGL Specimen current during SEM (nA)
870 SGL Milling Z voltage (V)
874 SGL Focus index
878 U32 FIB slice #
882 SGL Beam dump 2 current (nA)
886 SGL Milling current (nA)
1000 DBL File length, position for the beginning of recipe variant
1024 I16 1-D array of n channels of unscaled measurement from DAQ
Here is the comment within the LabView source code that describes what all these attributes are.
Ah! That's basically what's been missing. To make doubly sure: everything is byte-bigendian and bit-littleendian, SGL is a 32-bit float, STR is a null-padded ASCII string assumed to continue until the next byte offset, and SWDate is encoded as "DD/MM/YYYY" (switching this to ISO-8601 would be great). IIRC FIB mode and Milling PID measured are enums - what are the meanings of the possible values?
The screenshot above is from "Zeiss File Header Tclk.vi":
The context help for these functions may be helpful in understanding the above.
Write to Binary File: https://www.ni.com/docs/en-US/bundle/labview/page/glang/write_file.html
Set File Position: https://www.ni.com/docs/en-US/bundle/labview/page/glang/set_file_position.html
To make doubly sure, everything is byte-bigendian and bit-littleendian
Yes. The default for "Write to Binary File" is "big-endian, network order".
SGL is a 32-bit float
Yes. See https://www.ni.com/docs/en-US/bundle/labview/page/lvhowto/floating_point_numbers.html
STR is a null-padded ASCII string assumed to continue until the next byte offset
Correct
SWDate is encoded as "DD/MM/YYYY" (switching this to ISO-8601 would be great)
I am not sure where this is coming from at the moment. It is possible this date is being manually entered or is in an INI file.
FIB mode ... is an enum
Milling PID measured
This is internally "FIB PID measured"
@acardona During our Zoom meeting, one item you brought up was if the attributes could be read as text.
There is a tool hdf5json from The HDF Group that facilitates this. Here I used the "-d option" to only dump attributes and not the dataset values to JSON.
$ pip install h5json
$ python3 h5tojson.py -d Merlin-6257_21-05-20_125527.h5 > Merlin-6257_21-05-20_125527.json
Another useful tool is h5ls:
h5ls -v Merlin-6257_21-05-20_125527.h5 > Merlin-6257_21-05-20_125527.h5ls
I also wrote hcl for interactively navigating hdf5 hierarchies and interrogating attributes etc. The fact that there are lots of tools for working with HDF5 files in every way imaginable (and that any new HDF5 tools are widely useful) is another reason to move towards that format rather than something needlessly proprietary.
I also wrote hcl for interactively navigating hdf5 hierarchies and interrogating attributes etc.
Have you found particular advantages of your tools over h5dump and h5ls from https://portal.hdfgroup.org/display/HDF5/HDF5+Command-line+Tools ?
unsigned 16-bit data type as opposed to signed like it does now
One thing I am considering is whether the scale-offset filter could be used with other tools to address the issue that @acardona raises. At the very least we could encode the minimum as an offset within the scale-offset filter.
2. Deinterleave
This may be addressable using a virtual dataset with a hyperslab with stride = (1,1,2).
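For example, a minimal h5py sketch of that idea, assuming the interleaved array is already reachable as an HDF5 dataset (e.g. via the external-dataset file described below); the shape and names come from the example file used elsewhere in this thread:

import h5py

src_file, src_dset = "Merlin-6257_21-05-20_125527_0-0-0_external.h5", "0-0-0"
shape = (3500, 7437, 2)   # interleaved: channels on the last axis

source = h5py.VirtualSource(src_file, src_dset, shape=shape, dtype=">i2")

with h5py.File("deinterleaved_virtual.h5", "w") as f:
    for channel in range(shape[2]):
        # Map one channel's hyperslab (every other element of the last axis)
        # onto its own 2D virtual dataset; no image data is copied.
        layout = h5py.VirtualLayout(shape=shape[:2], dtype=">i2")
        layout[:, :] = source[:, :, channel]
        f.create_virtual_dataset(f"channel_{channel}", layout, fillvalue=0)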
One approach I am investigating is the external dataset feature. This allows us to address a DAT file via a HDF5 interface. This avoids the problem of duplicating the image data. It also provides a space for us to store HDF5 attributes.
import h5py
import os

dat_file_basename = "Merlin-6257_21-05-20_125527_0-0-0"
dat_filename = dat_file_basename + ".dat"

# attrs = parse_header(dat_filename)

# Can be calculated from attributes
offset = 1024
shape = (3500, 7437, 2)
datatype_nbytes = 2
nbytes = shape[0] * shape[1] * shape[2] * datatype_nbytes
file_length = offset + nbytes
actual_file_length = os.path.getsize(dat_filename)
recipe_nbytes = actual_file_length - file_length

with h5py.File(dat_file_basename + "_external.h5", "w") as h5f:
    header = h5f.create_dataset(
        "0-0-0_header",
        offset,
        dtype='u1',
        external=[(dat_filename, 0, offset)]
    )
    ds = h5f.create_dataset(
        "0-0-0",
        shape,
        dtype='>i2',
        external=[(dat_filename, offset, nbytes)]
    )
    recipe = h5f.create_dataset(
        "0-0-0_recipe",
        (recipe_nbytes,),
        dtype='u1',
        external=[(dat_filename, file_length, recipe_nbytes)]
    )
    # Add parsed attributes to dataset ds
    # for key, value in
    #     ds.attrs[key] = value
The main advantage of this is that it allows to use and build tools that focus on reading HDF5 files rather than manipulating DAT files. If we combine the above with one of the DAT header readers or perhaps a LabView module that writes HDF5 attributes directly, then we may be able to eliminate the need for DAT readers by others. Since this approach avoids rewriting the image data, it is very fast. The actual file produced is about 2KB.
We could generate such HDF5 files very close to acquisition without significant resources.
$ h5ls -va Merlin-6257_21-05-20_125527_0-0-0_external.h5
Opened "Merlin-6257_21-05-20_125527_0-0-0_external.h5" with sec2 driver.
0-0-0 Dataset {3500/3500, 7437/7437, 2/2}
Location: 1:1480
Links: 1
Extern: 1 external file
ID DSet-Addr File-Addr Bytes File
---- ---------- ---------- ---------- -------------------------------------
#000 0 1024 104118000 Merlin-6257_21-05-20_125527_0-0-0.dat
---- ---------- ---------- ---------- -------------------------------------
Storage: 104118000 logical bytes, 104118000 allocated bytes, 100.00% utilization
Type: 16-bit big-endian integer
Address: UNDEF
0-0-0_header Dataset {1024/1024}
Location: 1:800
Links: 1
Extern: 1 external file
ID DSet-Addr File-Addr Bytes File
---- ---------- ---------- ---------- -------------------------------------
#000 0 0 1024 Merlin-6257_21-05-20_125527_0-0-0.dat
---- ---------- ---------- ---------- -------------------------------------
Storage: 1024 logical bytes, 1024 allocated bytes, 100.00% utilization
Type: native unsigned char
Address: UNDEF
0-0-0_recipe Dataset {4667/4667}
Location: 1:1832
Links: 1
Extern: 1 external file
ID DSet-Addr File-Addr Bytes File
---- ---------- ---------- ---------- -------------------------------------
#000 0 104119024 4667 Merlin-6257_21-05-20_125527_0-0-0.dat
---- ---------- ---------- ---------- -------------------------------------
Storage: 4667 logical bytes, 4667 allocated bytes, 100.00% utilization
Type: native unsigned char
Address: UNDEF
I went ahead and moved forward with writing the metadata as HDF5 attributes directly from LabView. If we can write these attributes into a HDF5 file at or near acquisition, that removes most of the need for header specifications or readers going forward.
I'm confused about what attribute names we have decided to use. I just translated the wire names to snake_case for now. There are some discrepancies in the diagram. Y and Z seem swapped in some places. @d-v-b 's code has an offset at 980 for SEMSpecimenI, which does not exist here nor in the diagram above.
A minimalist approach to acquisition software modification would involve creating a template through the LabView equivalent of this code.
import h5py
import os

dat_file_basename = "Merlin-6257_21-05-20_125527_0-0-0"
dat_filename = dat_file_basename + ".dat"

# attrs = parse_header(dat_filename)

# Can be calculated from attributes
offset = 1024
shape = (3500, 7437, 2)
datatype_nbytes = 2
nbytes = shape[0] * shape[1] * shape[2] * datatype_nbytes
file_length = offset + nbytes
actual_file_length = os.path.getsize(dat_filename)
recipe_nbytes = actual_file_length - file_length

with h5py.File(dat_file_basename + "_template.h5", "w", userblock_size=offset) as h5f:
    # Create image dataset, fill with 1s for illustration
    ds = h5f.create_dataset(
        "0-0-0",
        shape,
        dtype='>i2',
        fillvalue=1
    )
    # Could use H5Pset_alloc_time early instead
    ds[0] = 1
    ds[shape[0]-1, shape[1]-1, shape[2]-1] = 1

    # Create recipe dataset, fill with 2s for illustration
    recipe = h5f.create_dataset(
        "0-0-0_recipe",
        (recipe_nbytes,),
        dtype='u1',
        fillvalue=2
    )
    # Could use H5Pset_alloc_time early instead
    recipe[0] = 2
    recipe[recipe_nbytes-1] = 2

    # Add parsed attributes to dataset ds
    # for key, value in
    #     ds.attrs[key] = value
This produces a file with the following properties:
$ h5ls -va Merlin-6257_21-05-20_125527_0-0-0_template.h5
Opened "Merlin-6257_21-05-20_125527_0-0-0_template.h5" with sec2 driver.
0-0-0 Dataset {3500/3500, 7437/7437, 2/2}
Location: 1:800
Links: 1
Storage: 104118000 logical bytes, 104118000 allocated bytes, 100.00% utilization
Type: 16-bit big-endian integer
Address: 2048
0-0-0_recipe Dataset {4667/4667}
Location: 1:1400
Links: 1
Storage: 4667 logical bytes, 4667 allocated bytes, 100.00% utilization
Type: native unsigned char
Address: 104120048
and the following hexdump:
$ hexdump Merlin-6257_21-05-20_125527_0-0-0_template.h5
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
0000400 4889 4644 0a0d 0a1a 0000 0000 0800 0008
0000410 0004 0010 0000 0000 0400 0000 0000 0000
0000420 ffff ffff ffff ffff d52b 0634 0000 0000
0000430 ffff ffff ffff ffff 0000 0000 0000 0000
0000440 0060 0000 0000 0000 0001 0000 0000 0000
0000450 0088 0000 0000 0000 02a8 0000 0000 0000
0000460 0001 0001 0001 0000 0018 0000 0000 0000
0000470 0011 0010 0000 0000 0088 0000 0000 0000
0000480 02a8 0000 0000 0000 5254 4545 0000 0001
0000490 ffff ffff ffff ffff ffff ffff ffff ffff
00004a0 0000 0000 0000 0000 0430 0000 0000 0000
00004b0 0010 0000 0000 0000 0000 0000 0000 0000
00004c0 0000 0000 0000 0000 0000 0000 0000 0000
*
00006a0 0000 0000 0000 0000 4548 5041 0000 0000
00006b0 0058 0000 0000 0000 0020 0000 0000 0000
00006c0 02c8 0000 0000 0000 0000 0000 0000 0000
00006d0 2d30 2d30 0030 0000 2d30 2d30 5f30 6572
00006e0 6963 6570 0000 0000 0001 0000 0000 0000
00006f0 0038 0000 0000 0000 0000 0000 0000 0000
0000700 0000 0000 0000 0000 0000 0000 0000 0000
*
0000720 0001 0006 0001 0000 0100 0000 0000 0000
0000730 0001 0038 0000 0000 0301 0001 0000 0000
0000740 0dac 0000 0000 0000 1d0d 0000 0000 0000
0000750 0002 0000 0000 0000 0dac 0000 0000 0000
0000760 1d0d 0000 0000 0000 0002 0000 0000 0000
0000770 0003 0010 0001 0000 0910 0000 0002 0000
0000780 0000 0010 0000 0000 0005 0010 0001 0000
0000790 0202 0102 0002 0000 0100 0000 0000 0000
00007a0 0004 0008 0001 0000 0002 0000 0100 0000
00007b0 0008 0018 0000 0000 0103 0800 0000 0000
00007c0 0000 b6f0 0634 0000 0000 0000 0000 0000
00007d0 0000 0058 0000 0000 0000 0000 0000 0000
00007e0 0000 0000 0000 0000 0000 0000 0000 0000
*
0000830 4e53 444f 0001 0002 0008 0000 0000 0000
0000840 0320 0000 0000 0000 0000 0000 0000 0000
0000850 0000 0000 0000 0000 0000 0000 0000 0000
0000860 0010 0000 0000 0000 0578 0000 0000 0000
0000870 0000 0000 0000 0000 0000 0000 0000 0000
*
0000970 0000 0000 0000 0000 0001 0006 0001 0000
0000980 0100 0000 0000 0000 0001 0018 0000 0000
0000990 0101 0001 0000 0000 123b 0000 0000 0000
00009a0 123b 0000 0000 0000 0003 0010 0001 0000
00009b0 0010 0000 0001 0000 0000 0008 0000 0000
00009c0 0005 0010 0001 0000 0202 0102 0001 0000
00009d0 0002 0000 0000 0000 0004 0008 0001 0000
00009e0 0001 0000 0002 0000 0008 0018 0000 0000
00009f0 0103 bef0 0634 0000 0000 123b 0000 0000
0000a00 0000 0000 0000 0000 0000 0078 0000 0000
0000a10 0000 0000 0000 0000 0000 0000 0000 0000
*
0000c00 0100 0100 0100 0100 0100 0100 0100 0100
*
634c2f0 0202 0202 0202 0202 0202 0202 0202 0202
*
634d52b
The first column shows the addresses in hexadecimal. 0x0000400 is the start of the HDF5 header. 0x0000c00 is where the image data starts, as indicated by the 0x0100 fill pattern. 0x634c2f0 is where the second dataset for the recipe data is stored, as indicated by the 0x0202 fill pattern.
If we can generate this template just before acquisition starts and have the acquisition software open this file instead of creating a new one, then we only have to modify one value in LabView, changing 1024 to 3072 in "2D Scan Tclk.vi":
The current software with the above modification will then do the following:
1. Write the 1024-byte DAT header as usual; it lands in the HDF5 userblock (before 0x0000400).
2. Seek to the pre-allocated image dataset (0x0000c00) and write the image data.
3. Write the recipe data into the pre-allocated recipe dataset (0x634c2f0).
Alternatively, if we let the acquisition software run first with the 1024 -> 3072 change, we could just write in the bytes between 0x0000400 and 0x0000c00 after acquisition. Davis is not a fan of the second approach because of the possibility of corrupting the image data.
Once we have a HDF5 file, it would not be hard to reprocess it. The HDF5 library itself can use hyperslab selection to deinterleave the image quickly in C. Alternatively, we can do this virtually by storing the hyperslab selection in a virtual dataset.
The code already seems to have a toggle to record one or two channels. This means the decision to record a channel is configurable by the microscope operator.
Both channels:
Detector A only:
Detector B only:
Here is a similar feature in another part of the code:
Thinking about it, the 4th picture is actually when the data is being fed into the write buffer, but the top 3 pictures include the portion where the data is actually being written.
The toggles for Detectors A and B are depicted below on the main control panel.
One significant difference between writing into HDF5 datasets and into the file as above is that writing into a HDF5 dataset is usually done at the scale of entire buffers: https://portal.hdfgroup.org/display/HDF5/H5D_WRITE
herr_t H5Dwrite( hid_t dataset_id, hid_t mem_type_id, hid_t mem_space_id, hid_t file_space_id, hid_t xfer_plist_id, const void * buf )
While it is possible to define an HDF5 dataspace for integer-by-integer writing, this would be a significant modification of the I/O process.
Shan Xu replied to my email a couple of days ago: his software already has two checkboxes for whether a channel is to be saved or not. They are named "Det", and says Shan: "One can uncheck Det B (ESB) so that only Det A (InLens) is recorded."
In other words, not knowing the UI well cost us some extra storage, fortunately only for 2 smallish datasets. Now we can halve the storage and avoid the deinterleaving.
The signed vs unsigned can be solved by adding +32768 to the values, so it isn't onerous with an e.g., ImgLib2 converter filter upon reading the file.
The signed vs unsigned can be solved by adding +32768 to the values, so it isn't onerous with an e.g., ImgLib2 converter filter upon reading the file.
What are the pixel values actually measuring? Adding that constant obviously converts i16 to u16, but if they're both on arbitrary scales it doesn't really matter. The minimum values I recall seeing are nowhere near -intmax. If the original values are completely arbitrary, they should be written in u16; if they're not, we should try to keep them raw in case they correspond to the scaling values in the metadata; if they are not arbitrary but our contrast corrections are purely relative to other data values, then surely we can feed the i16 directly in rather than doing the addition and then exactly the same scaling?
Have you found particular advantages of your tools over h5dump and h5ls from https://portal.hdfgroup.org/display/HDF5/HDF5+Command-line+Tools ?
If I don't know what's in an HDF5 file, I find it more comfortable to interactively explore it rather than ask specific things one at a time. As an analogy, if I don't know what's in a server, I SSH into a shell on that server to explore it rather than running ssh $SERVER find . -type f etc. But it was at least partially just an excuse to learn prompt_toolkit.
A minimalist approach to acquisition software modification would involve creating a template through the LabView equivalent of this code.
Thanks Mark! Looks like a move in the right direction. So the idea is to pre-allocate an empty HDF5 file and then write into that rather than building a file as we go? That seems like a good idea either way. It would definitely benefit from the unfinished file having an extra extension (.part or whatever) and then being renamed at the very end of the process so that it's easy for filesystem tools to distinguish them.
Could you go into more detail about the recipe dataset? Is that another name for the header or is that additional? Some files seem to have a footer, is that related? My roundtrip code now includes the footer too.
Is there a benefit to writing the header into the userblock, as opposed to just writing it in as a uint8 array group attribute or even a uint8 dataset?
I'm confused what attribute names we have decided to use. I just translated the wire names to snake_case for now.
I don't have strong feelings about snake/Camel/ etc., I only went to snake for mine as I was changing the names anyway (to make them more explicit and re-order them so that e.g. resolution became a tuple). The important thing is making them more explicit: no sense in saving a few characters in variable names if the files can be made that much more self-documenting.
At the very least we could encode the minimum as an offset within the scale-offset filter.
If we're writing metadata at the end anyway, and we're going to be doing purely data-dependent filtering, then storing the actual recorded min and max values would definitely be helpful.
As far as I have seen, the pixel measurements are straightforward. They are saved as int16 instead of uint16 for mysterious reasons; likely it was the default or something and Shan went with that.
In our processing at the LMB, I compute the min, then I make all values in each image relative to that in uint16. Empirically works well and as expected.
Going forward, I'm just going to add 32768 when reading the pixels. Less processing.
A general comment: I cannot understand why you at Janelia think you can't change the software. Of course you can. Don't try to work with limitations that come to you from "above", when those above don't know or understand yet the issue in the first place.
The signed vs unsigned can be solved by adding +32768 to the values, so it isn't onerous with an e.g., ImgLib2 converter filter upon reading the file.
What are the pixel values actually measuring? Adding that constant obviously converts i16 to u16, but if they're both on arbitrary scales it doesn't really matter. The minimum values I recall seeing are nowhere near -intmax. If the original values are completely arbitrary, they should be written in u16; if they're not, we should try to keep them raw in case they correspond to the scaling values in the metadata; if they are not arbitrary but our contrast corrections are purely relative to other data values, then surely we can feed the i16 directly in rather than doing the addition and then exactly the same scaling?
If I recall correctly, the SEM detectors emit a voltage (i.e., a value that fluctuates around 0), which is then digitized to a range of integers that is also centered on 0 (int16). Via some other metadata in the dat header, these signed values can be mapped to unsigned electron counts. The only downside I can imagine to converting from int16 to uint16 is that it ultimately makes it a teeny bit harder to map intensities to electron counts (if that's something you want to do), but I don't think this is a significant burden.
If we're writing metadata at the end anyway, and we're going to be doing purely data-dependent filtering, then storing the actual recorded min and max values would definitely be helpful.
Yes, this would be super useful for the inevitable contrast normalization step, or converting from 16 bit ints down to 8 bit ints.
I don't have strong feelings about snake/Camel/ etc
Neither do I, either is totally fine with me.
I cannot understand why you at Janelia think you can't change the software. Of course you can. Don't try to work with limitations that come to you from "above", when those above don't know or understand yet the issue in the first place.
The answer is very simple: we have explained these issues, and the people in control of the acquisition software / devices did not want to change the software to address them. From their perspective, the status quo is acceptable. This limitation comes from "above", but that's fine -- this software is an important part of ensuring that the FIBSEM acquisition is stable, so the people managing acquisition should be cautious about changing it. So the challenge is convincing the people in charge of the acquisition machines that these software changes are a) safe, and b) important for accelerating science. I think we can do both, but it's not a trivial process.
A general comment: I cannot understand why you at Janelia think you can't change the software. Of course you can. Don't try to work with limitations that come to you from "above", when those above don't know or understand yet the issue in the first place.
As @d-v-b said, we need to carefully balance the needs of the labs that have a running pipeline, small projects, and large projects. We can discuss more in Berlin @acardona, but for now I am even happy with a robust post-acquisition resaving tool. In the long run it will convince people to change, I am sure of it ...
Going forward, I'm just going to add 32768 when reading the pixels. Less processing.
Be careful not to overflow the int16 in the process. Is it possible to do a bitcast to uint16 in your language of choice before adding this value?
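For example, in numpy both the widening and the bitcast approaches would look something like this (using the example file name and shape from earlier in the thread, with the array data assumed to start at byte 1024):

import numpy as np

shape = (3500, 7437, 2)   # YRes, XRes, ChanNum of the example file
with open("Merlin-6257_21-05-20_125527_0-0-0.dat", "rb") as f:
    f.seek(1024)                                             # array data starts at byte 1024
    raw = np.fromfile(f, dtype=">i2", count=int(np.prod(shape)))

# Widen before adding 32768 so the intermediate sum cannot overflow int16 ...
as_uint16 = (raw.astype(np.int32) + 32768).astype(np.uint16)

# ... or reinterpret the same bytes as uint16 and flip the sign bit, which is the
# same mapping without a widening pass (two's complement: v + 32768 == v ^ 0x8000).
as_uint16_bitcast = raw.view(">u2") ^ np.uint16(0x8000)

assert np.array_equal(as_uint16, as_uint16_bitcast)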
About the HDF5 scale-offset filter:
https://confluence.hdfgroup.org/display/HDF5/H5P_SET_SCALEOFFSET
For this purpose, the offset part of this filter would allow us to define a fixed offset for a dataset. When someone reads the data using H5Dread, the filter will automatically add the offset. It does not actually change the values on disk.
Perhaps a better idea would be to process the data once and store the minimum value as an attribute so it does not need to happen on each read.
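A sketch of that one-pass approach with h5py; the dataset name follows the earlier examples, and the raw_min / raw_max attribute names are just suggestions:

import h5py

with h5py.File("Merlin-6257_21-05-20_125527_0-0-0_external.h5", "r+") as f:
    ds = f["0-0-0"]
    data = ds[:]                              # single pass over the pixel data
    ds.attrs["raw_min"] = int(data.min())     # cached so readers never have to rescan
    ds.attrs["raw_max"] = int(data.max())     # handy for contrast normalization / 8-bit conversion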
At the moment, when I open a .DAT file to read a single image channel, I have to:
1. Read the whole file with two channels, because their pixels are interleaved. This has costs in storage (2 channels when we want 1), network transfer, and processing. Solution: enable the acquisition software to store only one channel (to save 50% of space: we never use the second channel).
2. Deinterleave: transform an e.g., 1 GB array into two 0.5 GB arrays. Solution: see (1) above.
3. Convert from signed 16-bit to unsigned 16-bit, by iterating the whole channel image array twice: once to find the min value, a second time to subtract it from every pixel. Solution: store images directly as unsigned 16-bit.
In summary, at the very least enable .DAT format to use 16-bit unsigned, and to store only a single channel. This minimal change would halve our storage and network bandwidth costs, and reduce the post-processing steps by quite a bit.