Closed noemifrisina closed 3 months ago
Could you share the full paths to those example files?
You have listed datasets that we try to give a sensible fixed length, rather than leaving them variable-length. I think we could just take out the lengths in the dataset definitions.
I don't really understand what the actual problem is with fixed length strings, though. What is the error?
The main problem is that autoPROC fails when trying to open master files linked to those fields as they are now.
You can find an example here /dls/mx-scratch/yqf46943/error-h5
Since the error wasn't very specific, @rtuck99 ran it in a debugger, and it turns out that the problem comes from the HDF5 library. Here's the error trace for one of them; it seems to be a conversion issue:
################# File = /data/20240429_BMS-0003_11_1_1_master.h5
>>> Image format detected as HDF5/Eiger
Breakpoint 2, H5E_printf_stack (estack=estack@entry=0x0, file=file@entry=0x5620c37ca465 "H5Tconv.c", func=func@entry=0x5620c37cda78 <__func__.475> "H5T__conv_s_s", line=line@entry=4500, cls_id=792633534417207296, maj_id=864691128455135261, min_id=864691128455135388, fmt=0x5620c37cab60 "The library doesn't convert between strings of ASCII and UTF") at H5Eint.c:672
672 {
(gdb) bt
#0 H5E_printf_stack (estack=estack@entry=0x0, file=file@entry=0x5620c37ca465 "H5Tconv.c",
func=func@entry=0x5620c37cda78 <__func__.475> "H5T__conv_s_s", line=line@entry=4500,
cls_id=792633534417207296, maj_id=864691128455135261, min_id=864691128455135388,
fmt=0x5620c37cab60 "The library doesn't convert between strings of ASCII and UTF") at H5Eint.c:672
#1 0x00005620c365c9e7 in H5T__conv_s_s (src_id=<optimized out>, dst_id=216172782113784136,
cdata=0x5620c42dfce0, nelmts=0, buf_stride=0, bkg_stride=<optimized out>, buf=0x0, bkg=0x0)
at H5Tconv.c:4500
#2 0x00005620c3644ec7 in H5T__path_find_real (src=src@entry=0x5620c42d32d0,
dst=dst@entry=0x5620c42dfa50, name=name@entry=0x0, conv=conv@entry=0x7ffe1ba51450) at H5T.c:4875
#3 0x00005620c36471b7 in H5T_path_find (src=src@entry=0x5620c42d32d0, dst=dst@entry=0x5620c42dfa50)
at H5T.c:4679
#4 0x00005620c346742f in H5D__typeinfo_init (dset=dset@entry=0x5620c42c7470,
mem_type_id=mem_type_id@entry=216172782113784134, do_write=do_write@entry=false,
type_info=type_info@entry=0x7ffe1ba51560) at H5Dio.c:976
#5 0x00005620c34689a7 in H5D__read (dataset=dataset@entry=0x5620c42c7470,
mem_type_id=mem_type_id@entry=216172782113784134, mem_space=0x5620c42ca0f0,
file_space=0x5620c42ca0f0, buf=buf@entry=0x5620c42dfc00) at H5Dio.c:462
#6 0x00005620c34694d5 in H5Dread (dset_id=<optimized out>, mem_type_id=216172782113784134,
mem_space_id=0, file_space_id=0, dxpl_id=720575940379279368, buf=0x5620c42dfc00) at H5Dio.c:197
#7 0x00005620c335af2f in hdf5_read_char (fid=72057594037927936,
item=0x5620c37514f8 "/entry/instrument/detector/detectorSpecific/data_collection_date")
at imginfo.c:2420
#8 0x00005620c335305c in get_header_eiger (
path=0x7ffe1ba568d5 "/data/20240429_BMS-0003_11_1_1_master.h5", imgnum=0, h=0x7ffe1ba51e90)
at imginfo.c:917
#9 0x00005620c335248e in get_header (buffer=0x7ffe1ba51fc0 "\211HDF\r\n\032\n", h=0x7ffe1ba51e90,
path=0x7ffe1ba568d5 "/data/20240429_BMS-0003_11_1_1_master.h5", imgnum=0) at imginfo.c:695
#10 0x00005620c3351fa1 in main (argc=0, argv=0x7ffe1ba56108) at imginfo.c:570
OK. From that it looks like the problem is that the strings are UTF-8, but only for the fixed-length strings and not the variable-length strings? I am not sure why that would be the case.
It seems to me that the reading application is the problem, but without seeing its code I am not sure. h5dump, h5ls and h5py can all read these datasets fine. That being said, if it fixes this issue we can change the fixed-length string datasets to also be variable-length - there is no particular reason for them to be fixed length; it just seemed more optimal.
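For reference, the two layouts can be reproduced with h5py directly. A minimal sketch (assumes h5py is installed; the dataset names and value are made up, and an in-memory file is used so nothing touches disk):

```python
import h5py
import numpy as np

with h5py.File("demo.h5", "w", driver="core", backing_store=False) as f:
    # np.bytes_ (np.string_ in older NumPy) gives a fixed-length,
    # null-padded byte string, stored as a fixed-width HDF5 string
    fixed = f.create_dataset("fixed", data=np.bytes_("EIGER 2XE 16M"))
    # h5py.string_dtype() gives a variable-length HDF5 string instead
    varlen = f.create_dataset(
        "varlen",
        data="EIGER 2XE 16M",
        dtype=h5py.string_dtype(encoding="ascii"),
    )
    print(fixed.dtype)   # fixed-width bytes, e.g. |S13
    print(varlen.dtype)  # object dtype -> variable-length string
```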
create_dataset("name", data=np.string_(name_str))
This is where the string conversion is done with h5py: https://github.com/odin-detector/odin-data/blob/master/python/src/odin_data/meta_writer/hdf5dataset.py#L417C22-L417C71
This is where these datasets are defined: https://github.com/dls-controls/eiger-detector/blob/master/python/src/eiger_detector/data/eiger_meta_writer.py
We would like to try changing them all to length=None and encoding="ascii".
As part of this I will also change the types where they are obviously wrong, e.g. omega_start is a float, not an int.
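In h5py terms, length=None with encoding="ascii" corresponds to a variable-length ASCII string type. A hedged sketch of the equivalent call (this is not the odin-data code, and the detector number is invented):

```python
import h5py

# Variable-length ASCII string: the h5py equivalent of
# length=None, encoding="ascii" in a dataset definition
dt = h5py.string_dtype(encoding="ascii", length=None)

with h5py.File("demo.h5", "w", driver="core", backing_store=False) as f:
    ds = f.create_dataset("detector_number", data="E-32-0123", dtype=dt)
    # h5py returns bytes for string data by default
    print(ds[()])  # b'E-32-0123'
```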
Upon further investigation, we need to keep the fixed lengths of the strings. However, we would like there to be no chunking in this case, see https://diamondlightsource.slack.com/archives/G017HTC55M1/p1716567840558099
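If the fixed lengths stay, chunking can be dropped independently: in h5py a dataset is stored contiguously (unchunked) unless chunks, maxshape or compression are requested. A sketch, with an invented date value:

```python
import h5py
import numpy as np

with h5py.File("demo.h5", "w", driver="core", backing_store=False) as f:
    # No chunks/maxshape/compression arguments -> contiguous layout
    ds = f.create_dataset("data_collection_date",
                          data=np.bytes_("2024-04-29T10:00:00"))
    print(ds.chunks)  # None means contiguous, i.e. not chunked
```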
I think with the linked PR merged we can call this done. The chunking discussion is for another day.
Currently, all the datasets in the _dectris group of the *_meta.h5 file written for Eiger detectors which should be strings are stored as chunked arrays of null-padded strings. This is causing processing to fail because the master file written at collection time links to some of these fields, but the HDF5 library raises an error when trying to convert between this data type and ASCII.
Datasets that should be strings (may have missed a couple):
detector_number
sensor_material
eiger_fw_version
data_collection_date
description
software_version
trigger_mode
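The string type of each listed dataset (fixed vs variable length, ASCII vs UTF-8 charset) can be checked with h5py's low-level API. A sketch, using a throwaway in-memory file with two mock datasets in place of a real *_meta.h5:

```python
import h5py
import numpy as np

with h5py.File("demo.h5", "w", driver="core", backing_store=False) as f:
    g = f.create_group("_dectris")
    # Mock fixed-length (null-padded bytes) and variable-length datasets
    g.create_dataset("detector_number", data=np.bytes_("E-32-0123"))
    g.create_dataset("description", data="EIGER2 XE 16M",
                     dtype=h5py.string_dtype(encoding="utf-8"))
    for name, ds in g.items():
        tid = ds.id.get_type()  # low-level HDF5 string type
        kind = "variable" if tid.is_variable_str() else "fixed(%d)" % tid.get_size()
        cset = "UTF-8" if tid.get_cset() == h5py.h5t.CSET_UTF8 else "ASCII"
        print(name, kind, cset)
```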
What the datasets look like now:
Please can these fields be saved as unicode strings, the same as the name dataset is?