Change the data type of datasets in `_dectris` group to string

noemifrisina commented 4 months ago

Currently all the datasets the _dectris group of the *_meta.h5 written for Eiger detectors which should be strings are stored as a chunked array of null-padded strings. This is causing processing to fail as the master file written at collection time links to some of these fields but the HDF5 library raises an error when trying to convert between this data type and ASCII.

Datasets that should be strings (may have missed a couple): detector_number, sensor_material, eiger_fw_version, data_collection_date, description, software_version, trigger_mode

What the datasets look like now:

 h5ls -dv ins_7_1_1_meta.h5/_dectris/data_collection_date
Opened "ins_7_1_1_meta.h5" with sec2 driver.
data_collection_date     Dataset {1/Inf}
    Location:  1:9800
    Links:     1
    Chunks:    {64} 6400 bytes
    Storage:   100 logical bytes, 6400 allocated bytes, 1.56% utilization
    Type:      100-byte null-padded UTF-8 string
    Data:
        (0) "2024-04-10T11:40:53.249+02:00" '\000' repeats 70 times

Please can these fields be saved as unicode strings, same as the name dataset is

h5ls -vd ins_1_22_1.nxs/entry/instrument/name
Opened "ins_1_22_1.nxs" with sec2 driver.
name                     Dataset {SCALAR}
    Attribute: short_name scalar
        Type:      7-byte null-padded ASCII string
        Data:  "DLS i03"
    Location:  1:128120
    Links:     1
    Storage:   20 logical bytes, 20 allocated bytes, 100.00% utilization
    Type:      20-byte null-padded ASCII string
    Data:
        (0) "DIAMOND BEAMLINE i03"

GDYendell commented 4 months ago

Could you share the full paths to those example files?

You have listed datasets that we try to give a sensible fixed length for, rather than have it be a variable-length. I think we could just take out the lengths in the dataset definitions.

I don't really understand what the actual problem is with fixed length strings, though. What is the error?

noemifrisina commented 4 months ago

The main problem is that autoPROC fails when trying to open master files linked to those fields as they are now. You can find an example here /dls/mx-scratch/yqf46943/error-h5

Since the error wasn't very specific, @rtuck99 run it into a debugger and turns out that the problem comes from the HDF5 library. Here's the error trace for one of them, its seems to be a conversion issue

################# File = /data/20240429_BMS-0003_11_1_1_master.h5
 >>> Image format detected as HDF5/Eiger
Breakpoint 2, H5E_printf_stack (estack=estack@entry=0x0, file=file@entry=0x5620c37ca465 "H5Tconv.c", func=func@entry=0x5620c37cda78 <__func__.475> "H5T__conv_s_s", line=line@entry=4500, cls_id=792633534417207296, maj_id=864691128455135261, min_id=864691128455135388, fmt=0x5620c37cab60 "The library doesn't convert between strings of ASCII and UTF") at H5Eint.c:672
672 {
(gdb) bt
#0  H5E_printf_stack (estack=estack@entry=0x0, file=file@entry=0x5620c37ca465 "H5Tconv.c",
    func=func@entry=0x5620c37cda78 <__func__.475> "H5T__conv_s_s", line=line@entry=4500,
    cls_id=792633534417207296, maj_id=864691128455135261, min_id=864691128455135388,
    fmt=0x5620c37cab60 "The library doesn't convert between strings of ASCII and UTF") at H5Eint.c:672
#1  0x00005620c365c9e7 in H5T__conv_s_s (src_id=<optimized out>, dst_id=216172782113784136,
    cdata=0x5620c42dfce0, nelmts=0, buf_stride=0, bkg_stride=<optimized out>, buf=0x0, bkg=0x0)
    at H5Tconv.c:4500
#2  0x00005620c3644ec7 in H5T__path_find_real (src=src@entry=0x5620c42d32d0,
    dst=dst@entry=0x5620c42dfa50, name=name@entry=0x0, conv=conv@entry=0x7ffe1ba51450) at H5T.c:4875
#3  0x00005620c36471b7 in H5T_path_find (src=src@entry=0x5620c42d32d0, dst=dst@entry=0x5620c42dfa50)
    at H5T.c:4679
#4  0x00005620c346742f in H5D__typeinfo_init (dset=dset@entry=0x5620c42c7470,
    mem_type_id=mem_type_id@entry=216172782113784134, do_write=do_write@entry=false,
    type_info=type_info@entry=0x7ffe1ba51560) at H5Dio.c:976
#5  0x00005620c34689a7 in H5D__read (dataset=dataset@entry=0x5620c42c7470,
    mem_type_id=mem_type_id@entry=216172782113784134, mem_space=0x5620c42ca0f0,
    file_space=0x5620c42ca0f0, buf=buf@entry=0x5620c42dfc00) at H5Dio.c:462
#6  0x00005620c34694d5 in H5Dread (dset_id=<optimized out>, mem_type_id=216172782113784134,
    mem_space_id=0, file_space_id=0, dxpl_id=720575940379279368, buf=0x5620c42dfc00) at H5Dio.c:197
#7  0x00005620c335af2f in hdf5_read_char (fid=72057594037927936,
    item=0x5620c37514f8 "/entry/instrument/detector/detectorSpecific/data_collection_date")
    at imginfo.c:2420
#8  0x00005620c335305c in get_header_eiger (
    path=0x7ffe1ba568d5 "/data/20240429_BMS-0003_11_1_1_master.h5", imgnum=0, h=0x7ffe1ba51e90)
    at imginfo.c:917
#9  0x00005620c335248e in get_header (buffer=0x7ffe1ba51fc0 "\211HDF\r\n\032\n", h=0x7ffe1ba51e90,
    path=0x7ffe1ba568d5 "/data/20240429_BMS-0003_11_1_1_master.h5", imgnum=0) at imginfo.c:695
#10 0x00005620c3351fa1 in main (argc=0, argv=0x7ffe1ba56108) at imginfo.c:570

GDYendell commented 4 months ago

OK. From that it looks like the problem is it that the strings are UTF-8, but only for the fixed-length strings and not variable-length strings? I am not sure why that would be the case.

It seems to me it is the reading application that is problem, but without seeing the code I am not sure. h5dump, h5ls and h5py can read these datasets fine. That being said, if it fixes this issue we can change the fixed-length string datasets to also be variable-length - there is no particular reason for them to be fixed length, it just seemed more optimal.

DominicOram commented 4 months ago

create_dataset("name", data=np.string_(name_str))

DominicOram commented 4 months ago

This is where the string conversion is done with h5py: https://github.com/odin-detector/odin-data/blob/master/python/src/odin_data/meta_writer/hdf5dataset.py#L417C22-L417C71

This is where these datasets are defined: https://github.com/dls-controls/eiger-detector/blob/master/python/src/eiger_detector/data/eiger_meta_writer.py

We would like to try changing all to length=None and encoding="ascii"

DominicOram commented 4 months ago

As part of this I will also change the types where they are obviously wrong e.g. omega_start is a float not an int

DominicOram commented 3 months ago

Upon further investigation we need to keep the fixed length of the strings. However, we would like there to be no chunking in this case, see https://diamondlightsource.slack.com/archives/G017HTC55M1/p1716567840558099

DominicOram commented 3 months ago

I think with the linked PR merged that we can call this done. Chunking discussion is for another day

DiamondLightSource / ADOdin

Change the data type of datasets in `_dectris` group to string #21