`FormatEigerNXmxFilewriter.get_raw_data` fails when number of linked data files is large

spmeisburger commented 7 months ago

This bug surfaced when we took a dataset with a huge number of frames (writing ~1100 data.h5 files). When we went to run dials.find_spots on this dataset, DIALS was extremely slow to start, and it failed with an error:

unable to open external link file name = '[...]_data_001013.h5'.

I think the error is related to our use of NFS for data storage, i.e. there were too many file handles open at once. On a dataset with < 1000 data*.h5 files, dials.find_spots ran without an error but was extremely slow.

The traceback refers to this line in FormatEigerNXmxFilewriter.get_raw_data:

data_subsets = [v for k, v in sorted(nxdata.items()) if DATA_FILE_RE.match(k)]

https://github.com/cctbx/dxtbx/blob/f9013668291ff4bd8d1725178275444d72ac2fd1/src/dxtbx/format/FormatNXmxEigerFilewriter.py#L105C5-L105C83

It looks like every single linked _data.h5 file is loaded here, for every call to get_raw_data(), which is kind of crazy. However, doing this appears to be essential because the number of images per data file is not stored anywhere, as far as I can tell.

I made a patch in our local DIALS installation that mostly solves the problem, here: https://github.com/FlexXBeamline/dials-extensions/blob/faster-read-raw/dials_extensions/FormatNXmxEigerFilewriterCHESS.py

The patch only loads the first data file in the series, and uses it to determine the data shape. Perhaps something like this could be incorporated into FormatEigerNXmxFilewriter?

graeme-winter commented 7 months ago

@spmeisburger thanks for raising this - wasn't aware that this nonsense could happen for every get_raw_data() call: I am sure this is handled more gracefully for non-filewriter data.

Meanwhile you can also ulimit -n unlimited which will also make things work

graeme-winter commented 7 months ago

Suspect what we need to do is actually do the iteration step in the _start() method at https://github.com/cctbx/dxtbx/blob/f9013668291ff4bd8d1725178275444d72ac2fd1/src/dxtbx/format/FormatNXmxEigerFilewriter.py#L31 and cache the indexing such that the get_raw_data() method can just seek and read. However, I am also fairly sure that there is a good implementation of this elsewhere in dxtbx which does this from C++ which could be preferable 🤔 - certainly I remember this for the old "nearly nexus" and friends format

biochem-fan commented 7 months ago

I'm not sure if we have a C++ version. Python codes for NearlyNexus are:

https://github.com/cctbx/dxtbx/blob/f9013668291ff4bd8d1725178275444d72ac2fd1/src/dxtbx/format/FormatHDF5EigerNearlyNexus.py#L124 https://github.com/cctbx/dxtbx/blob/f9013668291ff4bd8d1725178275444d72ac2fd1/src/dxtbx/format/nexus.py#L1341

cctbx / dxtbx

`FormatEigerNXmxFilewriter.get_raw_data` fails when number of linked data files is large #695