Closed: jmp1985 closed this issue 2 years ago.
The current implementation of reading Dectris HDF5 files (created in 2013) is here:
dxtbx/format/FormatHDF5Dectris.py
This is a dxtbx wrapper of an iotbx class:
def _start(self):
    from iotbx.detectors.eiger import EIGERImage
    self.detectorbase = EIGERImage(self._image_file)
    self.detectorbase.readHeader()
The format class is currently disabled (its understand method returns False), but when the iotbx class was created it was functional in the image viewer. The Dectris specification has likely changed since then.
This code should either go away or be used as a base for natively reading Eiger files.
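For context, "disabled" here means the class's static understand() method rejects every file, so the format registry never selects it. The sketch below is purely illustrative (the class name and body are hypothetical, not the real FormatHDF5Dectris code):

```python
# Illustrative sketch only: a dxtbx-style format class is taken out of
# service by making its static understand() method return False.
# Class name and behaviour here are hypothetical.
class DisabledFormatSketch(object):
    @staticmethod
    def understand(image_file):
        # The format registry asks every class whether it recognises the
        # file; returning False unconditionally disables this class for
        # all files.
        return False

print(DisabledFormatSketch.understand("lyso_master.h5"))  # -> False
```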
On Mon, Feb 29, 2016 at 3:45 AM, jmp1985 notifications@github.com wrote:
Current solution of using an in-memory fixed file or saving to tmp does not work all the time. Don't want to duplicate loads of code, so I should just make the current code work for both.
Hi James,
I came up with a simple solution to this read-only file problem. We can keep target file names in memory and open them separately if we cannot follow a link.
In FormatHDFEigerNearlyNexus.py:
# Change relative paths to absolute paths
for name in handle_orig['/entry/data'].iterkeys():
    del handle['entry/data'][name]
    filename = handle_orig['entry/data'][name].file.filename
    handle['entry/data'][name] = h5py.ExternalLink(filename, 'entry/data/data')
    handle['entry/data']["_filename_" + name] = filename  # Store file names
In nexus.py:
class DataFactory(object):
    def __init__(self, obj):
        import h5py
        datasets = []
        for key in sorted(list(obj.handle.iterkeys())):
            if key.startswith("_filename_"):
                continue
            try:
                datasets.append(obj.handle[key])
            except KeyError:
                # If we cannot follow links due to lack of write permission
                datasets.append(h5py.File(obj.handle["_filename_" + key].value, "r")["/entry/data/data"])
        self.model = DataList(datasets)
Thanks @biochem-fan. This looks like a nice solution. I'll add this change and verify that it works.
It is not present in the original file. I created it on the fly (in the memory) to pass target file names of links from the Fixer class to the DataFactory.
One concern is that if the number of files and threads is too large, we might exceed the maximum number of open files per process (1024 by default on Linux; see ulimit -a). But since we use sub-processes, not threads, during spot finding and integration, I don't think we actually reach the limit.
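For reference, the per-process limit being discussed can be queried from Python on Unix systems via the standard resource module:

```python
import resource

# Soft and hard limits on open file descriptors for this process;
# the soft limit is what `ulimit -n` reports (often 1024 on Linux).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("soft limit:", soft, "hard limit:", hard)
```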
Yes, I realised this after looking at the bit of code at the top. Regarding the open file limit: presumably this should only be a problem if we have > 102400 images (i.e. with 100 images per block, I think)?
You are right.
If this is a problem, we can defer file opening to DataList class.
OK, so I've verified that this works. I'm closing this for now.
Thanks for merging!
This can be a problem if there is one image file per data file (don't ask why!), presumably running up against the Unix open-file limit:
~/23IDB_2016_06_15/EigerTests/LysoA4/xia2_dials 18> dials.import ../lyso02sec_6_1_master.h5
DIALS 1.2.2-g62d8f5d-release
Traceback (most recent call last):
File "/mnt/software/px64/CCP4-7.0.0/ccp4-7.0/lib/python2.7/runpy.py", line 162, in _run_module_as_main
File "/mnt/software/px64/CCP4-7.0.0/ccp4-7.0/lib/python2.7/runpy.py", line 72, in _run_code
File "/mnt/software/px64/CCP4-7.0.0/ccp4-7.0/lib/py2/site-packages/dials/command_line/import.py", line 554, in <module>
File "/mnt/software/px64/CCP4-7.0.0/ccp4-7.0/lib/py2/site-packages/dials/command_line/import.py", line 552, in <module>
File "/mnt/software/px64/CCP4-7.0.0/ccp4-7.0/lib/py2/site-packages/dials/command_line/import.py", line 266, in run
File "/mnt/software/px64/CCP4-7.0.0/ccp4-7.0/lib/py2/site-packages/dials/util/options.py", line 720, in parse_args
File "/mnt/software/px64/CCP4-7.0.0/ccp4-7.0/lib/py2/site-packages/dials/util/options.py", line 393, in parse_args
File "/mnt/software/px64/CCP4-7.0.0/ccp4-7.0/lib/py2/site-packages/dials/util/options.py", line 100, in __init__
File "/mnt/software/px64/CCP4-7.0.0/ccp4-7.0/lib/py2/site-packages/dials/util/options.py", line 154, in try_read_datablocks_from_images
File "/mnt/software/px64/CCP4-7.0.0/ccp4-7.0/lib/py2/site-packages/cctbx_project/dxtbx/datablock.py", line 822, in from_filenames
File "/mnt/software/px64/CCP4-7.0.0/ccp4-7.0/lib/py2/site-packages/cctbx_project/dxtbx/datablock.py", line 448, in __init__
File "/mnt/software/px64/CCP4-7.0.0/ccp4-7.0/lib/py2/site-packages/cctbx_project/dxtbx/datablock.py", line 603, in _create_single_file_imageset
File "/mnt/software/px64/CCP4-7.0.0/ccp4-7.0/lib/py2/site-packages/cctbx_project/dxtbx/format/FormatHDFEigerNearlyNexus.py", line 267, in __init__
File "/mnt/software/px64/CCP4-7.0.0/ccp4-7.0/lib/py2/site-packages/cctbx_project/dxtbx/format/FormatHDF5.py", line 14, in __init__
File "/mnt/software/px64/CCP4-7.0.0/ccp4-7.0/lib/py2/site-packages/cctbx_project/dxtbx/format/Format.py", line 119, in __init__
File "/mnt/software/px64/CCP4-7.0.0/ccp4-7.0/lib/py2/site-packages/cctbx_project/dxtbx/format/Format.py", line 130, in setup
File "/mnt/software/px64/CCP4-7.0.0/ccp4-7.0/lib/py2/site-packages/cctbx_project/dxtbx/format/FormatHDFEigerNearlyNexus.py", line 315, in _start
File "/mnt/software/px64/CCP4-7.0.0/ccp4-7.0/lib/py2/site-packages/cctbx_project/dxtbx/format/nexus.py", line 1112, in __init__
File "/mnt/software/px64/CCP4-7.0.0/ccp4-7.0/lib/python2.7/site-packages/h5py/_hl/files.py", line 231, in __init__
File "/mnt/software/px64/CCP4-7.0.0/ccp4-7.0/lib/python2.7/site-packages/h5py/_hl/files.py", line 78, in make_fid
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper (/home/ccb63/Downloads/devtools/checkout/h5py-2.4.0/h5py/_objects.c:2458)
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper (/home/ccb63/Downloads/devtools/checkout/h5py-2.4.0/h5py/_objects.c:2415)
File "h5py/h5f.pyx", line 72, in h5py.h5f.open (/home/ccb63/Downloads/devtools/checkout/h5py-2.4.0/h5py/h5f.c:1792)
IOError: Please report this error to dials-support@lists.sourceforge.net: Unable to open file (Unable to open file: name = '/mnt/beegfs/eigertest2/23idb_2016_06_15/eigertests/lysoa4/xia2_dials/../lyso02sec_6_1_data_001018.h5', errno = 24, error message = 'too many open files', flags = 0, o_flags = 0)
How many images files are we talking about here? If you can bring the data back with you, then that would be great!
I implemented my suggestion on Mar 3rd.
In __getitem__(), we could leave only one file open (i.e. cache the file handle); if the requested frame is in another file, we close the current handle before opening the new one.
I didn't cache the file handle because it would make the code non-thread-safe. But opening and closing files many times might cause performance degradation.
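For comparison, the cached-handle approach discussed above might look something like this. This is an illustrative sketch with a hypothetical helper class, not the code that was merged, and as noted it is not thread-safe:

```python
class CachedHandleList(object):
    """Illustrative sketch: keep at most one data file open at a time.

    `opener` is any callable returning an object with a close() method,
    e.g. lambda name: h5py.File(name, "r"). Not thread-safe: two threads
    sharing one instance could close each other's handle.
    """
    def __init__(self, filenames, opener):
        self.filenames = filenames
        self.opener = opener
        self._current_name = None
        self._current_handle = None

    def handle_for(self, name):
        # Reuse the cached handle if the same file is requested again;
        # otherwise close it before opening the newly requested file.
        if name != self._current_name:
            if self._current_handle is not None:
                self._current_handle.close()
            self._current_handle = self.opener(name)
            self._current_name = name
        return self._current_handle
```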
I confirmed that I can process a dataset on a read-only file system and that the number of open files was reduced (checked with lsof -p PID). Could you please test it more thoroughly?
Personally, I don't feel like reverting to the previous implementation that makes temporary files, because it created more than two times nproc files to process just one dataset. It quickly filled up our computing nodes' very small /tmp.
Fix to nexus.py:
class DataList(object):
    '''
    A class to make it easier to access the data from multiple datasets.
    FIXME The file should be fixed and this should be removed
    '''
    def __init__(self, obj):
        self.datasets = obj
        self.num_images = 0
        self.lookup = []
        self.offset = [0]
        import h5py
        for i, dataset in enumerate(self.datasets):
            data = None
            if isinstance(dataset, basestring):
                data = h5py.File(dataset, "r")["/entry/data/data"]
            else:
                data = dataset
            self.num_images += data.shape[0]
            self.lookup.extend([i] * data.shape[0])
            self.offset.append(self.num_images)
            if isinstance(dataset, basestring):
                data.file.close()

    def __len__(self):
        return self.num_images

    def __getitem__(self, index):
        from scitbx.array_family import flex
        import h5py
        import numpy as np
        d = self.lookup[index]
        i = index - self.offset[d]
        dataset = None
        if isinstance(self.datasets[d], basestring):
            dataset = h5py.File(self.datasets[d], "r")["/entry/data/data"]
        else:
            dataset = self.datasets[d]
        data = dataset[i, :, :]
        if data.dtype == np.uint16:
            data = data.astype(np.uint32)
        data_as_flex = flex.int(data)
        if isinstance(self.datasets[d], basestring):
            dataset.file.close()
        return data_as_flex


class DataFactory(object):
    def __init__(self, obj):
        datasets = []
        for key in sorted(list(obj.handle.iterkeys())):
            if key.startswith("_filename_"):
                continue
            try:
                datasets.append(obj.handle[key])
            except KeyError:
                # If we cannot follow links due to lack of write permission
                datasets.append(obj.handle["_filename_" + key].value)
        self.model = DataList(datasets)
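The lookup/offset bookkeeping in DataList above maps a global frame index to a (data file, local frame) pair. The arithmetic can be sketched and checked without h5py; the function names here are illustrative, not part of the patch:

```python
def build_index(frame_counts):
    # frame_counts: number of frames in each data file, in order.
    lookup, offset, total = [], [0], 0
    for i, n in enumerate(frame_counts):
        total += n
        lookup.extend([i] * n)   # which file each global frame lives in
        offset.append(total)     # cumulative frame count per file
    return lookup, offset

def locate(lookup, offset, index):
    d = lookup[index]            # data file holding this frame
    return d, index - offset[d]  # local frame number within that file

# e.g. data_000001.h5 holds 3 frames and data_000002.h5 holds 2
lookup, offset = build_index([3, 2])
print(locate(lookup, offset, 4))  # -> (1, 1): second file, second frame
```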
Hi all, is there any resolution to this?
Tagging @biochem-fan. Does your fix still need to be checked in?
@dagewa The resolution is for me to do it! Currently, it still creates a modified HDF file in memory which shouldn't be necessary.
I think our current code is OK unless the dataset consists of pathologically many data_XXXXXX.h5 files (like @rjgildea's).
This is our oldest open issue. Is it still relevant?
Yes. Until the Eiger firmware is writing NeXus files natively, it makes sense to track an issue that basically says "FormatHDF5EigerNearlyNexus.py should not exist". I know DLS is writing NeXus files from Eiger. Remind me, is that creating a new master or fixing the existing master?