Closed: jmp1985 closed this issue 2 years ago.
The current implementation of reading Dectris HDF5 files (created in 2013) is here:
dxtbx/format/FormatHDF5Dectris.py
This is a dxtbx wrapper of an iotbx class:
def _start(self):
    from iotbx.detectors.eiger import EIGERImage
    self.detectorbase = EIGERImage(self._image_file)
    self.detectorbase.readHeader()
The format class is currently disabled (its understand method returns False), but when the iotbx class was created it was functional in the image viewer. The Dectris specification has likely changed since then.
This code should either go away or be used as a base for natively reading Eiger files.
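For context, "disabled" here means the class's static understand() method rejects every file, so the format registry never selects it. The sketch below is purely illustrative (the class name and body are hypothetical, not the real FormatHDF5Dectris code):

```python
# Illustrative sketch only: a dxtbx-style format class is taken out of
# service by making its static understand() method return False.
# Class name and behaviour here are hypothetical.
class DisabledFormatSketch(object):
    @staticmethod
    def understand(image_file):
        # The format registry asks every class whether it recognises the
        # file; returning False unconditionally disables this class for
        # all files.
        return False

print(DisabledFormatSketch.understand("lyso_master.h5"))  # -> False
```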
On Mon, Feb 29, 2016 at 3:45 AM, jmp1985 notifications@github.com wrote:
Current solution of using an in-memory fixed file or saving to tmp does not work all the time. Don't want to duplicate loads of code, so I should just make the current code work for both.
Hi James,
I came up with a simple solution to this read-only file problem. We can keep target file names in memory and open them separately if we cannot follow a link.
In FormatHDFEigerNearlyNexus.py:
# Change relative paths to absolute paths
for name in handle_orig['/entry/data'].iterkeys():
    del handle['entry/data'][name]
    filename = handle_orig['entry/data'][name].file.filename
    handle['entry/data'][name] = h5py.ExternalLink(filename, 'entry/data/data')
    handle['entry/data']["_filename_" + name] = filename  # Store file names
In nexus.py:
class DataFactory(object):
    def __init__(self, obj):
        import h5py
        datasets = []
        for key in sorted(list(obj.handle.iterkeys())):
            if key.startswith("_filename_"):
                continue
            try:
                datasets.append(obj.handle[key])
            except KeyError:
                # If we cannot follow links due to lack of write permission
                datasets.append(h5py.File(obj.handle["_filename_" + key].value, "r")["/entry/data/data"])
        self.model = DataList(datasets)
Thanks @biochem-fan. This looks like a nice solution. I'll add this change and verify that it works.
It is not present in the original file. I created it on the fly (in the memory) to pass target file names of links from the Fixer class to the DataFactory.
One concern is that if the number of files and threads is too large, we might exceed the maximum number of open files per process (1024 by default on Linux; see ulimit -a). But since we use sub-processes, not threads, during spot finding and integration, I don't think we actually reach the limit.
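For reference, the per-process limit being discussed can be queried from Python on Unix systems via the standard resource module:

```python
import resource

# Soft and hard limits on open file descriptors for this process;
# the soft limit is what `ulimit -n` reports (often 1024 on Linux).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("soft limit:", soft, "hard limit:", hard)
```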
Yes, I realised this after looking at the bit of code at the top. Regarding the open file limit: presumably this should only be a problem if we have > 102400 images (i.e. with 100 images per block, I think)?
You are right.
If this is a problem, we can defer file opening to DataList class.
OK, so I've verified that this works. I'm closing this for now.
Thanks for merging!
This can be a problem if there is one image file per data file (don't ask why!), presumably running up against the Unix open-file limit:
~/23IDB_2016_06_15/EigerTests/LysoA4/xia2_dials 18> dials.import ../lyso02sec_6_1_master.h5
DIALS 1.2.2-g62d8f5d-release
Traceback (most recent call last):
File "/mnt/software/px64/CCP4-7.0.0/ccp4-7.0/lib/python2.7/runpy.py", line 162, in _run_module_as_main
File "/mnt/software/px64/CCP4-7.0.0/ccp4-7.0/lib/python2.7/runpy.py", line 72, in _run_code
File "/mnt/software/px64/CCP4-7.0.0/ccp4-7.0/lib/py2/site-packages/dials/command_line/import.py", line 554, in <module>
File "/mnt/software/px64/CCP4-7.0.0/ccp4-7.0/lib/py2/site-packages/dials/command_line/import.py", line 552, in <module>
File "/mnt/software/px64/CCP4-7.0.0/ccp4-7.0/lib/py2/site-packages/dials/command_line/import.py", line 266, in run
File "/mnt/software/px64/CCP4-7.0.0/ccp4-7.0/lib/py2/site-packages/dials/util/options.py", line 720, in parse_args
File "/mnt/software/px64/CCP4-7.0.0/ccp4-7.0/lib/py2/site-packages/dials/util/options.py", line 393, in parse_args
File "/mnt/software/px64/CCP4-7.0.0/ccp4-7.0/lib/py2/site-packages/dials/util/options.py", line 100, in __init__
File "/mnt/software/px64/CCP4-7.0.0/ccp4-7.0/lib/py2/site-packages/dials/util/options.py", line 154, in try_read_datablocks_from_images
File "/mnt/software/px64/CCP4-7.0.0/ccp4-7.0/lib/py2/site-packages/cctbx_project/dxtbx/datablock.py", line 822, in from_filenames
File "/mnt/software/px64/CCP4-7.0.0/ccp4-7.0/lib/py2/site-packages/cctbx_project/dxtbx/datablock.py", line 448, in __init__
File "/mnt/software/px64/CCP4-7.0.0/ccp4-7.0/lib/py2/site-packages/cctbx_project/dxtbx/datablock.py", line 603, in _create_single_file_imageset
File "/mnt/software/px64/CCP4-7.0.0/ccp4-7.0/lib/py2/site-packages/cctbx_project/dxtbx/format/FormatHDFEigerNearlyNexus.py", line 267, in __init__
File "/mnt/software/px64/CCP4-7.0.0/ccp4-7.0/lib/py2/site-packages/cctbx_project/dxtbx/format/FormatHDF5.py", line 14, in __init__
File "/mnt/software/px64/CCP4-7.0.0/ccp4-7.0/lib/py2/site-packages/cctbx_project/dxtbx/format/Format.py", line 119, in __init__
File "/mnt/software/px64/CCP4-7.0.0/ccp4-7.0/lib/py2/site-packages/cctbx_project/dxtbx/format/Format.py", line 130, in setup
File "/mnt/software/px64/CCP4-7.0.0/ccp4-7.0/lib/py2/site-packages/cctbx_project/dxtbx/format/FormatHDFEigerNearlyNexus.py", line 315, in _start
File "/mnt/software/px64/CCP4-7.0.0/ccp4-7.0/lib/py2/site-packages/cctbx_project/dxtbx/format/nexus.py", line 1112, in __init__
File "/mnt/software/px64/CCP4-7.0.0/ccp4-7.0/lib/python2.7/site-packages/h5py/_hl/files.py", line 231, in __init__
File "/mnt/software/px64/CCP4-7.0.0/ccp4-7.0/lib/python2.7/site-packages/h5py/_hl/files.py", line 78, in make_fid
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper (/home/ccb63/Downloads/devtools/checkout/h5py-2.4.0/h5py/_objects.c:2458)
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper (/home/ccb63/Downloads/devtools/checkout/h5py-2.4.0/h5py/_objects.c:2415)
File "h5py/h5f.pyx", line 72, in h5py.h5f.open (/home/ccb63/Downloads/devtools/checkout/h5py-2.4.0/h5py/h5f.c:1792)
IOError: Please report this error to dials-support@lists.sourceforge.net: Unable to open file (Unable to open file: name = '/mnt/beegfs/eigertest2/23idb_2016_06_15/eigertests/lysoa4/xia2_dials/../lyso02sec_6_1_data_001018.h5', errno = 24, error message = 'too many open files', flags = 0, o_flags = 0)
How many images files are we talking about here? If you can bring the data back with you, then that would be great!
I implemented my suggestion on Mar 3rd.
In __getitem__(), we could leave only one file open (i.e. cache the file handle); if the requested frame is in another file, we close the current handle before opening the new one.
I didn't cache the file handle because it would make the code non-thread-safe. But opening and closing files many times might cause performance degradation.
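For comparison, the cached-handle approach discussed above might look something like this. This is an illustrative sketch with a hypothetical helper class, not the code that was merged, and as noted it is not thread-safe:

```python
class CachedHandleList(object):
    """Illustrative sketch: keep at most one data file open at a time.

    `opener` is any callable returning an object with a close() method,
    e.g. lambda name: h5py.File(name, "r"). Not thread-safe: two threads
    sharing one instance could close each other's handle.
    """
    def __init__(self, filenames, opener):
        self.filenames = filenames
        self.opener = opener
        self._current_name = None
        self._current_handle = None

    def handle_for(self, name):
        # Reuse the cached handle if the same file is requested again;
        # otherwise close it before opening the newly requested file.
        if name != self._current_name:
            if self._current_handle is not None:
                self._current_handle.close()
            self._current_handle = self.opener(name)
            self._current_name = name
        return self._current_handle
```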
I confirmed that I can process a dataset on a read-only file system and that the number of open files was reduced (checked with lsof -p PID). Could you please test it more thoroughly?
Personally, I don't feel like reverting to the previous implementation that makes temporary files, because it created more than two times nproc files to process just one dataset. It quickly filled up our computing nodes' very small /tmp.
Fix to nexus.py:
class DataList(object):
    '''
    A class to make it easier to access the data from multiple datasets.
    FIXME The file should be fixed and this should be removed
    '''
    def __init__(self, obj):
        self.datasets = obj
        self.num_images = 0
        self.lookup = []
        self.offset = [0]
        import h5py
        for i, dataset in enumerate(self.datasets):
            data = None
            if isinstance(dataset, basestring):
                data = h5py.File(dataset, "r")["/entry/data/data"]
            else:
                data = dataset
            self.num_images += data.shape[0]
            self.lookup.extend([i] * data.shape[0])
            self.offset.append(self.num_images)
            if isinstance(dataset, basestring):
                data.file.close()

    def __len__(self):
        return self.num_images

    def __getitem__(self, index):
        from scitbx.array_family import flex
        import h5py
        import numpy as np
        d = self.lookup[index]
        i = index - self.offset[d]
        dataset = None
        if isinstance(self.datasets[d], basestring):
            dataset = h5py.File(self.datasets[d], "r")["/entry/data/data"]
        else:
            dataset = self.datasets[d]
        data = dataset[i, :, :]
        if data.dtype == np.uint16:
            data = data.astype(np.uint32)
        data_as_flex = flex.int(data)
        if isinstance(self.datasets[d], basestring):
            dataset.file.close()
        return data_as_flex


class DataFactory(object):
    def __init__(self, obj):
        datasets = []
        for key in sorted(list(obj.handle.iterkeys())):
            if key.startswith("_filename_"):
                continue
            try:
                datasets.append(obj.handle[key])
            except KeyError:
                # If we cannot follow links due to lack of write permission
                datasets.append(obj.handle["_filename_" + key].value)
        self.model = DataList(datasets)
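The lookup/offset bookkeeping in DataList above maps a global frame index to a (data file, local frame) pair. The arithmetic can be sketched and checked without h5py; the function names here are illustrative, not part of the patch:

```python
def build_index(frame_counts):
    # frame_counts: number of frames in each data file, in order.
    lookup, offset, total = [], [0], 0
    for i, n in enumerate(frame_counts):
        total += n
        lookup.extend([i] * n)   # which file each global frame lives in
        offset.append(total)     # cumulative frame count per file
    return lookup, offset

def locate(lookup, offset, index):
    d = lookup[index]            # data file holding this frame
    return d, index - offset[d]  # local frame number within that file

# e.g. data_000001.h5 holds 3 frames and data_000002.h5 holds 2
lookup, offset = build_index([3, 2])
print(locate(lookup, offset, 4))  # -> (1, 1): second file, second frame
```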
Hi all, is there any resolution to this?
Tagging @biochem-fan. Does your fix still need to be checked in?
@dagewa The resolution is for me to do it! Currently, it still creates a modified HDF file in memory which shouldn't be necessary.
I think our current code is OK unless the dataset consists of pathologically many data_XXXXXX.h5 files (like @rjgildea's).
This is our oldest open issue. Is it still relevant?
Yes. Until the Eiger firmware is writing NeXus files natively, it makes sense to track an issue that basically says "FormatHDF5EigerNearlyNexus.py should not exist". I know DLS is writing NeXus files from Eiger. Remind me, is that creating a new master or fixing the existing master?