European-XFEL / EXtra-data

Access saved EuXFEL data
https://extra-data.rtfd.io
BSD 3-Clause "New" or "Revised" License

Issue with XtdfDetectorBase._collect_inner_ids() for some specific runs of raw AGIPD data. #513

Closed: turkot closed this issue 7 months ago

turkot commented 7 months ago

Collecting inner IDs fails for certain runs of raw data in p2746 and p2995, which Janusz Malka is currently trying to reduce. The issue can be reproduced with:

from extra_data import RunDirectory
from extra_data.components import identify_multimod_detectors

run_folder = '/gpfs/exfel/data/user/xred/data_reduction/002746_case/input_raw/r0018'
run_data = RunDirectory(run_folder)
_, det_class = identify_multimod_detectors(run_data, single=True)
det_data = det_class(run_data)
pulse_ids = det_data._collect_inner_ids('pulseId')

This leads to an error:

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[18], line 8
      6 _, det_class = identify_multimod_detectors(run_data, single=True)
      7 det_data = det_class(run_data)
----> 8 pulse_ids = det_data._collect_inner_ids('pulseId')

File /gpfs/exfel/data/user/turkot/EXtra-data/extra_data/components.py:568, in XtdfDetectorBase._collect_inner_ids(self, field)
    552 def _collect_inner_ids(self, field='pulseId'):
    553     """
    554     Gather pulse/cell ID labels for all modules and check consistency.
    555 
   (...)
    566       Array of pulse/cell IDs per frame common for all detector modules.
    567     """
--> 568     inner_ids = self._read_inner_ids(field)
    569     # Sanity checks on pulse IDs
    570     inner_ids_min: np.ndarray = inner_ids.min(axis=1)

File /gpfs/exfel/data/user/turkot/EXtra-data/extra_data/components.py:540, in XtdfDetectorBase._read_inner_ids(self, field)
    538 dset = chunk.dataset
    539 unwanted_dim = (dset.ndim > 1)  and (dset.shape[1] == 1)
--> 540 for tgt_slice, chunk_slice in self._split_align_chunk(
    541         chunk, self.train_ids_perframe
    542 ):
    543     # Select the matching data and add it to pulse_ids
    544     # In some cases, there's an extra dimension of length 1.
    545     matched = chunk.dataset[chunk_slice]
    546     if unwanted_dim:

File /gpfs/exfel/data/user/turkot/EXtra-data/extra_data/components.py:258, in MultimodDetectorBase._split_align_chunk(chunk, target_train_ids)
    253 tgt_start = (target_train_ids == chunk_tids[0]).nonzero()[0][0]
    255 target_tids = target_train_ids[
    256     tgt_start : tgt_start + len(chunk_tids)
    257 ]
--> 258 assert target_tids.shape == chunk_tids.shape, \
    259     f"{target_tids.shape} != {chunk_tids.shape}"
    260 assert target_tids[0] == chunk_tids[0], \
    261     f"{target_tids[0]} != {chunk_tids[0]}"
    263 # How much of this chunk can be mapped in one go?

AssertionError: (352,) != (3168,)

The same reproduction snippet works fine for most other raw runs, for example:

run_folder = '/gpfs/exfel/data/user/xred/data_reduction/002746_case/input_raw/r0019'
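
The assertion in the traceback compares a slice of the per-frame target train IDs with the train IDs of one chunk. The following is only a minimal sketch of how the two lengths in the message, 352 and 3168, can come apart. It uses plain Python lists instead of NumPy arrays, made-up train IDs, and assumes 352 frames per train (inferred from 3168 = 9 × 352). It is not the actual EXtra-data code path and does not claim to reproduce the exact state of the failing run.

```python
# Illustrative only: made-up train IDs, plain lists instead of NumPy arrays.
FRAMES_PER_TRAIN = 352  # assumption, inferred from 3168 == 9 * 352


def per_frame(train_ids, frames=FRAMES_PER_TRAIN):
    """Repeat each train ID once per frame, like a per-frame train ID array."""
    return [tid for tid in train_ids for _ in range(frames)]


# A chunk holding 9 trains (3168 frames), starting at hypothetical train 100.
chunk_tids = per_frame(range(100, 109))

# Target train IDs in which train 100 happens to be the last entry.
target_train_ids = per_frame(range(92, 101))

# Same slicing pattern as shown in the traceback.
tgt_start = target_train_ids.index(chunk_tids[0])
target_tids = target_train_ids[tgt_start:tgt_start + len(chunk_tids)]

# Slicing past the end does not raise; it silently returns fewer entries,
# so a later length (shape) comparison is what fails.
print(len(target_tids), len(chunk_tids))
```

With these made-up values, only one train's worth of frames remains after the start position, so the slice is shorter than the chunk.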

turkot commented 7 months ago

As @takluyver found, the source of this issue is a problem in the raw data: some trains have wrong train ID values, for example: 1213530111 1213530113 1213595648 1213530114 1213530115 (here train ID 1213595648 is ~65k higher than its neighbours). Such trains can be excluded by opening the run with inc_suspect_trains=False, for example:

run_data = RunDirectory(run_folder, inc_suspect_trains=False)
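
To make the pattern in the example train IDs concrete, here is a small sketch that flags IDs which are larger than some ID appearing later in the sequence. The helper name find_out_of_order is hypothetical and the heuristic is only an illustration; it is not necessarily how EXtra-data itself decides which trains are suspect.

```python
def find_out_of_order(train_ids):
    """Return IDs that are larger than some ID appearing later in the list.

    Hypothetical helper for illustration; not part of EXtra-data.
    """
    suspect = []
    smallest_later = None
    # Walk backwards, tracking the smallest ID seen so far (i.e. later in time).
    for tid in reversed(train_ids):
        if smallest_later is not None and tid > smallest_later:
            suspect.append(tid)
        else:
            smallest_later = tid
    suspect.reverse()
    return suspect


# Train IDs quoted in the comment above.
example = [1213530111, 1213530113, 1213595648, 1213530114, 1213530115]

print(find_out_of_order(example))  # [1213595648]
print(example[2] - example[1])     # 65535, i.e. the "~65k" jump
```

A check like this can help confirm whether a run that fails in the same way contains such out-of-sequence train IDs before reopening it with inc_suspect_trains=False.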

I'm closing this issue since it can be avoided by using an already existing option.