cctbx / dxtbx

Diffraction Experiment Toolbox
BSD 3-Clause "New" or "Revised" License
2 stars 18 forks source link

Imageset `paths()` manipulates paths according to OS, breaking filename checking in DIALS image grouping code #771

Open dagewa opened 14 hours ago

dagewa commented 14 hours ago

This code in DIALS gets called by xia2.ssx_reduce:

    def _get_expt_file_to_groupsdata(self, data_file_pairs: List[FilePair]):
        expt_file_to_groupsdata: Dict[Path, GroupsForExpt] = {}

        for fp in data_file_pairs:
            expts = load.experiment_list(fp.expt, check_format=False)
            # need to match the images to the imagesets.
            images = set()
            image_to_group_info = {}
            for iset in expts.imagesets():
                images.update(iset.paths())
            for iset in images:
                image = None
                for ifile in self._files_to_groups_dict.keys():
                    if iset == ifile.name:
                        image = ifile
                        break
                if image is None:
                    raise ValueError(f"Imageset {iset} not found in metadata")
                image_to_group_info[iset] = self._files_to_groups_dict[image]
...

The call to iset.paths() on Windows returns something like this

In [12]: iset.paths()
Out[12]: ['c:\\dls\\mx\\data\\nt30330\\nt30330-15\\VMXi-AB1698\\well_42\\images\\image_58766.nxs']

The original path (a location on the DLS file system) has been manipulated by adding a drive letter and altering the path separator. This causes a failure later on when these paths are checked against cached file names in a dictionary (if iset == ifile.name), because the manipulated path does not match the expected path and we end up with

Error: Imageset c:\dls\mx\data\nt30330\nt30330-15\VMXi-AB1698\well_42\images\image_58766.nxs not found in metadata
dagewa commented 13 hours ago

This is the immediate cause of this test failure on Windows https://github.com/xia2/xia2/pull/824#issuecomment-2491813579

dagewa commented 13 hours ago

In this case, paths() is

In [11]: iset.paths
Out[11]: <bound method _.paths of <dxtbx.imageset.ImageSetLazy object at 0x000001CCFBEB6A70>>
ndevenish commented 13 hours ago

ImageSetLazy

Should we be expecting to see this anywhere outside of stills_process? I thought this was only for handling inside there?

dagewa commented 13 hours ago

No idea about ImageSetLazy...

dagewa commented 13 hours ago

Short reproducer of the issue

>>> from dxtbx.model.experiment_list import ExperimentList
>>> el=ExperimentList.from_file("C:/Users/fcx32934/.cache/dials_data/dtpb_serial_processed/well39_batch12_integrated.expt", check_format=False)
>>> el[0].imageset.paths()
['c:\\dls\\mx\\data\\nt30330\\nt30330-15\\VMXi-AB1698\\well_39\\images\\image_58763.nxs']
ndevenish commented 11 hours ago

I think here is where it goes wrong: https://github.com/cctbx/dxtbx/blob/ea6fd5f998c7103e920e8b44ddb672165be8a0f1/src/dxtbx/model/experiment_list.py#L521-L524

Calling this: https://github.com/cctbx/dxtbx/blob/ea6fd5f998c7103e920e8b44ddb672165be8a0f1/src/dxtbx/serialize/filename.py#L6-L24

ndevenish commented 11 hours ago

.. I think that this handling is possibly necessary because we retain the ability to have relative path names, which AFAIK we only use in tests? If this was always required to be absolute, we could probably skip this?

Or alternatively, only do this replacement if it ends up pointing to a file that actually exists, otherwise leave it as the original filename?

dagewa commented 11 hours ago

Damn, your mention of resolve_path triggered some stirring of the grey matter and I went back and found this https://github.com/cctbx/dxtbx/issues/613. I have been in this dungeon before.

ndevenish commented 11 hours ago

I wonder if this is actually the best way to resolve:

only do this replacement if it ends up pointing to a file that actually exists, otherwise leave it as the original filename?

dagewa commented 10 hours ago

That makes a lot of sense