More flexible `cctbx.xfel.merge` file listing and validation

Baharis commented 1 year ago

Currently, cctbx.xfel.merge is very forgiving in terms of read experiment (expt) and reflections (refl) string identifiers (UUID). The input worker checks whether the expt UUID exists; if it does not, the expt is given a new UUID. Refl UUID is not checked at all: is it not compared against the expt UUID to assert a match, but rather overwritten with expt UUID. This can hide a pair of "matching" expts and refls with two different UUIDs or change refl UUID without notice if expt UUID is missing.

Additionally, the current input worker is somewhat inflexible in the way input expt and refl files are read. Both expt and refl files not only must have identical filename stems, but also must reside in the same directory. When expts and refls are stored in different directories i.e. in diffBragg hopper output, there is no way to read those files using existing functionality at all.

This PR suggests changes to the way the input worker (simple_file_loader) finds input file pairs and asserts a match between experiments and reflections contained in them.

Firstly, the new algorithm fixes all the expt-refl UUID match issues stated in the first paragraph. Now expt and refl UUIDs must EITHER match OR both evaluate to False (e.g. '' or None). In the first case, no additional action is taken. In the second, both expt and matching refl are assigned the same UUID based on the expt metadata using the existing algorithm. However – contrary to previous behavior – if expt and refl UUIDs do not match, the worker terminates with an informative KeyError: Expt and refl identifier mismatch: "expt_UUID" in /expt/path.refl vs "refl_UUID" in /refl/path.refl.

Secondly, I have modified the existing expt–refl path pair (PathPair) matching algorithm to allow a more flexible input of expt / refl files. The new algorithm works as follows:

If expt and refl files reside in the same directory and have matching filename stems, pair and accept them.
If an expt/refl file does not have a matching refl/expt in the same directory, remember its filename stem.
If at any point two unmatched expt/refl files have the same filename stem, terminate with an informative error.
Try matching all unmatched expt files with all unmatched refl files into pairs based on their filename stems.
Discard all remaining unmatched expts and refls.

The new algorithm can be easier explained based on an example. Let's assume the following directory structure (available at NERSC /global/cfs/cdirs/m2859/dtchon/ExptReflMatchExample):

.
├── inputA
│   ├── example1.expt
│   └── example1.refl
├── inputB
│   ├── example2.expt
│   └── example3.expt
├── inputC
│   ├── example1.refl
│   ├── example2.refl
│   └── example3.refl
└── inputD
    └── example3.refl

Let's read the data in this directory a few times using cctbx.xfel.merge input.phil, where input.phil specifies the following: input paths, dispatch.step_list=input, input.experiment=.expt, and input.reflection_suffix=.refl. The following table documents which pairs are ultimately read for each combination of inputA/B/C/D input paths:

Input `A`	Input `B`	Input `C`	Input `D`	Which example pairs are read?
✓				`example1`
	✓			None (no matching pairs)
		✓		None (no matching pairs)
			✓	None (no matching pairs)
✓	✓			`example1` (unmatched expts ignored)
✓		✓		`example1` (unmatched refls ignored)
✓			✓	`example1` (unmatched refl ignored)
	✓	✓		`example2` and `example3`
	✓		✓	`example3`
		✓	✓	Error: two unmatched `example3.refl`s
	✓	✓	✓	Error: two unmatched `example3.refl`s
✓		✓	✓	Error: two unmatched `example3.refl`s
✓	✓		✓	`example1` and `example3`
✓	✓	✓		`example1`–`3` (1.refl in inputC is ignored)
✓	✓	✓	✓	Error: two unmatched `example3.refl`s

I tried to preserve the existing structure of the file_lister.py, but since I could not use the existing generator, this yielded a very confusing file structure while also duplicating most of the code. For maintainability reasons, I opted to rewrite the confusing generator object into a function call instead. Consequently, creating a file list changed from a confusing lister = file_lister(self.params); file_list = list(lister.filepair_generator()) to simple file_list = list_input_pairs(self.params). Unfortunately, as a side effect, while all existing functionality in this file is preserved, the git authorship history in this file is lost.

Baharis commented 1 year ago

@phyy-nx This is the input worker change I discussed during last Tuesday's meeting. As much as I tried preserving the git history structure to make the review and history access easier, this resulted in an ugly and difficult-to-maintain structure with confusing names and duplicates. I opted to rewrite it as a more readable function call. I successfully tested the new reader on my PSII data (no changes) and current diffBragg stage 1 data (correctly raises mismatch errors and reads from two directories). Please let me know what you think and what else I should do here.

Baharis commented 1 year ago

@dermen While all existing functionalities are preserved in this code, I reformatted or moved a number of lines you contributed some 2 years ago concerning alist. I did not have a good reference to test these, so if you keep using input.alist, I would appreciate it if you could try them out on this branch. The behavior should be identical, with the added benefit that cctbx.xfel.merge will now be able to read directly fromhopper_stage_one/expers and hopper_stage_one/refls.

dermen commented 1 year ago

@Baharis , indeed I do use alist quite a bit.. I can try some tests

Baharis commented 1 year ago

@Baharis , indeed I do use alist quite a bit.. I can try some tests

@dermen I am sorry to ask about it, but I figured you might have some ready test-cases. Thank you and let me know if anything's wrong!

dermen commented 1 year ago

It might be useful to test alist functionality beyond my example (which I will test soon)

So, if your merge command is

cctbx.xfel.merge merge.phil input.path=something

assuming you have some *integrated.expt files in something, and that input.experiments_suffix=_integrated.expt (the default), then you can create an alist with some of the files by doing e.g.,

# here 100 is the number of files you want in the alist, change accordingly
ls $PWD/something/*_integrated.expt | head -n 100 > files.txt

and to merge while ignoring those e.g. 100 files you can then do

cctbx.xfel.merge merge.phil input.path=something alist.file=files.txt alist.type=files alist.op=reject

To only merge those 100 files, use alist.op=keep .

the iobs_main.log should report the correct number of inputs ...

Baharis commented 1 year ago

I tested the alist functionality in a few configurations. There was a tiny bug introduced when rewriting the implementation (if {1: 2} is a dict and {1, 2} is a set, what is {} 😛? ). Other than that alist works perfectly fine for me.

phyy-nx commented 1 year ago

@Baharis can you rebase on master? That will likely fix the XFEL CI tests.

Baharis commented 1 year ago

Some XFEL tests fail; I think it might have something to do with the way test data is defined; I'll look into that later.

Baharis commented 1 year ago

So my intuition was not wrong – reflections used by XFEL regression tests feature empty experiment_identifiers()!

In [1]: path = '$MODULES/xfel_regression/merging_test_data/big_input_data/idx-0061_integrated.pickle'
In [2]: r = flex.reflection_table.from_file(path)
In [3]: [sum(r['id'] == s) for s in range(5)]
Out[3]: [57, 109, 80, 91, 0]
In [4]: list(r.experiment_identifiers().keys())
Out[4]: []

The identifiers of all four reflections specified in idx-0061_integrated_experiments.json are empty strings. Therefore, I would expect the experiment_identifiers to exist and map all ids to ''. Then, according to the changes suggested in this PR, both expt and refl should have a new identifier generated and assigned.

@phyy-nx Do we expect this situation in any real applications? Should it be ever possible to have reflections with id, but no experiment_identifiers? Depending on the answers some updates to the tests might be due. Or can we never rely on reflection_table being defined?

Baharis commented 1 year ago

I have (temporarily?) allowed experiment_identifiers to be empty in order to look for any other issues. For the purpose of these tests, the loader will assume the identifier is '' whenever it can't get(experiment_id).

Edit: fun fact – in big_input_data experiment_identifiers are undefined; in small_ and cosym_input_data, refl.experiment_identifiers do not match expt.identifiers (presumably because only the latter are defined at all)!

Baharis commented 1 year ago

In response to the failing tests, I suggest a PR #11 to xfel_regression. It causes the input worker to ignore existing identifiers and reassign matching ones to both experiments and reflections by specifying input.override_identifiers=True. This allows the tests to run and investigate all potential issues with virtually no drawbacks other than not looking for the identifiers mismatch (which, as we know, do not match).

cctbx / cctbx_project

More flexible `cctbx.xfel.merge` file listing and validation #889