Closed Baharis closed 1 year ago
@phyy-nx This is the input worker change I discussed during last Tuesday's meeting. As much as I tried preserving the git history structure to make the review and history access easier, this resulted in an ugly and difficult-to-maintain structure with confusing names and duplicates. I opted to rewrite it as a more readable function call. I successfully tested the new reader on my PSII data (no changes) and current diffBragg stage 1 data (correctly raises mismatch errors and reads from two directories). Please let me know what you think and what else I should do here.
@dermen While all existing functionalities are preserved in this code, I reformatted or moved a number of lines you contributed some 2 years ago concerning alist. I did not have a good reference to test these, so if you keep using input.alist, I would appreciate it if you could try them out on this branch. The behavior should be identical, with the added benefit that cctbx.xfel.merge will now be able to read directly from hopper_stage_one/expers and hopper_stage_one/refls.
@Baharis , indeed I do use alist quite a bit.. I can try some tests
@dermen I am sorry to ask about it, but I figured you might have some ready test-cases. Thank you and let me know if anything's wrong!
It might be useful to test alist functionality beyond my example (which I will test soon)
So, if your merge command is
cctbx.xfel.merge merge.phil input.path=something
assuming you have some *integrated.expt files in something, and that input.experiments_suffix=_integrated.expt (the default), then you can create an alist with some of the files by doing e.g.,
# here 100 is the number of files you want in the alist, change accordingly
ls $PWD/something/*_integrated.expt | head -n 100 > files.txt
and to merge while ignoring those e.g. 100 files you can then do
cctbx.xfel.merge merge.phil input.path=something alist.file=files.txt alist.type=files alist.op=reject
To only merge those 100 files, use alist.op=keep.
the iobs_main.log should report the correct number of inputs ...
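For anyone who prefers to build the file list from Python rather than with ls and head, here is a minimal sketch of the same step. The function name write_alist and its arguments are made up for illustration and are not part of cctbx; it only assumes that the alist file is a plain text file with one absolute path per line, as in the shell example above.

```python
from pathlib import Path


def write_alist(input_dir, alist_path, n_files=100, suffix="_integrated.expt"):
    """Write the absolute paths of the first n_files matching files to alist_path.

    Hypothetical helper: mirrors `ls $PWD/dir/*suffix | head -n N > files.txt`.
    """
    # Sort so the selection is reproducible (the order may differ slightly from ls).
    paths = sorted(Path(input_dir).resolve().glob(f"*{suffix}"))[:n_files]
    # One absolute path per line, as in the shell example.
    Path(alist_path).write_text("".join(f"{p}\n" for p in paths))
    return paths
```

Calling write_alist("something", "files.txt", n_files=100) should then produce a list usable with alist.file=files.txt alist.type=files, combined with alist.op=reject or alist.op=keep as described above.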
I tested the alist functionality in a few configurations. There was a tiny bug introduced when rewriting the implementation (if {1: 2} is a dict and {1, 2} is a set, what is {} 😛?). Other than that, alist works perfectly fine for me.
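For readers who have not run into this one: in Python, empty braces create a dict, not a set. The snippet below only illustrates that language rule; the variable names are made up, and it is a guess that the bug mentioned above was of exactly this kind.

```python
# Empty braces are an empty dict; an empty set must be spelled set().
empty_braces = {}
empty_set = set()

print(type(empty_braces).__name__)  # dict
print(type(empty_set).__name__)     # set

# Set-only methods therefore do not exist on {}.
empty_set.add("some_identifier")
has_add = hasattr(empty_braces, "add")
print(has_add)  # False
```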
@Baharis can you rebase on master? That will likely fix the XFEL CI tests.
Some XFEL tests fail; I think it might have something to do with the way test data is defined; I'll look into that later.
So my intuition was not wrong: reflections used by XFEL regression tests feature empty experiment_identifiers()!
In [1]: path = '$MODULES/xfel_regression/merging_test_data/big_input_data/idx-0061_integrated.pickle'
In [2]: r = flex.reflection_table.from_file(path)
In [3]: [sum(r['id'] == s) for s in range(5)]
Out[3]: [57, 109, 80, 91, 0]
In [4]: list(r.experiment_identifiers().keys())
Out[4]: []
The identifiers of all four experiments specified in idx-0061_integrated_experiments.json are empty strings. Therefore, I would expect the experiment_identifiers to exist and map all ids to ''. Then, according to the changes suggested in this PR, both expt and refl should have a new identifier generated and assigned.
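To make the expectation concrete, here is a minimal sketch of the rule as the PR description states it: identifiers must either match or both be falsy, and anything else is an error. The function reconcile_identifiers and the make_identifier callable are hypothetical stand-ins; this is not the actual code in simple_file_loader.

```python
def reconcile_identifiers(expt_id, refl_id, make_identifier):
    """Return the identifier an expt and its refl should share.

    make_identifier is a hypothetical zero-argument callable standing in
    for the existing identifier-generating algorithm.
    """
    # Both missing ('' or None): generate one identifier and use it for both.
    if not expt_id and not refl_id:
        return make_identifier()
    # Both present and equal: nothing to do.
    if expt_id == refl_id:
        return expt_id
    # Anything else is a mismatch and should not pass silently.
    raise KeyError(
        f'Expt and refl identifier mismatch: "{expt_id}" vs "{refl_id}"'
    )
```

Under this rule, the big_input_data case above (empty identifiers on both sides) would fall into the first branch and receive a freshly generated identifier.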
@phyy-nx Do we expect this situation in any real applications? Should it ever be possible to have reflections with id, but no experiment_identifiers? Depending on the answers, some updates to the tests might be due. Or can we never rely on reflection_table being defined?
I have (temporarily?) allowed experiment_identifiers to be empty in order to look for any other issues. For the purpose of these tests, the loader will assume the identifier is '' whenever it can't get(experiment_id).
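The fallback can be pictured roughly as below. A plain dict stands in for whatever experiment_identifiers() returns (an assumption, not the real DIALS type), and identifier_for is a made-up name.

```python
def identifier_for(identifiers, experiment_id):
    """Look up the identifier for an id, assuming '' when none is recorded."""
    # `identifiers` is a plain dict here, standing in for the real mapping.
    return identifiers.get(experiment_id, "")


# With an empty mapping, as in big_input_data, every id falls back to ''.
no_identifiers = {}
fallbacks = [identifier_for(no_identifiers, i) for i in range(4)]
print(fallbacks)  # ['', '', '', '']
```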
Edit: fun fact: in big_input_data, experiment_identifiers are undefined; in small_ and cosym_input_data, refl.experiment_identifiers do not match expt.identifiers (presumably because only the latter are defined at all)!
In response to the failing tests, I suggest PR #11 to xfel_regression. It causes the input worker to ignore existing identifiers and reassign matching ones to both experiments and reflections by specifying input.override_identifiers=True. This allows the tests to run and any other potential issues to be investigated, with virtually no drawbacks other than not checking for identifier mismatches (and, as we know, the identifiers in the test data do not match).
Currently, cctbx.xfel.merge is very forgiving in terms of the string identifiers (UUIDs) of the experiments (expts) and reflections (refls) it reads. The input worker checks whether the expt UUID exists; if it does not, the expt is given a new UUID. The refl UUID is not checked at all: it is not compared against the expt UUID to assert a match, but rather overwritten with the expt UUID. This can hide a pair of "matching" expts and refls with two different UUIDs, or change the refl UUID without notice if the expt UUID is missing.

Additionally, the current input worker is somewhat inflexible in the way input expt and refl files are read. Expt and refl files must not only have identical filename stems, but must also reside in the same directory. When expts and refls are stored in different directories, e.g. in diffBragg hopper output, there is no way to read those files using existing functionality.
This PR suggests changes to the way the input worker (simple_file_loader) finds input file pairs and asserts a match between the experiments and reflections contained in them.

Firstly, the new algorithm fixes all the expt-refl UUID match issues stated in the first paragraph. Now expt and refl UUIDs must EITHER match OR both evaluate to False (e.g. '' or None). In the first case, no additional action is taken. In the second, both the expt and the matching refl are assigned the same UUID based on the expt metadata using the existing algorithm. However, contrary to previous behavior, if expt and refl UUIDs do not match, the worker terminates with an informative KeyError: Expt and refl identifier mismatch: "expt_UUID" in /expt/path.refl vs "refl_UUID" in /refl/path.refl.

Secondly, I have modified the existing expt–refl path pair (PathPair) matching algorithm to allow a more flexible input of expt / refl files. The new algorithm works as follows:

The new algorithm is easier to explain with an example. Let's assume the following directory structure (available at NERSC /global/cfs/cdirs/m2859/dtchon/ExptReflMatchExample):

Let's read the data in this directory a few times using
cctbx.xfel.merge input.phil, where input.phil specifies the following: input paths, dispatch.step_list=input, input.experiments_suffix=.expt, and input.reflections_suffix=.refl. The following table documents which pairs are ultimately read for each combination of inputA/B/C/D
input paths:
A
B
C
D
example1
example1 (unmatched expts ignored)
example1 (unmatched refls ignored)
example1 (unmatched refl ignored)
example2 and example3
example3
example3.refls
example3.refls
example3.refls
example1 and example3
example1–3 (1.refl in inputC is ignored)
example3.refls

I tried to preserve the existing structure of the
file_lister.py, but since I could not use the existing generator, this yielded a very confusing file structure while also duplicating most of the code. For maintainability reasons, I opted to rewrite the confusing generator object into a function call instead. Consequently, creating a file list changed from a confusing lister = file_lister(self.params); file_list = list(lister.filepair_generator()) to a simple file_list = list_input_pairs(self.params). Unfortunately, as a side effect, while all existing functionality in this file is preserved, the git authorship history in this file is lost.