Open graeme-winter opened 4 years ago
Proposal: eliminate image_range as a data object, only retain it at the UI level.
Redefine batch_offset
to present the difference between the start of frame 0 (i.e. start of array) and the first image number i.e. foo_00001.cbf would be 1. This is a different definition to the currently used one.
This fix would not be backwards compatible.
Reappropriating batch_offset
might not be so simple, as this was specifically introduced to allow tracking/handling of unique batch numbers for multi-crystal datasets in the mtz sense:
https://github.com/cctbx/dxtbx/commit/1a7b04e247e7adce71b3e944bc91654f8a7c8f80
Is that how it is actually used though?
I ask this as someone who's had to poke at batch_offset
in scans a fair bit recently, in processing single scan data sets - and it's an eternal source of confusion. It may also be used for that.
I'd like to think that for us we internally consistently count from 0, not from 1, and the fact that MTZ needs to count from 1 should be handled in exporting MTZ files, I believe.
At the moment we count from 🎲(0,1) depending on where in the code we are, which I think is the worst possible outcome.
Is that how it is actually used though?
Yes, this is what it was added for. It is used to ensure that mtz batch offsets remain consistent for a given data set across multiple possible subsets of data sets:
I'm not sure that batch_offset
was ever strictly used/relevant in the context of processing a single scan - i.e. leaving it as the default value of 0
should work downstream e.g. with export_mtz
.
Fair enough - however if you have a scan which starts at image != 1 you will find that it is used in processing. You'll also find yourself in the place which sends you mad trying to make sense of it (and dealing with the fallout thereof in e.g. output of histogram from dials.find_spots
.
Perhaps much of the confusion of this is we have 1 number which needs to do (probably) n >= 2
things, and which thing you think is the main thing depends on what you have looked at most recently 🙂
I don't see where it is used in processing outside of the two places in dials I linked above - i.e. in export_mtz
and filter_reflections
:
https://github.com/dials/dials/search?q=get_batch_offset&unscoped_q=get_batch_offset
Grey-Area tmp :) $ dials.import ~/data/i04_bag_training/th_8_2_002?.cbf.gz
DIALS (2018) Acta Cryst. D74, 85-97. https://doi.org/10.1107/S2059798317017235
DIALS 3.dev.117-g6449cf6e6
The following parameters have been modified:
input {
experiments = <image files>
}
--------------------------------------------------------------------------------
format: <class 'dxtbx.format.FormatCBFMiniPilatusDLS6MSN100.FormatCBFMiniPilatusDLS6MSN100'>
num images: 10
sequences:
still: 0
sweep: 1
num stills: 0
--------------------------------------------------------------------------------
Writing experiments to imported.expt
Grey-Area tmp :) $ grep offset imported.expt
"raw_image_offset": [
"raw_image_offset": [
"batch_offset": 19,
Then also here (which is wrong)
Histogram of per-image spot count for imageset 0:
2639 spots found on 9 images (max 478 / bin)
*
* *
* *
* *
* *
*********
*********
*********
*********
*********
20 28
--------------------------------------------------------------------------------
Saved 2639 reflections to strong.refl
etc.
You'll find that it is touched all over the shop
I see that commit https://github.com/cctbx/dxtbx/commit/1b793ec7a9080706085a415b6d2ae69feb2fbc5f introduced its use in a different context for which it was originally introduced in https://github.com/cctbx/dxtbx/commit/1a7b04e247e7adce71b3e944bc91654f8a7c8f80. It was never meant to be set on import according to the particular image range that was used, i.e. it should be the default, i.e. 0, whichever image range was used.
The array_range
is always equal to image_range[0] - 1, image_range[1]
https://github.com/cctbx/dxtbx/blob/bbb9f7eb50340811f1a3b942fb0eb416fa0218c5/model/scan.h#L194-L197
so the array_range
concept is not useful for actually finding an array index to address the data.
For a dataset not starting at image zero dials.import centroid_000{8..9}.cbf
>>> el[0].scan.get_array_range()
(7, 9)
>>> el[0].scan.get_image_range()
(8, 9)
>>> el[0].imageset.get_raw_data(0)
(<scitbx_array_family_flex_ext.int object at 0x7f7abc524df0>,)
>>> el[0].imageset.get_raw_data(8)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
RuntimeError: dxtbx Internal Error: /home/fcx32934/sw/cctbx/modules/dxtbx/imageset.h(624): DXTBX_ASSERT(index < indices_.size()) failure.
In my opinion, the current array_range
is the more broken concept here. I believe a useful first step to tidying up would be to replace every use of this with image_range[0] - 1, image_range[1]
(or whatever is appropriate on a case-by-case basis), then remove the array_range
entirely from the API.
🤔
I am not sure that I jump to the same conclusion on this topic. In fact, I almost certainly jump to the alternate conclusion of using array_range
as the canonical representation with some integral offset which maps us to people-image-number
Reasoning: the array offset in the general sense, not the sense as it is used in dxtbx, is well defined and well understood. Image range can easily count from some random integer so is not universally well defined.
I accept that this distinction could well be representative of a point of view.
I also recognise that removing exactly one of these could be done as a semi-automated transformation which could be the first (or zeroth) commit on a branch to investigate these topics.
Suggest here we are in a good position to start with two branches to investigate this - route A and route B, then discuss the changes and select the preferable solution? Based on how invasive / insane the change sets end up looking.
I think David's point is that in its current form, I'm not sure what array_range
is actually useful for. And if change it such that it always starts at 0
, then it still isn't useful, as it would always be (0, len(imageset))
.
Yes, I don't think A and B are incompatible. The point is that the current array_range
concept does not make sense. We could aim for a future in which we only store an offset to go from array index to people index, but to get there we anyway have to remove what we currently call array_range
Indeed @graeme-winter your point
I also recognise that removing exactly one of these could be done as a semi-automated transformation which could be the first (or zeroth) commit on a branch to investigate these topics.
Was in fact the only change I was proposing so far. Actually, I don't mind if we remove image_range
instead. As we noted before, often we call one of these ranges only to use one element. It does make more sense to me if eventually Scan
just stored the number of images and a computer-to-people offset number as separate attributes.
I genuinely think the overhead of looking at this from both perspectives is justified.
@rjgildea - if array_range is not useful as such (i.e. always starts from 0 under any circumstance) then that is still a useful outcome - in fact even more so since we would iterate through images with something like
for i in range(len(imageset)):
print(f"The {i}'th image is called {imageset.get_people_number(i)}")
and
for i in range(len(imageset)):
data = imageset.get_raw_data(i)
would do the same as as possible
for data in imageset:
#things
I maybe feel strongly enough that perhaps this should be the API: we make sure that len()
works as expected and provide an integral offset to transform to people numbers (which makes things internally sane and transforms at the edges, as proposed by @ndevenish )
=> this is a branch hypothesis to test 😉
I like that proposed API much more than what we have at the moment
I like that proposed API much more than what we have at the moment
👍
Grey-Area modules :) $ grep -R "get_image_range" dxtbx dials | grep -v "\.pyc" | wc -l
101
Grey-Area modules :) $ grep -R "get_array_range" dxtbx dials | grep -v "\.pyc" | wc -l
109
Non-trivial 🤔
I think translating all get_array_range
use to get_image_range
(or vice versa) could be automatic, but then you get 210 instances that we probably want to check manually to ensure they make sense and convert to use something like scan.first_image_number
and len(imageset)
I'm also not sure if len(imageset)
is sufficient, because sometimes you might want to know the number of images in the scan without accessing the imageset... maybe?
I've had a look at the usage of get_array_range
and get_image_range
in DIALS and have now come to the conclusion that both accessors have their use, but the current naming is terrible. In reality they are different views on the concept of a "Z position" within a scan.
Sometimes you want to know a position in counting numbers of images - "the 5th image is blank" for example. Sometimes you want to know a floating point position in the scan - "the spot is at Z=3.78" for example. So, what should matter is whether you want a discrete or continuous view on the dataset, as illustrated:
I think that get_array_range
should be changed to return a tuple of float
s (at the moment they are int
s) and its name should be changed to something more sensible that doesn't include the irrelevant word "array". Something like get_image_limits
or even get_z_limits
to highlight that it is a continuous quantity.
Other methods should be renamed too, such as get_array_index_from_angle
, perhaps to get_z_from_angle
. Methods like get_image_index_from_angle
confuse things further because they return a continuous quantity rather than a counting number. Luckily this method is not used anywhere in DIALS and could probably be removed.
I like the idea of naming it get_z_*
, to make it clear that it corresponds to the values used for spot centroids. I'd be happy with get_z_range
instead of get_array_range
(returning a tuple of float
s as you suggest).
I propose the following plan:
Scan.get_image_index_from_angle
and Scan.get_image_indices_with_angle
once https://github.com/cctbx/cctbx_project/pull/522 is in. These are not actually used anywhere in DIALS/dxtbx/cctbx but their existence perpetuates the confusion that an "image index" is a continuous quantity*_array_*
methods with *_z_*
(leaving what's matched by the wildcard as-is so far). On the Python side, recover the existing methods as synonyms, but deprecate themScan.get_z_range
to return a tuple of floatsScan.get_angle_from_image_index
(3 in DIALS) with the equivalent call to Scan.get_angle_from_z_index
, then deprecate Scan.get_angle_from_image_index
Scan.get_batch_for_image_index
and Scan.is_image_index_valid
work only with integer image numbersScan.get_batch_for_z_index
? It is not used anywhere we know about.*_z_index*
to *z_position*
? As a continuous quantity it is not indexable.I might have missed a few little bits, but that's the bulk of the untangling between the "array" and "image" view of Z position. Nothing there touches the meaning and use of batch_offset
, which could be addressed separately.
Sounds generally sensible. Perhaps get_angle_from_z_index
could simply be get_angle_from_z
(or get_angle_for_z
)?
Now https://github.com/cctbx/cctbx_project/pull/522 is merged I could start work on the tasks above, if we agree. @graeme-winter, what are your thoughts here?
Following discussion today, we agreed to re-order the proposed scheme of work so that the most invasive changes are made first. This may unearth problems, such as places in the code where array_range
has been interpreted (incorrectly) as actually being an integer index into the data array rather than a z position.
With that in mind, the proposal is now the following. Changes will be required across (at least) dxtbx, DIALS and xia2.
Type changes
Scan.get_array_range
to return a tuple of floats and check each instance makes sense.Scan.get_batch_for_image_index
and Scan.is_image_index_valid
work only with integer image numbers, not floating point positionsDeprecation
Scan.get_image_index_from_angle
and Scan.get_image_indices_with_angle
using boost_adaptbx.boost.python.deprecate_method
. These are not actually used anywhere in DIALS/dxtbx/cctbx but their existence perpetuates the confusion that an "image index" is a continuous quantityScan.get_batch_for_array_index
It is not used anywhere we know about.Renaming
*_array_*
methods with *_z_*
(leaving what's matched by the wildcard as-is so far). On the Python side, recover the existing methods as synonyms, but deprecate them*z_index*
to *z_*
(i.e. get_angle_from_z_index
→ get_angle_from_z
)Scan.get_angle_from_image_index
(3 in DIALS) with the equivalent call to Scan.get_angle_from_z
, then deprecate Scan.get_angle_from_image_index
Clean up
If we're already going down the path of invasive changes, is it worth considering removing the get_
prefix from the method names as we're changing them anyway?
If we're already going down the path of invasive changes, is it worth considering removing the
get_
prefix from the method names as we're changing them anyway?
Rename, add alias as old name w/deprecation warning?
Yes, all renaming proposed includes keeping old methods with deprecation warning. If removing get_*
I would actually prefer to do this wholesale throughout dxtbx rather than getting to a halfway state where some accessors use get_*
etc. and some don't.
Well, you have to start somewhere - doing all get_*
s at the same time as a global change is a very high hurdle.
(If you do remove the get_
prefix then please also turn the methods into properties :slightly_smiling_face:)
Hmm, I'm not too keen to do that here. It wouldn't feel right to only change a few methods in Scan
. I'd rather do the whole of Scan
at once so it is consistent. But that touches at least 22 methods rather than just a handful. Changing to properties - not sure, these are mostly boost python exposed C++
The other thing is that I have much less interest in the get_
prefixes than I do about fixing the issue here.
Ok, I've started looking at this and it is messier than I'd hoped. It was a good idea to move the most invasive changes to the top of the list. Changing the type of Scan.get_array_range
to return doubles in anticipation of renaming it Scan.get_z_range
results in many changes in DIALS to either use Scan.get_image_range
and then subtract one from the first element, or casting to ints immediately after calling. So, although it feels more natural to have floating point bounds for a continuous z range, in practice these are mostly used as integers (and indeed they always take integer values).
What we could do is accept that these bounds are integer, but then these changes are really little more than renaming *_array_*
to *_z_*
and avoiding any more fundamental changes. What do you think?
Having a quick look at the usage of Scan.get_image_range
I suspect this would be the easier of the two concepts to attack. It is quite often used like this:
image_range = records[0].scan.get_image_range()
image_range = (image_range[0], image_range[1] + 1)
that is, it is immediately altered. The upper bound often has one added so that when passed to the range
function it will create an iterator over the image numbers. It is also sometimes used like this:
i.get_scan().get_image_range()[0]
that is, only the first image number is of interest.
So, another possible way to proceed with this issue would be to rename get_array_range
to get_z_range
as proposed, but leave the result as a pair of ints. Remove get_image_range
and replace all calls to appropriate uses of two new functions, get_first_image_index
and get_last_image_index
.
This is less invasive than the previous proposal but still manages to tease apart the potential confusion of having two definitions of what the range of a scan is.
In preparation for tackling this next Tuesday, I wanted to refresh my mind about the issue. In my opinion the most confusing and dangerous (because it leads to wrong assumptions) part of this is the simplest to fix. It is the naming of get_array_range
. As noted above this has nothing to do with the data array. It is a tuple, the first value of which is always one less than the first image number. Even the comment above the function is wrong. It isn't zero-based! If your first image is 100, then the first value of the so-called array_range
is 99.
One of the suggestions above is to rename this function get_z_range
(and similarly for the other misnamed "*array*
" functions. I still like this idea.
It was assumed from day 0 that images went from 1...N so the array offsets went from 0...N-1
Then batch_offset was added to cope with cases where the first image was != 1
This however forever bakes in off-by-one errors e.g. if first image is 0, as the array range starts at -1