cctbx / dxtbx

Diffraction Experiment Toolbox
BSD 3-Clause "New" or "Revised" License
2 stars 18 forks source link

image_range vs. array_range baked in off-by-one errors #186

Open graeme-winter opened 4 years ago

graeme-winter commented 4 years ago

It was assumed from day 0 that images went from 1...N so the array offsets went from 0...N-1

Then batch_offset was added to cope with cases where the first image was != 1

This however forever bakes in off-by-one errors e.g. if first image is 0, as the array range starts at -1

graeme-winter commented 4 years ago

Proposal: eliminate image_range as a data object, only retain it at the UI level.

Redefine batch_offset to present the difference between the start of frame 0 (i.e. start of array) and the first image number i.e. foo_00001.cbf would be 1. This is a different definition to the currently used one.

This fix would not be backwards compatible.

graeme-winter commented 4 years ago

https://github.com/dials/dials/issues/587 related

rjgildea commented 4 years ago

Reappropriating batch_offset might not be so simple, as this was specifically introduced to allow tracking/handling of unique batch numbers for multi-crystal datasets in the mtz sense: https://github.com/cctbx/dxtbx/commit/1a7b04e247e7adce71b3e944bc91654f8a7c8f80

graeme-winter commented 4 years ago

Is that how it is actually used though?

I ask this as someone who's had to poke at batch_offset in scans a fair bit recently, in processing single scan data sets - and it's an eternal source of confusion. It may also be used for that.

I'd like to think that for us we internally consistently count from 0, not from 1, and the fact that MTZ needs to count from 1 should be handled in exporting MTZ files, I believe.

At the moment we count from 🎲(0,1) depending on where in the code we are, which I think is the worst possible outcome.

rjgildea commented 4 years ago

Is that how it is actually used though?

Yes, this is what it was added for. It is used to ensure that mtz batch offsets remain consistent for a given data set across multiple possible subsets of data sets:

https://github.com/xia2/xia2/blob/f68bb2b07aeffb3fb6b9b6509939ae1dd4c483d2/Modules/MultiCrystal/ScaleAndMerge.py#L194-L204

https://github.com/xia2/xia2/blob/f68bb2b07aeffb3fb6b9b6509939ae1dd4c483d2/Modules/MultiCrystal/ScaleAndMerge.py#L274

https://github.com/dials/dials/blob/9d2b5d42909673f902f6e610a298026b6197d14c/util/filter_reflections.py#L191

https://github.com/dials/dials/blob/efff60c4ee60ef9480718d044f614747edfea9fb/util/export_mtz.py#L472-L477

I'm not sure that batch_offset was ever strictly used/relevant in the context of processing a single scan - i.e. leaving it as the default value of 0 should work downstream e.g. with export_mtz.

graeme-winter commented 4 years ago

Fair enough - however if you have a scan which starts at image != 1 you will find that it is used in processing. You'll also find yourself in the place which sends you mad trying to make sense of it (and dealing with the fallout thereof in e.g. output of histogram from dials.find_spots.

Perhaps much of the confusion of this is we have 1 number which needs to do (probably) n >= 2 things, and which thing you think is the main thing depends on what you have looked at most recently 🙂

rjgildea commented 4 years ago

I don't see where it is used in processing outside of the two places in dials I linked above - i.e. in export_mtz and filter_reflections: https://github.com/dials/dials/search?q=get_batch_offset&unscoped_q=get_batch_offset

graeme-winter commented 4 years ago
Grey-Area tmp :) $ dials.import ~/data/i04_bag_training/th_8_2_002?.cbf.gz
DIALS (2018) Acta Cryst. D74, 85-97. https://doi.org/10.1107/S2059798317017235
DIALS 3.dev.117-g6449cf6e6
The following parameters have been modified:

input {
  experiments = <image files>
}

--------------------------------------------------------------------------------
  format: <class 'dxtbx.format.FormatCBFMiniPilatusDLS6MSN100.FormatCBFMiniPilatusDLS6MSN100'>
  num images: 10
  sequences:
    still:    0
    sweep:    1
  num stills: 0
--------------------------------------------------------------------------------
Writing experiments to imported.expt
Grey-Area tmp :) $ grep offset imported.expt 
          "raw_image_offset": [
        "raw_image_offset": [
      "batch_offset": 19,

Then also here (which is wrong)


Histogram of per-image spot count for imageset 0:
2639 spots found on 9 images (max 478 / bin)
        *
*       *
*       *
*       *
*       *
*********
*********
*********
*********
*********
20     28

--------------------------------------------------------------------------------
Saved 2639 reflections to strong.refl

etc.

You'll find that it is touched all over the shop

rjgildea commented 4 years ago

I see that commit https://github.com/cctbx/dxtbx/commit/1b793ec7a9080706085a415b6d2ae69feb2fbc5f introduced its use in a different context for which it was originally introduced in https://github.com/cctbx/dxtbx/commit/1a7b04e247e7adce71b3e944bc91654f8a7c8f80. It was never meant to be set on import according to the particular image range that was used, i.e. it should be the default, i.e. 0, whichever image range was used.

dagewa commented 4 years ago

The array_range is always equal to image_range[0] - 1, image_range[1] https://github.com/cctbx/dxtbx/blob/bbb9f7eb50340811f1a3b942fb0eb416fa0218c5/model/scan.h#L194-L197 so the array_range concept is not useful for actually finding an array index to address the data.

For a dataset not starting at image zero dials.import centroid_000{8..9}.cbf

>>> el[0].scan.get_array_range()
(7, 9)
>>> el[0].scan.get_image_range()
(8, 9)
>>> el[0].imageset.get_raw_data(0)
(<scitbx_array_family_flex_ext.int object at 0x7f7abc524df0>,)
>>> el[0].imageset.get_raw_data(8)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: dxtbx Internal Error: /home/fcx32934/sw/cctbx/modules/dxtbx/imageset.h(624): DXTBX_ASSERT(index < indices_.size()) failure.
dagewa commented 4 years ago

In my opinion, the current array_range is the more broken concept here. I believe a useful first step to tidying up would be to replace every use of this with image_range[0] - 1, image_range[1] (or whatever is appropriate on a case-by-case basis), then remove the array_range entirely from the API.

graeme-winter commented 4 years ago

🤔

I am not sure that I jump to the same conclusion on this topic. In fact, I almost certainly jump to the alternate conclusion of using array_range as the canonical representation with some integral offset which maps us to people-image-number

Reasoning: the array offset in the general sense, not the sense as it is used in dxtbx, is well defined and well understood. Image range can easily count from some random integer so is not universally well defined.

I accept that this distinction could well be representative of a point of view.

I also recognise that removing exactly one of these could be done as a semi-automated transformation which could be the first (or zeroth) commit on a branch to investigate these topics.

graeme-winter commented 4 years ago

Suggest here we are in a good position to start with two branches to investigate this - route A and route B, then discuss the changes and select the preferable solution? Based on how invasive / insane the change sets end up looking.

rjgildea commented 4 years ago

I think David's point is that in its current form, I'm not sure what array_range is actually useful for. And if change it such that it always starts at 0, then it still isn't useful, as it would always be (0, len(imageset)).

dagewa commented 4 years ago

Yes, I don't think A and B are incompatible. The point is that the current array_range concept does not make sense. We could aim for a future in which we only store an offset to go from array index to people index, but to get there we anyway have to remove what we currently call array_range

dagewa commented 4 years ago

Indeed @graeme-winter your point

I also recognise that removing exactly one of these could be done as a semi-automated transformation which could be the first (or zeroth) commit on a branch to investigate these topics.

Was in fact the only change I was proposing so far. Actually, I don't mind if we remove image_range instead. As we noted before, often we call one of these ranges only to use one element. It does make more sense to me if eventually Scan just stored the number of images and a computer-to-people offset number as separate attributes.

graeme-winter commented 4 years ago

I genuinely think the overhead of looking at this from both perspectives is justified.

@rjgildea - if array_range is not useful as such (i.e. always starts from 0 under any circumstance) then that is still a useful outcome - in fact even more so since we would iterate through images with something like

for i in range(len(imageset)):
  print(f"The {i}'th image is called {imageset.get_people_number(i)}")

and

for i in range(len(imageset)):
  data = imageset.get_raw_data(i)

would do the same as as possible

for data in imageset:
  #things 
graeme-winter commented 4 years ago

I maybe feel strongly enough that perhaps this should be the API: we make sure that len() works as expected and provide an integral offset to transform to people numbers (which makes things internally sane and transforms at the edges, as proposed by @ndevenish )

graeme-winter commented 4 years ago

=> this is a branch hypothesis to test 😉

dagewa commented 4 years ago

I like that proposed API much more than what we have at the moment

graeme-winter commented 4 years ago

I like that proposed API much more than what we have at the moment

👍

graeme-winter commented 4 years ago
Grey-Area modules :) $ grep -R "get_image_range" dxtbx dials | grep -v "\.pyc" | wc -l
     101
Grey-Area modules :) $ grep -R "get_array_range" dxtbx dials | grep -v "\.pyc" | wc -l
     109

Non-trivial 🤔

dagewa commented 4 years ago

I think translating all get_array_range use to get_image_range (or vice versa) could be automatic, but then you get 210 instances that we probably want to check manually to ensure they make sense and convert to use something like scan.first_image_number and len(imageset)

dagewa commented 4 years ago

I'm also not sure if len(imageset) is sufficient, because sometimes you might want to know the number of images in the scan without accessing the imageset... maybe?

dagewa commented 4 years ago

I've had a look at the usage of get_array_range and get_image_range in DIALS and have now come to the conclusion that both accessors have their use, but the current naming is terrible. In reality they are different views on the concept of a "Z position" within a scan.

Sometimes you want to know a position in counting numbers of images - "the 5th image is blank" for example. Sometimes you want to know a floating point position in the scan - "the spot is at Z=3.78" for example. So, what should matter is whether you want a discrete or continuous view on the dataset, as illustrated:

image_and_array_small

I think that get_array_range should be changed to return a tuple of floats (at the moment they are ints) and its name should be changed to something more sensible that doesn't include the irrelevant word "array". Something like get_image_limits or even get_z_limits to highlight that it is a continuous quantity.

Other methods should be renamed too, such as get_array_index_from_angle, perhaps to get_z_from_angle. Methods like get_image_index_from_angle confuse things further because they return a continuous quantity rather than a counting number. Luckily this method is not used anywhere in DIALS and could probably be removed.

rjgildea commented 4 years ago

I like the idea of naming it get_z_*, to make it clear that it corresponds to the values used for spot centroids. I'd be happy with get_z_range instead of get_array_range (returning a tuple of floats as you suggest).

dagewa commented 4 years ago

I propose the following plan:

I might have missed a few little bits, but that's the bulk of the untangling between the "array" and "image" view of Z position. Nothing there touches the meaning and use of batch_offset, which could be addressed separately.

rjgildea commented 4 years ago

Sounds generally sensible. Perhaps get_angle_from_z_index could simply be get_angle_from_z (or get_angle_for_z)?

dagewa commented 4 years ago

Now https://github.com/cctbx/cctbx_project/pull/522 is merged I could start work on the tasks above, if we agree. @graeme-winter, what are your thoughts here?

dagewa commented 3 years ago

Following discussion today, we agreed to re-order the proposed scheme of work so that the most invasive changes are made first. This may unearth problems, such as places in the code where array_range has been interpreted (incorrectly) as actually being an integer index into the data array rather than a z position.

With that in mind, the proposal is now the following. Changes will be required across (at least) dxtbx, DIALS and xia2.

Type changes

Deprecation

Renaming

Clean up

rjgildea commented 3 years ago

If we're already going down the path of invasive changes, is it worth considering removing the get_ prefix from the method names as we're changing them anyway?

graeme-winter commented 3 years ago

If we're already going down the path of invasive changes, is it worth considering removing the get_ prefix from the method names as we're changing them anyway?

Rename, add alias as old name w/deprecation warning?

dagewa commented 3 years ago

Yes, all renaming proposed includes keeping old methods with deprecation warning. If removing get_* I would actually prefer to do this wholesale throughout dxtbx rather than getting to a halfway state where some accessors use get_* etc. and some don't.

Anthchirp commented 3 years ago

Well, you have to start somewhere - doing all get_*s at the same time as a global change is a very high hurdle.

(If you do remove the get_ prefix then please also turn the methods into properties :slightly_smiling_face:)

dagewa commented 3 years ago

Hmm, I'm not too keen to do that here. It wouldn't feel right to only change a few methods in Scan. I'd rather do the whole of Scan at once so it is consistent. But that touches at least 22 methods rather than just a handful. Changing to properties - not sure, these are mostly boost python exposed C++

dagewa commented 3 years ago

The other thing is that I have much less interest in the get_ prefixes than I do about fixing the issue here.

dagewa commented 3 years ago

Ok, I've started looking at this and it is messier than I'd hoped. It was a good idea to move the most invasive changes to the top of the list. Changing the type of Scan.get_array_range to return doubles in anticipation of renaming it Scan.get_z_range results in many changes in DIALS to either use Scan.get_image_range and then subtract one from the first element, or casting to ints immediately after calling. So, although it feels more natural to have floating point bounds for a continuous z range, in practice these are mostly used as integers (and indeed they always take integer values).

What we could do is accept that these bounds are integer, but then these changes are really little more than renaming *_array_* to *_z_* and avoiding any more fundamental changes. What do you think?

dagewa commented 3 years ago

Having a quick look at the usage of Scan.get_image_range I suspect this would be the easier of the two concepts to attack. It is quite often used like this:

            image_range = records[0].scan.get_image_range()
            image_range = (image_range[0], image_range[1] + 1)

that is, it is immediately altered. The upper bound often has one added so that when passed to the range function it will create an iterator over the image numbers. It is also sometimes used like this:

            i.get_scan().get_image_range()[0]

that is, only the first image number is of interest.

So, another possible way to proceed with this issue would be to rename get_array_range to get_z_range as proposed, but leave the result as a pair of ints. Remove get_image_range and replace all calls to appropriate uses of two new functions, get_first_image_index and get_last_image_index.

This is less invasive than the previous proposal but still manages to tease apart the potential confusion of having two definitions of what the range of a scan is.

dagewa commented 5 months ago

In preparation for tackling this next Tuesday, I wanted to refresh my mind about the issue. In my opinion the most confusing and dangerous (because it leads to wrong assumptions) part of this is the simplest to fix. It is the naming of get_array_range. As noted above this has nothing to do with the data array. It is a tuple, the first value of which is always one less than the first image number. Even the comment above the function is wrong. It isn't zero-based! If your first image is 100, then the first value of the so-called array_range is 99.

https://github.com/cctbx/dxtbx/blob/6818056ac49faf6b03eb8dfe13469f684ea67b13/src/dxtbx/model/scan.h#L278-L281

One of the suggestions above is to rename this function get_z_range (and similarly for the other misnamed "*array*" functions. I still like this idea.