OxIonics / ionics_fits

Small Python fitting library with an emphasis on Atomic, Molecular and Optical physics
Apache License 2.0

Inconsistent arrays in fitter #97

Closed: AUTProgram closed this issue 11 months ago

AUTProgram commented 11 months ago

I have seen problems when doing calculations that involve the array of y-values in the fitter and the array returned by model.func(). For example, the operation y - y_fit in method NormalFitter.chi_squared() results in a 2D array when called as part of an experiment. This then leads to an incorrect chi_squared value. The problem here lies in the interaction between ionics_fits classes and the experiments that pass results data into those classes during fitting. This was introduced when we started to support models containing multiple y-channels.

AUTProgram commented 11 months ago

There are two issues on the ionics_fits side. Firstly, model functions are in principle expected to return arrays of shape (num_samples, num_y_channels), as indicated by the function annotation in the Model base class. This means that model functions should always return 2D arrays, and the shape for a model with just one y-channel should be (num_samples, 1). However, our current models containing only a single y-channel don't actually do this. For example, the annotations for Rabi._func and RabiFlopTime.func both instead say that the returned array has shape (num_samples,), i.e. a 1D array. This matches the actual implementation: if a 1D array is passed for x, then a 1D array is returned by func().

Secondly, the order of axes for the array returned by _func() or func() is such that the sample values correspond to the first axis, while the different y-channels correspond to the second axis.

Apart from the arrays returned by the model functions, the other thing that matters is what kind of arrays are passed into the fitter. Whether it matters that 1D or 2D arrays are used depends on how exactly the arrays are handled. Ultimately, the problem comes down to the fact that arrays of shape (n_samples,), (n_samples, 1) and (1, n_samples) are not necessarily equivalent in numpy. In particular, an operation involving (n_samples, 1) together with either of the other two shapes will broadcast to shape (n_samples, n_samples), which is the underlying reason for the incorrect calculations mentioned in the first paragraph.
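A minimal sketch of the failure mode (the variable names and sizes are illustrative, not the actual fitter internals):

import numpy as np

n = 5
y = np.linspace(0.0, 1.0, n)   # user's y-data, shape (n,)
y_fit = y.reshape(n, 1)        # model output of shape (num_samples, num_y_channels)

residuals = y - y_fit          # (n,) and (n, 1) broadcast to (n, n)
print(residuals.shape)         # (5, 5) instead of the expected (5,)
print(np.sum(residuals ** 2))  # a chi-squared-style sum silently runs over n*n terms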

hartytp commented 11 months ago

Yep, this is all correct. Another annoyance here is that np.atleast_2d converts (n_samples,) into (1, n_samples), which leads to a few ugly things in the codebase.

In hindsight, I think it would have been better to have made the layout (num_y_channels, num_samples) instead. With that change, we could have the fitter base class pass the return value of model.func through np.atleast_2d to ensure the fit results always have a correct and consistent shape. I'm not sure how much work that would be in practice and whether it's worth fixing now though...

AUTProgram commented 11 months ago

> In hindsight, I think it would have been better to have made the layout (num_y_channels, num_samples) instead.

I'm not sure about this. Intuitively, I would put the variable that actually contains the value along the first axis and the variable that corresponds to different channels on the second axis (which is what we have now). This is also consistent, for example, with how matplotlib handles multiple y-series. It also doesn't really matter what order is chosen, as long as it's done consistently everywhere and the correct array shapes are used (although picking one particular order might make the implementation a bit easier, as @hartytp mentions above). The problem is that we're not being consistent at the moment.

> I'm not sure how much work that would be in practice and whether it's worth fixing now though...

I keep running into this problem at the moment when trying to implement fit validation checks and to correctly handle fitting for multiple y-channels (as needed for two-qubit gates, for example). It's also a footgun for any future models and functionality. The problem is that the behaviour doesn't result in exceptions, only in incorrect calculations, which are harder to track down. Given that we've only introduced support for multiple y-channels recently, I'd rather fix this now.

AUTProgram commented 11 months ago

For context: the complications in the implementation come from the rules that numpy applies when combining arrays with unequal shapes. For example, numpy's broadcasting rules mean that when adding two arrays (or performing any other binary operation), the shapes combine according to:

(n,) + (n,) = (n,)
(n,) + (1, n) = (1, n)
(n,) + (n, 1) = (1, n) + (n, 1) = (n, n)

This means that when operating on a 1D and a 2D array, the outcome depends on the shape of the 2D array. In both cases, the arrays are broadcast to a common shape, but the way in which they are broadcast differs.
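These rules are easy to verify directly (np.broadcast_shapes requires numpy >= 1.20):

import numpy as np

print(np.broadcast_shapes((5,), (5,)))      # (5,)
print(np.broadcast_shapes((5,), (1, 5)))    # (1, 5)
print(np.broadcast_shapes((5,), (5, 1)))    # (5, 5)
print(np.broadcast_shapes((1, 5), (5, 1)))  # (5, 5)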

The np.atleast_2d() function promotes 1D arrays to 2D by prepending another axis, so np.atleast_2d(a) is the same as a[np.newaxis, :]. This means that an array of shape (n,) is promoted to an array of shape (1, n). If this is not desired, because one wants to promote 1D arrays to 2D by appending a new axis, then instead a[:, np.newaxis] can be used.

But I would not say that numpy inherently favours prepending axes when promoting arrays. For example, when using np.atleast_3d() on a 1D array of shape (n,), the new shape is (1, n, 1), where new axes are both prepended and appended. So I don't think there is a "natural" way within numpy of adding new axes to lower-dimensional arrays with the np.atleast_xxx functions. For np.atleast_2d, the numpy developers picked one of the two possible options, presumably because their choice means that (n,) + np.atleast_2d((n,)) has shape (1, n), which is the same as np.atleast_2d((n,) + (n,)). But note that the analogous statement is not true for np.atleast_3d.
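For concreteness, a quick demonstration of the different promotion behaviours:

import numpy as np

a = np.arange(4)               # shape (4,)
print(np.atleast_2d(a).shape)  # (1, 4): new axis prepended
print(a[:, np.newaxis].shape)  # (4, 1): new axis appended by hand
print(np.atleast_3d(a).shape)  # (1, 4, 1): axes both prepended and appended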

hartytp commented 11 months ago

Exactly: the way numpy is set up, if there is one dimension that is sometimes missing, it really needs to be the first dimension, not the second dimension.

AUTProgram commented 11 months ago

> Exactly: the way numpy is set up, if there is one dimension that is sometimes missing, it really needs to be the first dimension, not the second dimension.

Only if you insist on using np.atleast_2d to add the other dimension. Broadcasting doesn't favour prepending or appending, it just takes two arrays of unequal shapes and promotes them to a common shape.

But anyway, I'll have a look at how tricky it would be to keep the current ordering; otherwise we might change it.

hartytp commented 11 months ago

> Only if you insist on using np.atleast_2d to add the other dimension. Broadcasting doesn't favour prepending or appending, it just takes two arrays of unequal shapes and promotes them to a common shape.

Is that right? My concern here is that if the user is mainly using data shaped as (n_samples,) while ionics_fits is internally using (n_samples, 1), it's really easy to get mismatches which end up accidentally creating (n_samples, n_samples) arrays. That makes me think that numpy does play more naturally with the elided dimension being first, not second (as we currently have it).

hartytp commented 11 months ago

Chatting this over with @AUTProgram I think there are a few factors here:

  1. If we want to support people eliding the number of y channels (i.e. providing a 1D array rather than a 2D array), and if that's the only elided dimension, then from a purely numpy perspective it's probably easier to shape things as (num_y_channels, num_samples). That way one can do maths between (num_samples,) and (1, num_samples) and get the correct result (see the sketch after this list).
  2. If we want to make ionics_fits work most ergonomically with matplotlib we should stick with (num_samples, num_y_channels). It's worth noting that matplotlib and numpy aren't totally internally consistent nor are they totally consistent with each other so this is something we shouldn't feel obliged to stick to...but plotting fit results is very common and making that less simple is a clear negative.
  3. It's possible that at some point in the future we will support multi-dimensional x data. If so, we will have more than one potentially elided dimension and the consideration in (1) no longer applies. I think this is something we should weight very weakly, because it's not clear that we will ever support multi-dimensional x data (or what that feature would mean), and a bad outcome here would be making the toolkit harder to use in order to keep the possibility open for speculative features that never get implemented.
  4. We should aim for ionics_fits to have clear and consistent semantics (e.g. 1D arrays are always promoted to 2D by fitters) and make it easy for model developers to "do the right thing" (e.g. don't force them to deal with promotion / squeezing inside the model to avoid creating nasty corner-cases).
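As a concrete illustration of points 1 and 4, here is a hypothetical promotion step under the (num_y_channels, num_samples) layout (promote_y is an illustrative name, not part of ionics_fits):

import numpy as np

def promote_y(y):
    # np.atleast_2d turns (num_samples,) into (1, num_samples),
    # i.e. a single y-channel under the channels-first layout
    return np.atleast_2d(np.asarray(y))

y_user = np.linspace(0.0, 1.0, 5)   # shape (5,)
y_internal = promote_y(y_user)      # shape (1, 5)
print((y_user - y_internal).shape)  # (1, 5): broadcasts correctly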

Having said all of that, I don't have a strong view of what is the least bad option here.

On a day-to-day basis, (2) is probably the one that will be the most immediately visible / annoying to users. I really like the fact that one can do things like plt.plot(*fit.evaluate()) and losing that would be a big shame. However, the issues here are at least obvious and easy to fix. In contrast, (1) and (4) are essentially about avoiding corner cases which probably don't come up too much in practice...but when they do, they can be a real PITA to debug (numpy broadcasting rules mean it often silently produces a silly outcome, and it can be hard to figure out where things actually went wrong).

So, I think my position here is: I agree that whatever we do we should make ionics_fits internally consistent. Beyond that, I'm happy to take whichever approach you prefer - if we find we don't like it, it's not a huge deal to change later on.

AUTProgram commented 11 months ago

@hartytp Thanks for summarising our in-person discussion.

After considering the issue a bit more, I have come to believe that it's actually more advantageous to change the ordering of samples and y-channels to (num_y_channels, num_samples).

I admit that personally I would find it more intuitive to keep the ordering as we have now. This is because in the 1D case, samples are stored along axis = 0. When moving to 2D arrays, it therefore seems more intuitive to keep storing samples along axis = 0 and y-channels along axis = 1, which results in the current (num_samples, num_y_channels) shape. Additionally, this is also the ordering that matplotlib uses.

However, given the broadcasting rules of numpy, this can lead to problems when handling arrays of different shapes. Anyone developing ionics_fits would need to be very careful and consider different use cases. More importantly, even if ionics_fits handles arrays correctly and consistently, an unnecessary load may be placed on the user to ensure correct array shapes when passing data or using fit results, otherwise unexpected behaviour might occur, with bugs that don't raise errors, but give incorrect numerical results. Therefore, I prefer to change the ordering, with the following reasoning:

1) It can be argued that in the 1D case, samples are stored along axis = -1 (because there is only one axis). When moving to 2D arrays, it therefore seems intuitive to keep storing samples along axis = -1 and to store y-channels along the added axis, which is axis = 0 in this case. This results in the shape (num_y_channels, num_samples).

2) Most models have one independent variable and one dependent variable. I believe it's therefore strongly preferable that the user can pass a 1D numpy array (ndim = 1) of y-data to a model or fitter and get valid results. They should not have to worry about promoting the array to 2D (ndim = 2, with one axis of length one) before passing it in. Besides, if ionics_fits were to support two independent variables at some point, the user might even have to promote to a 3D array to get valid results. This feels very unergonomic and would put me off as a user.

3) Some models have more than one dependent variable, which means an additional axis is required to hold the data for the different y-channels. Given that we want to support 2), different models and fitters will be passed either 1D or 2D y-data, so the implementations of Fitter and the classes derived from it need to handle arrays of different shapes. To make development easier (not having to think about what kind of array has been passed, and no if-statements for special cases) and to improve readability, it's preferable to promote any array passed into a fitter to a common shape. For now, the largest shape supported is 2D (though if multiple independent variables were supported, this might be even higher-dimensional), so that common shape is a 2D array.

4) This leaves the question of how 1D arrays should be promoted to 2D arrays and which order should be chosen for the y-channels. One possible option is (num_samples, num_y_channels), which is what we're doing at the moment. We therefore effectively promote (n,) to (n, 1) internally and store the reshaped x, y (and sigma, if relevant) as attributes of the fitter, after some data validation. But these attributes can be freely accessed by the user (and should be freely accessible, in my opinion). Now consider, for example, the case where the user passes 1D x- and y-arrays of shape (n,), all values are found to be valid, and the user calculates y - fit.y. Because of numpy's broadcasting rules, this results in an array of shape (n, n) containing all pairwise differences, instead of the expected all-zeros array of shape (n,). I strongly dislike this and would not expect it as a user. The fitter may still be doing the correct thing, but the user's x- and y-arrays are no longer consistent with what the fitter stores. The same problem occurs with models that output 2D arrays with one axis of length one, as currently happens for RepeatedModel, for example: calculating y - fit.model.func(x) results in an (n, n) array. This is the underlying cause of the bug mentioned at the top of this issue, where chi_squared is calculated incorrectly. Importantly, the user can't even apply np.squeeze() to get rid of the unwanted axes, because neither axis has length one.

5) Therefore, I prefer the ordering (num_y_channels, num_samples). This would still result in 2D arrays when calculating y - fit.y and y - fit.model.func(x), but they would have shape (1, n) and the user could squeeze away the unwanted axis (see the sketch below).

In general, I would expect calculations to work as intended, with the slight annoyance that sometimes an axis of length one is added.
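A sketch contrasting the two layouts for the y - fit.y example above (a reshaped copy of y stands in for the fitter's stored attribute):

import numpy as np

n = 5
y = np.linspace(0.0, 1.0, n)         # user's 1D y-data, shape (n,)

# Current layout, samples first: the fitter stores y as (n, 1)
print((y - y[:, np.newaxis]).shape)  # (5, 5): pairwise differences, not squeezable

# Proposed layout, channels first: the fitter stores y as (1, n)
diff = y - y[np.newaxis, :]
print(diff.shape)                    # (1, 5): correct values, one extra axis
print(np.squeeze(diff).shape)        # (5,): the unwanted axis is removable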

Ultimately, this all comes down to the first rule of broadcasting: shapes are compared element-wise, starting with the trailing dimension. Personally, I would have preferred it to start with the leading dimension, but unfortunately that can't be changed.

It's possible to argue about points 1-5 above or to have a different preference, but at least this post now documents the reasoning behind the chosen ordering, and we can come back to it in the future if we want to change the behaviour.

hartytp commented 11 months ago

@AUTProgram thanks for the clear summary. For the avoidance of doubt, when we say that matplotlib doesn't like the changes, we mean something like

import numpy as np
from matplotlib import pyplot as plt

x = np.atleast_2d(np.arange(10))  # shape (1, 10)
y = np.atleast_2d(np.arange(10))  # shape (1, 10)

plt.plot(x, y)  # matplotlib plots each column as a separate series
plt.show()

will try to plot 10 single-point curves rather than a single 10-point curve. How annoying! However, plt.plot(x.T, y.T) works as desired.

This really annoys me! I really like the pattern plt.plot(*fit.evaluate()) and it makes me sad to lose it!

hartytp commented 11 months ago

The only other thing I can think of is offering the guarantee that all results from the fitter will be squeezed, but that also feels annoying and will lead to inconsistencies. boo

AUTProgram commented 11 months ago

> The only other thing I can think of is offering the guarantee that all results from the fitter will be squeezed, but that also feels annoying and will lead to inconsistencies. boo

Even if you squeeze, a 2D array resulting from multiple y-channels will remain 2D, and then you still need to transpose. It's only for models with a single y-channel that squeezing would simplify things. What we could do is let fit.evaluate() return a transposed y-array, but that feels inconsistent and prone to causing problems in other places.

AUTProgram commented 11 months ago

> This really annoys me! I really like the pattern plt.plot(*fit.evaluate()) and it makes me sad to lose it!

I would have preferred to keep that functionality too, but for the reasons outlined above, I think that losing this particular neatness is the lesser of two evils in this case. People can still do x_fit, y_fit = fit.evaluate() and plt.plot(x_fit, y_fit.T), or squeeze y_fit if there's only a single y-channel.

hartytp commented 11 months ago

We could maybe also have fit.evaluate(transpose=True) to return the data in transposed format for this reason?

AUTProgram commented 11 months ago

> We could maybe also have fit.evaluate(transpose=True) to return the data in transposed format for this reason?

Yes, that sounds like a viable option. I will add this functionality.
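A hypothetical sketch of what that could look like, with a free function standing in for the method (names and signature assumed, not the actual ionics_fits API):

import numpy as np
from matplotlib import pyplot as plt

def evaluate(x, y_model, transpose=False):
    # internally, y is stored channels-first: (num_y_channels, num_samples)
    y = np.atleast_2d(y_model)
    # transpose=True returns (num_samples, num_y_channels), the
    # orientation matplotlib expects for multiple series
    return (x, y.T) if transpose else (x, y)

x = np.linspace(0.0, 1.0, 10)
y_model = np.sin(x)                              # single y-channel, shape (10,)
plt.plot(*evaluate(x, y_model, transpose=True))  # one 10-point curve
plt.show()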

AUTProgram commented 11 months ago

Resolved in https://github.com/OxIonics/ionics_fits/pull/98 and https://github.com/OxIonics/ionics_fits/pull/99