hadjipantelis opened 1 year ago
I think the solution to this is not doing any interpolation along the "x" dimension. If we do any smoothing, it should happen when computing the hdi itself, in a similar way to how quantile computation does it (this is the discussion in the other issue you linked; the main con is that it is not clear how to go about this for hdi, at least to me).
I would support not smoothing by default. I am also always in favour of adding better docs. The whole smoothing procedure, along with any potential warnings, should ideally be documented in the notes section. See https://numpydoc.readthedocs.io/en/latest/format.html#notes for reference on the documentation style we aim to follow. Very happy to provide support if you can help with that.
I am not sure it is possible to compute things in the log-odds domain and then transform these back. Computing quantiles should give the same result before and after a monotonic transformation, but even for monotonic transformations, the mean of the transformed samples is not the same as the transformation of the mean; the same happens for the hdi.
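As a quick numeric illustration of this point (a sketch with made-up normal samples and the inverse logit as the monotonic transformation, not ArviZ code):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(1.0, 1.5, size=10_000)   # hypothetical samples in log-odds space
expit = lambda v: 1 / (1 + np.exp(-v))  # inverse logit, a monotonic transformation

# Quantiles commute with a monotonic transformation
# (up to tiny interpolation differences):
q_after = np.quantile(expit(x), 0.9)
q_before = expit(np.quantile(x, 0.9))
assert np.isclose(q_after, q_before)

# The mean does not commute: the transformation of the mean is
# not the mean of the transformed samples (Jensen's inequality).
assert abs(expit(x).mean() - expit(x.mean())) > 0.01
```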
(Sorry @OriolAbril, I missed this when you replied originally.)
Is this an issue with arviz or bambi though? The problem with plot_hdi is that `_hdi` computes the `interval_width` in the probability domain. If that interval was calculated in the log-odds domain, and we got the bounds in that domain and then transformed them to the probability scale, they should be correct. Here is an example to try and illustrate the different interpolations:
```python
import numpy as np
import arviz as az

rng = np.random.default_rng(42)
a = rng.normal([0, 0, 0, 1, 1, 1, 1, 1, 1], size=(10, 9))[:, [0, 6, 7]]
# a now has shape (10, 3)
```
We will take the first dimension to be the draw/sample dimension and the second one to be the x dimension, with x taking the values 0, 0.3, 0.6.
When we use NumPy's quantile function, we can get values that are not exactly any of the values in our array: it interpolates between the two closest values and returns a value that lies between them. It is therefore not possible for quantile to return values larger than the maximum of the array.
```python
np.quantile(a, 0.9, axis=0)
# array([0.61544513, 1.62616018, 1.49426994])
```
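As an aside, this interpolation behaviour can be seen directly with a tiny array (a sketch; the `method` argument of np.quantile requires NumPy >= 1.22):

```python
import numpy as np

data = np.array([0.0, 1.0, 2.0, 3.0])

# Linear interpolation (the default): the 0.9 quantile falls at virtual
# position 0.9 * (4 - 1) = 2.7, so we get 2 + 0.7 * (3 - 2) = 2.7
q_linear = np.quantile(data, 0.9)

# "nearest" picks an actual array element instead
q_nearest = np.quantile(data, 0.9, method="nearest")

assert q_linear <= data.max()  # interpolation can never exceed the max
assert q_nearest in data       # this method returns a member of the array
```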
Here are the two largest values in each column for comparison:
```python
np.sort(a, axis=0)[-2:, :]
# array([[0.58622233, 1.62559039, 1.48074666],
#        [0.8784503 , 1.63128823, 1.61597942]])
```
HDI, on the other hand, returns exactly a value from the array:
```python
az.hdi(a[None, ...])[:, 1]
# array([0.8784503 , 1.63128823, 1.61597942])
```
This makes HDI less "stable" and leads to the "wigglier" plot shown in https://github.com/arviz-devs/arviz/issues/2168#issue-1464918204. To try and fix this a bit, plot_hdi does some interpolation too.
However, this interpolation is radically different from the one in quantile. Here we take the values returned by HDI and their corresponding x values and find a polynomial function that goes through those points. We have 3 points, so we can make an exact quadratic interpolation (this is not the method used by plot_hdi, but it is illustrative). It looks like this:
It can clearly be seen that it goes over the maximum values present in the array, which is something I think we don't want to happen, and it can break model constraints.
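The overshoot can also be checked numerically. Here is a sketch that fits the exact quadratic through the three HDI upper bounds from the az.hdi output above (using np.polyfit for illustration; this is not the actual plot_hdi code path):

```python
import numpy as np

x = np.array([0.0, 0.3, 0.6])
# HDI upper bounds from the az.hdi output above
y = np.array([0.8784503, 1.63128823, 1.61597942])

coefs = np.polyfit(x, y, deg=2)      # exact quadratic through the 3 points
grid = np.linspace(0.0, 0.6, 201)
interp = np.polyval(coefs, grid)

# The interpolated curve peaks around 1.72, above the largest
# value present in the samples (~1.631).
assert interp.max() > y.max()
```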
Referencing https://github.com/arviz-devs/arviz/issues/2168#issue-1464918204 again, the 2nd plot looks much smoother, but there is no interpolation along the x axis. If we compute the quantiles column-wise we'll get the exact same values, and they will always preserve the constraints imposed by the samples. This is why I was advocating for a similar interpolation scheme in hdi, so we can then get smoother-looking results without the need for interpolation/smoothing along the x dimension.
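For illustration, here is a sketch (with made-up sigmoid-transformed draws, not Bambi output) of how column-wise quantiles always stay within the sample bounds, so constraints such as p <= 1 are preserved:

```python
import numpy as np

rng = np.random.default_rng(0)
# hypothetical posterior draws of probabilities, shape (draws, x points)
p = 1 / (1 + np.exp(-rng.normal(1.5, 1.0, size=(500, 20))))

upper = np.quantile(p, 0.97, axis=0)   # column-wise upper band
assert np.all(upper <= p.max(axis=0))  # never exceeds the samples...
assert np.all(upper <= 1.0)            # ...so the bound is a valid probability
```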
Yes, I see your point. Just a clarification: in #2168 I question whether using the standard t-distribution approximation is preferable to using percentile CIs. Which of the two would you consider more appropriate? (Or neither.)
**Describe the bug**
HDI smoothed intervals can result in impossible values. Furthermore, the smoothness parameter can have an unexpectedly large effect.
**To Reproduce**
Bambi code (somewhat irrelevant for the issue)
Plotting the bands now. Default behaviour in red.
**Expected behavior**
Not have the HDI upper limit go above 1.0 when using smoothing. Maybe do not smooth by default? I appreciate that the HDIs should probably be computed in the log-odds domain, but here we are. This is likely going to be the case for many "back-transformed" estimates. Would it be reasonable to add a "link" or a "family" argument, so the HDI calculations are done in their relevant domain? (Secondary point: I was a bit surprised that smoothing with a lower-order polynomial results in such visibly different HDIs; maybe add a comment about it in the docs?)
Issue is loosely related to #2168.