Open mdhaber opened 2 months ago
Thanks for the proposal @mdhaber. I don't quite have an opinion yet - I think it in part depends on the situation with methods (see below).
Also - how much do you actually need this? I only count two instances of it being used in SciPy, one of which is a test case. I just did a grep, so I may be missing some dynamic usage perhaps. The one instance is pretty simple, no keyword usage:
Only a few methods would be required by the standard,
Which ones? Could we get away with only a default method, and hence no method
keyword?
and if choice of a default is too contentious, perhaps the array-API can consider
method
to be a required keyword argument, and libraries would be welcome to keep their own default.
Related is our previous discussion on median
, which we left out due to implementation difficulty in distributed contexts. That stated, Dask does implement median
, so may be time to revisit this.
Also - how much do you actually need this?
Personally, I don't need it very badly. You're right that quantile
/percentile
only appear in a few stats
functions (stats.bootstrap
, stats.iqr
, and some distribution fitting code), and it is easy to compute quantiles
using existing array-API functions namely (sort
). However, when only a few percentiles
are needed, it is more efficient to use a partition approach. One could propose adding partition
, since it is more general, but I would only use it for quantile
.
Which ones? Could we get away with only a default method, and hence no method keyword? The standard reference is Sample Quantiles in Statistical Packages.
The current variation in sample quantile definitions causes confusion, and so there is a need to standardize the definition of sample quantile across packages and within packages... and we propose that $Q_8$ is the best choice
$Q_8$ corresponds with NumPy's median_unbiased
. However, $Q_7$ (linear
) is the default in most libraries. It would be unfortunate to have to choose between the two. $Q_5$ (hazen
) also satisfies many desirable properties. So I suppose I would suggest having these three in that order of priority.
One wrench I'd like to throw into my own proposal:
The current behavior of np.quantile(a, q, axis)
(and other implementations that follow NumPy) is for the quantiles specified by q
, a 1D array-like, to be computed for all axis
-slices of the data a
. This makes it impossible to efficiently compute a different quantile for each axis-slice of a
(which is needed, for example, in a vectorized implementation of the BCA-bootstrap).
I would propose following normal broadcasting rules between a
and q
(possibly with the constraint that q
must be 0D or have length 1
along axis
), which would meet both needs. A hybrid possibility, which would just be an extension of the existing behavior, would be to allow q
to be an N-D array that is broadcastable with a
except along axis
. In this case, for each pair of corresponding axis-slices a_i
and q_i
, it would compute all quantiles of a_i
specified by q_i
.
I did a bit of digging, with the disclaimer that my digging is incomplete, but I'll try to summarize my initial findings below.
I took a peak at which APIs were implemented across array libraries, spot-checked whether/how they were implemented, and did a search for which APIs were used in SciPy and sklearn.
partition
quantile
.median
median
API, instead deferring to percentile
partition
sort
lower
, higher
, midpoint
, nearest
, and linear
percentile
quantile
quantile
linear
, lower
, higher
, nearest
, and midpoint
methodssort
I reviewed scipy.stats
more thoroughly for uses of median
, quantile
, and/or percentile
. The first three functions would benefit from having quantile
in the array API; for the rest, median
would suffice.
scipy.stats.bootstrap
uses percentile
here, but I would be happy to change it to use quantile
. It would benefit from a quantile
function with improved vectorization as described above.scipy.stats.iqr
uses percentile
here, but I'd be happy to change it to use quantile
.scipy.stats.epps_singleton_2samp
uses iqr
here.scipy.stats.siegelslopes
is currently implemented using Pythran, but I think it can be reasonably fast without Pythran if we wanted to use other backends. (Could Pythran dispatch to other backends in the future?) It uses median
here and again outside the loop.scipy.stats.theilslopes
uses median
here.scipy.stats.median_test
, as one might imagine, uses median
here.scipy.stats.levene
uses median
here.scipy.stats.fligner
uses median
here.scipy.stats.mstats.sen_seasonal_slopes
uses np.ma.median
here, but one could argue that this should be implemented in stats
with regular np.median
.There are also uses of partition
in stats.trim
functions (trim_mean
, trimboth
, trim1
).
Most of these functions have something else that would make array API conversion challenging at the moment, but I don't think any have non-starters for array API support. I've omitted uses I don't think will get array API support any time soon (e.g. levy_stable
). There are also uses in other sub-packages (e.g. scipy.ndimage.median_filter), but I'm less familiar with their prospects for array API support.
Ideally it would be nice to have support for weights and filtering missing values automatically.
NumPy 2 recently added support for weighted percentiles/quantiles via the weights
kwarg:
Adding support for nan value value filtering is being discussed in #621.
However this is not necessarily easy to handle both weights and missing values. There is a related pull request in NumPy here:
In scikit-learn we maintain our own _weighted_percentile
helper to deal with this and it's used in several places of our code base.
I'm working on adding array API support in
scipy.stats
(scipy/scipy#20544) and one of the the things I'll need is aquantile
function. If there is some support for this idea, I'll convert this issue into a proper proposal.Looks like there is already wide support:
numpy.quantile
torch.quantile
cupy.quantile
jax.numpy.quantile
dask.dataframe.DataFrame.quantile
tfp.stats.quantiles
xarray.DataArray.quantile
Previous discussions (not much):
There are many conventions for calculating quantiles. Only a few methods would be required by the standard, and if choice of a default is too contentious, perhaps the array-API can consider
method
to be a required keyword argument, and libraries would be welcome to keep their own default.