data-apis / array-api-extra

Extra array functions built on top of the array API standard.
http://data-apis.org/array-api-extra/
MIT License
3 stars 1 forks source link

ENH: a simple `cov` #10

Closed lucascolley closed 2 weeks ago

lucascolley commented 1 month ago

This would replace a private utility in SciPy.

lucascolley commented 1 month ago

cov

This PR implements a small subset of numpy.cov, namely cov(m), to replace a private helper in SciPy.


Good to proceed? EDIT: edited to make mean private.

rgommers commented 1 month ago

Why add mean? It seems from https://github.com/data-apis/array-api/issues/846 that it's only an oversight that complex support wasn't added, and that can be done. I would just finish that discussion, and then likely nothing else has to change (except perhaps deleting a check in array-api-compat). Moving usages of mean over to this package and then back a little later sound like quite a bit of churn for no real reason.

lucascolley commented 1 month ago

We could keep complex mean private here until there is a reported use case for it, and remove it once it is supported in array-api-compat - I think that sounds like a better idea than exposing it just because we can.

rgommers commented 1 month ago

That sounds good to me.

asmeurer commented 1 month ago

and then likely nothing else has to change (except perhaps deleting a check in array-api-compat)

I guess you mean array-api-strict. Complex mean already works with the upstream libraries (at least NumPy and PyTorch, I didn't check the others), and will work with array-api-compat.

Also, since array-api-strict now supports multiple versions, we can easily add a preliminary 2024.12 version and disable the complex check in mean for that version, even before that version of the standard is released.

kgryte commented 2 weeks ago

NumPy, JAX and PyTorch all agree on the same default behaviour for single-argument cov (unbiased estimate)

FWIW, TensorFlow disagrees here.

This aside, it is somewhat unfortunate that the default behavior in NumPy for cov (bias = False) diverges from the default behavior of var and std. The use of bias appears to be legacy, being subsequently supplanted by ddof. If we were designing from first principles and considering adding to the standard, I'd argue for being consistent with existing precedent in the standard, rather than having diverging conventions, especially as, while N - 1 is common, it is by no means the universally appropriate normalization factor, and my inclination would be to force users to be explicit about the normalization factor they believe suitable.

rgommers commented 2 weeks ago

If we were designing from first principles and considering adding to the standard, I'd argue for being consistent with existing precedent in the standard

But we aren't here, so I think matching what NumPy, JAX and PyTorch was fine for this PR. Making changes just makes it harder to use, so let's not cross that bridge until we have to for standardization (which may well be never).

kgryte commented 2 weeks ago

But we aren't here... Making changes just makes it harder to use

I'm not sure that is the case. TMK, Array API extras is not restricted to being a compat layer, and as an independent library, could be free to make design choices which (mildly in this case) diverge from existing libraries such as NumPy which carry historical context.

The notion of "harder to use" is somewhat in the eye of the beholder here. Yes, following existing NumPy practice (which is based on a bias boolean kwarg which would be highly unlikely to be standardized given standardized var and std convention) would mean that downstream libraries such as SciPy could simply swap np.cov for xpx.cov, but this also means persisting the additional cognitive burden among future code readers in recalling that default conventions differ between the univariate and multivariate cases. Personally, I place greater value on the latter than the former, and more so in this case as Array API extras has some freedom to evolve the story a bit.

rgommers commented 1 week ago

but this also means persisting the additional cognitive burden among future code readers in recalling that default conventions differ between the univariate and multivariate cases

I don't think this is an extra burden, I'm reasonably confident that matching cov in other libraries is more important and less confusing on balance than matching std/var would be. You'd be right for a reader that knows the math, doesn't know or use any other array library, and comes at this fresh using only the standard and array-api-extra - but that will be rare.