data-apis / array-api

RFC document, tooling and other content related to the array API standard
https://data-apis.github.io/array-api/latest/
MIT License
204 stars 42 forks source link

RFC: add bincount #812

Open ogrisel opened 3 weeks ago

ogrisel commented 3 weeks ago

It seems quite widely adopted:

Although I have not checked how compatible they are.

JAX in particular requires an extra length argument to statically define the length of the output array to make JIT possible and dask uses (nan,) shape to represent a data-dependent output shapes.

rgommers commented 2 weeks ago

Thanks for the suggestion @ogrisel!

bincount is indeed a heavily used function. I did a quick tally: scikit-learn currently has 117 usages of np.bincount, and SciPy 38. It's also used as a primitive for histogram & co, and it's not easy to implement efficiently with other existing functions in the standard. So from that perspective, I think it meets the bar for inclusion fairly easily.

Signatures do all seem compatible. There is at least one difference in behavior I spotted: the accepted input is non-negative integer values, but the case of negative values is not treated the same - it may raise (NumPy), or clip (JAX).

The NumPy issue tracker contains a lot of open issues about bincount, so the constraints on values, weights etc. seem to need careful specification.

The PyTorch docs also contain a warning about non-deterministic gradients - probably related to the data-dependent behavior:

JAX in particular requires an extra length argument to statically define the length of the output array to make JIT possible and dask uses (nan,) shape to represent a data-dependent output shapes.

Yes, that's non-ideal. CuPy also adds a warning about bincount possibly causing synchronization.

The number of APIs that use data-dependent shapes is still quite low. As of now, I think this is the full list:

NeilGirdhar commented 2 weeks ago

Just curious, but are you sure you want to keep using the name bincount? It seems redundant: "binning" means counting. What about bin_integers, bin, bucket, or integer_histogram?

ogrisel commented 2 weeks ago

To me, 'binning' means assigning to a bin. E.g. an array of continuous values (floating point values) is transformed into an array of discrete values (bin indices) typically represented as integers. This is a synonym of discretization.

Then you can aggregate those values as a histogram of per bin counts if you wish, but it's not the only thing you can do. In machine learning, binned values are used in many various ways and only some involve histograms, and often only filtered subsets of the original binned array.

NeilGirdhar commented 2 weeks ago

To me, 'binning' means assigning to a bin. E.g. an array of continuous values (floating point values) is transformed into an array of discrete values (bin indices) typically represented as integers. This is a synonym of discretization.

Right, I agree that that's what happens when you "bin" floating point values. This function (bincount) is just the integer version of binning floating point values. In this integer case, bins can be discrete values (or they could be ranges even though bincount doesn't support ranges).

Then you can aggregate those values as a histogram of per bin counts if you wish,

I understand what you're saying. But, it's also a fact that the result of bincount is by definition already a histogram. This is clear because you can convert any call to bincount into a similar call to histogram (possibly massaging the output to get the types right and shapes right).

I still think that "bincount" is a bad name. It is just "binning", and its name should reflect the technical term that people use. No one calls it the "bincount" or "bincounting".

If I had to guess, I think its name reflects a very human tendency to combine two names that would have each been fine:

I think someone just glued these two names. This is like when people say "step foot" when "set foot" or "step" would each be fine, but the combination just isn't.

NeilGirdhar commented 2 weeks ago

An alternative to choosing a different name would be to replace bincount with a true integer_histogram:

def integer_histogram(x,
                      /,
                      bins: int | Array,
                      range: None | tuple[int, int] = None,
                      density: bool = False,
                      weights: Array
                      ) -> Array:
    """Compute the histogram of an integer dataset.

    Args:
        x: The input data.  The histogram is computed over the flattened array.
        bins: If bins is an int, it defines the number of equal-width bins in the given range.
            If bins is an integer array, it defines a monotonically increasing array of bin edges,
            including the rightmost edges, which allows for non-uniform bin widths.
        range: The lower and upper range of the bins.  range[1] - range[0] + 1 must be divisible by
            the integer bins.
        weights: An array of weights, of the same shape as a. Each value in a only contributes its
            associated weight towards the bin count (instead of 1). If density is True, the weights
            are normalized, so that the integral of the density over the range remains 1.
        density: If False, the result will contain the number of samples in each bin. If True, the
            result is the value of the probability density function at the bin, normalized such that
            the integral over the range is 1. Note that the sum of the histogram values will not be
            equal to 1 unless bins of unity width are chosen; it is not a probability mass function.
    """

It's just a generalization that supports normalization and general-width bins.

Scanning how bincount is actually used, it seems that normalization is very common. So even if we don't add range and a general bins, I think it might be good to add density to cover this common case.