KaveIO / PhiK

Phi_K correlation analyzer library

bin_edges: the 1e-14 offset is lost to floating-point rounding, so np.searchsorted puts the minimum values in bin 0 (underflow). #60

Closed MLecardonnel closed 8 months ago

MLecardonnel commented 1 year ago

When executing phik.binning.bin_edges, the 1e-14 subtracted from the lowest bin edge can be lost to floating-point rounding, depending on the magnitude of the minimum value: once 1e-14 is too small relative to the spacing between adjacent floats, the subtraction has no effect. This has a big impact on the minimum bin edge, because phik.binning.bin_array then flags the minimum values as underflow.

The offset is preserved here:

import numpy as np
import pandas as pd

nbins = 10
arr = pd.Series([128, 129, 130, 131])

xbins = np.linspace(
    np.min(arr[~np.isnan(arr)]) - 1e-14, np.max(arr[~np.isnan(arr)]), nbins + 1
)
xbins[0]
127.99999999999999

But the offset is rounded away for values above 128, for example:

nbins = 10
arr = pd.Series([129, 129, 130, 131])

xbins = np.linspace(
    np.min(arr[~np.isnan(arr)]) - 1e-14, np.max(arr[~np.isnan(arr)]), nbins + 1
)
xbins[0]
129.0
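The difference between the two cases comes down to the spacing of double-precision floats, which doubles at every power of two. The following is only a minimal sketch for inspecting that spacing; np.spacing and np.nextafter are not used in the snippets above and are brought in here purely for illustration.

```python
import numpy as np

# Distance from a float to the next representable float above it.
gap_at_128 = np.spacing(128.0)  # 2**-45
gap_at_129 = np.spacing(129.0)  # 2**-45

# Just below 128 (a power of two) the grid is twice as fine.
gap_below_128 = 128.0 - np.nextafter(128.0, 0.0)  # 2**-46

# 1e-14 is more than half of the gap below 128, so the result
# rounds to the previous representable float.
shifted_128 = 128.0 - 1e-14

# 1e-14 is less than half of the gap below 129, so the result
# rounds back to 129.0 itself.
shifted_129 = 129.0 - 1e-14

print(shifted_128 < 128.0)   # True
print(shifted_129 == 129.0)  # True
```

With a fixed absolute offset of 1e-14, this suggests the subtraction is a no-op for any minimum whose neighboring floats are more than 2e-14 apart.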

Thus, when phik.binning.bin_array executes binned_arr = np.searchsorted(bin_edges, arr).astype(object), every value equal to the minimum is assigned bin 0. The next statement, binned_arr[np.argwhere(binned_arr == 0)] = defs.UF, then marks all of them as underflow. In the end, all the affected records are dropped when drop_underflow=True (the default), for example in phik.phik_matrix.

import pandas as pd
from phik.binning import bin_array, bin_edges

nbins = 10
arr = pd.Series([129, 129, 130, 131])

xbins = bin_edges(arr.astype(float), nbins)

bin_array(arr.astype(float).values, xbins)
(array(['UF', 'UF', 5, 10], dtype=object), [(129.8, 130.0), (130.8, 131.0)])
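The bin-0 assignment can be reproduced with NumPy alone. This is a minimal sketch that rebuilds the edges with the same np.linspace expression as above instead of calling phik; the variable names are illustrative.

```python
import numpy as np

values = np.array([129.0, 129.0, 130.0, 131.0])

# The lowest edge equals the minimum, because 129.0 - 1e-14 rounds back to 129.0.
edges = np.linspace(values.min() - 1e-14, values.max(), 10 + 1)

# With the default side="left", searchsorted returns the first index i
# such that edges[i] >= value, so a value equal to edges[0] gets index 0.
indices = np.searchsorted(edges, values)

print(edges[0] == values.min())  # True
print(indices[:2])               # both minimum values land in bin 0
```

Index 0 is exactly the value that bin_array replaces with defs.UF, which is why the two records at the minimum end up as underflow.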

I don't think this is the intended behavior of combining phik.binning.bin_edges with np.searchsorted, since underflow and overflow values were originally designed for custom bins.

mbaak commented 1 year ago

Thanks for reporting this, I will have a look and get back to you. (We can try to resolve this issue for the next patch release.)

MLecardonnel commented 12 months ago

To solve the problem I would propose something like this:

import sys
from typing import Union

import numpy as np
import pandas as pd

def bin_edges(
    arr: Union[np.ndarray, list, pd.Series], nbins: int, quantile: bool = False
) -> np.ndarray:
    """
    Create uniform or quantile bin-edges for the input array.

    :param arr: array like object with input data
    :param int nbins: the number of bins
    :param bool quantile: uniform bins (False) or bins based on quantiles (True)
    :returns: array with bin edges
    """

    if quantile:
        quantiles = np.linspace(0, 1, nbins + 1)
        xbins = np.quantile(arr[~np.isnan(arr)], quantiles)
        xbins[0] -= max(1e-14 * abs(xbins[0]), sys.float_info.min)
    else:
        min_value = np.min(arr[~np.isnan(arr)])
        constant = max(1e-14 * abs(min_value), sys.float_info.min)
        xbins = np.linspace(
            min_value - constant, np.max(arr[~np.isnan(arr)]), nbins + 1
        )

    return xbins
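
A quick way to sanity-check the offset rule in this proposal is to apply it to minimum values of different magnitudes. The sketch below isolates the rule in a hypothetical helper, lower_edge, which is not part of phik, and checks that the minimum no longer falls into bin 0.

```python
import sys

import numpy as np


def lower_edge(min_value: float) -> float:
    """Hypothetical helper: lowest bin edge using the offset rule proposed above."""
    # Relative offset of 1e-14, with an absolute floor so that 0.0 is also shifted.
    constant = max(1e-14 * abs(min_value), sys.float_info.min)
    return min_value - constant


for min_value in [0.0, 1.0, 129.0, -129.0, 1e10]:
    edge = lower_edge(min_value)
    # The edge is strictly below the minimum for every magnitude tested...
    assert edge < min_value
    # ...so searchsorted places the minimum in bin 1 instead of bin 0.
    assert np.searchsorted([edge, min_value + 1.0], min_value) == 1
```

Scaling the offset with the magnitude of the minimum keeps it well above the local float spacing, which a fixed 1e-14 cannot guarantee.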

mbaak commented 8 months ago

Better late than never. :-)

Fixed in https://github.com/KaveIO/PhiK/pull/83. It will be included in the upcoming patch release (later this week).