Closed MLecardonnel closed 10 months ago
Thanks for reporting this, I will have a look and get back to you. (We can try to resolve this issue for the next patch release.)
To solve the problem I would propose something like this:
import sys
from typing import Union

import numpy as np
import pandas as pd


def bin_edges(
    arr: Union[np.ndarray, list, pd.Series], nbins: int, quantile: bool = False
) -> np.ndarray:
    """
    Create uniform or quantile bin-edges for the input array.

    :param arr: array like object with input data
    :param int nbins: the number of bins
    :param bool quantile: uniform bins (False) or bins based on quantiles (True)
    :returns: array with bin edges
    """
    if quantile:
        quantiles = np.linspace(0, 1, nbins + 1)
        xbins = np.quantile(arr[~np.isnan(arr)], quantiles)
        # shift the lowest edge down by a relative amount so it stays below the minimum
        xbins[0] -= max(1e-14 * abs(xbins[0]), sys.float_info.min)
    else:
        min_value = np.min(arr[~np.isnan(arr)])
        # relative shift; sys.float_info.min covers the case min_value == 0
        constant = max(1e-14 * abs(min_value), sys.float_info.min)
        xbins = np.linspace(
            min_value - constant, np.max(arr[~np.isnan(arr)]), nbins + 1
        )
    return xbins
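A quick way to sanity-check the relative shift in the proposal above is to apply it to minimum values of very different magnitudes and confirm the shifted edge always ends up strictly below the minimum. This is only a sketch: the helper name shifted_lower_edge and the sample values are made up for illustration and are not part of PhiK.

```python
import sys


def shifted_lower_edge(min_value: float) -> float:
    """Hypothetical helper: lower bin edge using the relative shift from the proposal."""
    # 1e-14 * |min_value| scales with the magnitude of the value;
    # sys.float_info.min is the fallback when min_value is zero.
    shift = max(1e-14 * abs(min_value), sys.float_info.min)
    return min_value - shift


# Sample minimum values spanning many orders of magnitude (illustrative only).
samples = (0.0, 1.0, 129.0, 1e10, -5e20)

for value in samples:
    edge = shifted_lower_edge(value)
    # The edge must be strictly smaller than the minimum for the shift to have any effect.
    print(value, edge < value)
```

The point of scaling the shift with abs(min_value) is that it stays well above the floating-point spacing at that magnitude, whereas a fixed shift does not.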
Better late than never. :-)
Fixed in https://github.com/KaveIO/PhiK/pull/83. It will be included in the upcoming patch release (later this week).
When executing phik.binning.bin_edges, the bin edges can be rounded, depending on their value, when they exceed the precision that Python floats can encode. This has a big impact on the minimum bin edge, because phik.binning.bin_array then flags the minimum values as underflow values.
The function doesn't round here:
But it rounds for example values above 128:
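The effect can be illustrated in isolation. The sketch below assumes the lower edge is lowered by a fixed absolute amount of 1e-14; that value is an assumption inferred from the proposed fix, and the helper shift_is_lost is hypothetical. It compares the shift with the spacing between neighbouring floats at each value.

```python
import numpy as np

# Assumed fixed shift, used here for illustration only.
ABS_SHIFT = 1e-14


def shift_is_lost(value: float, shift: float = ABS_SHIFT) -> bool:
    """Hypothetical helper: True when value - shift rounds back to value."""
    return value - shift == value


for value in (1.0, 100.0, 129.0, 1000.0):
    # np.spacing gives the gap between value and the next representable float.
    print(value, np.spacing(value), shift_is_lost(value))
```

Once the float spacing grows larger than about twice the shift, the subtraction rounds back to the original value, so the edge is not lowered at all.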
Thus when executing
binned_arr = np.searchsorted(bin_edges, arr).astype(object)
in phik.binning.bin_array, all the minimum values are binned as 0. Because this is followed by
binned_arr[np.argwhere(binned_arr == 0)] = defs.UF
they are all considered underflow values. In the end, all the affected records are dropped by the default parameter drop_underflow=True, for example in phik.phik_matrix.
I think this is not really what is intended from the combination of phik.binning.bin_edges and np.searchsorted, as underflow and overflow values were originally designed for custom bins.
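A minimal reproduction of the np.searchsorted behaviour, independent of PhiK: when the lowest edge equals the minimum of the data, the minimum lands in index 0. The array values and variable names are chosen only for illustration, and the shifted variant uses the relative shift from the proposed fix.

```python
import sys

import numpy as np

# Illustrative data whose minimum is large enough for a tiny fixed shift to be lost.
arr = np.array([1000.0, 1001.0, 1005.0, 1010.0])

# Edges whose first element equals the data minimum (the shift had no effect).
edges = np.linspace(arr.min(), arr.max(), 6)
bins_equal = np.searchsorted(edges, arr)
print(bins_equal)  # the minimum value 1000.0 gets index 0

# Same edges, with the first one lowered by a relative shift.
shifted = edges.copy()
shifted[0] -= max(1e-14 * abs(shifted[0]), sys.float_info.min)
bins_shifted = np.searchsorted(shifted, arr)
print(bins_shifted)  # the minimum value now gets index 1
```

With the default side="left", np.searchsorted returns 0 for any value less than or equal to the first edge, which is why the minimum is treated as underflow unless the first edge is strictly below it.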