NannyML / nannyml

nannyml: post-deployment data science in python
https://www.nannyml.com/
Apache License 2.0

Median stat calculation fails when a column has a single value #354

Closed nnansters closed 3 months ago

nnansters commented 7 months ago

Describe the bug

Median stat calculation fails when a column contains only a single distinct value. The failure occurs in the sampling error component calculation.

To Reproduce

import pandas as pd

import nannyml as nml

ref = pd.DataFrame({'y_true': [1 for _ in range(1000)]})

calc = nml.stats.median.SummaryStatsMedianCalculator(
    column_names=[
        "y_true"
    ],
    timestamp_column_name='timestamp',
    chunk_period='M',
)
calc.fit(ref)

This will raise the following exception:

nannyml.exceptions.CalculatorException: failed while fitting <nannyml.stats.median.calculator.SummaryStatsMedianCalculator object at 0x7f1ae9560f40>.
The data appears to lie in a lower-dimensional subspace of the space in which it is expressed. This has resulted in a singular data covariance matrix, which cannot be treated using the algorithms implemented in `gaussian_kde`. Consider performing principle component analysis / dimensionality reduction and using `gaussian_kde` with the transformed data.
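The failure can be reproduced without NannyML. The sketch below assumes only that SciPy's `gaussian_kde` is what rejects zero-variance input; the helper name `kde_fails_on` is hypothetical and not part of either library.

```python
import numpy as np
from scipy.stats import gaussian_kde


def kde_fails_on(values) -> bool:
    """Return True if gaussian_kde cannot be fitted to `values`."""
    try:
        gaussian_kde(values)
    except np.linalg.LinAlgError:
        # Zero variance yields a singular covariance matrix,
        # which gaussian_kde cannot factorize or invert.
        return True
    return False


constant = np.ones(1000)                             # same shape of data as the repro
varied = np.random.default_rng(0).normal(size=1000)  # ordinary non-constant data

print(kde_fails_on(constant))
print(kde_fails_on(varied))
```

If the assumption holds, only the constant array should be reported as failing, which points at the KDE step rather than at chunking or timestamps.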

Expected behavior

Sampling error should be 0.

Additional context

The issue is caused by the following snippet (in summary_stats.py):

    """
    Calculate sampling error components for Summary Stats Median
    using reference data.

    Parameters
    ----------
    col: pd.Series
        column for which we are calculating sampling error components

    Returns
    -------
    (median, pdf(median)): Tuple[np.ndarray]
    """
    median = col.median()
    kernel = gaussian_kde(col)
    fmedian = kernel.evaluate(median)[0]
    return (median, fmedian)
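
One possible guard is sketched below. It is not the project's actual fix. The function names are hypothetical. It assumes that the components feed the usual asymptotic standard error of the sample median, 1 / (2 * f(m) * sqrt(n)), so an infinite density at the median yields the expected sampling error of 0.

```python
import numpy as np
import pandas as pd
from scipy.stats import gaussian_kde


def median_sampling_error_components(col: pd.Series):
    """Hypothetical guarded variant of the snippet above."""
    median = col.median()
    # A column with a single distinct value has zero variance, so
    # gaussian_kde would raise. Treat the density at the median as
    # infinite instead of fitting a KDE.
    if col.nunique(dropna=True) <= 1:
        return (median, np.inf)
    kernel = gaussian_kde(col)
    fmedian = kernel.evaluate(median)[0]
    return (median, fmedian)


def median_standard_error(components, n: int) -> float:
    """Asymptotic standard error of the median (assumed consumer of the components)."""
    _, fmedian = components
    return 1.0 / (2.0 * fmedian * np.sqrt(n))


ref = pd.DataFrame({"y_true": [1 for _ in range(1000)]})
components = median_sampling_error_components(ref["y_true"])
print(median_standard_error(components, n=100))
```

Returning `np.inf` keeps the return shape unchanged for callers. Whether downstream code tolerates an infinite density is an assumption that would need checking against the real consumer of these components.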

stale[bot] commented 5 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
