Incorrect Threshold Computation for KolmogorovSmirnovStatistic When Handling NaN Values.

giodavoli commented 1 year ago

Describe the bug With the release of version 0.9.0, the KolmogorovSmirnovStatistic is now capable of handling features that contain all NaN values for certain chunks returning a nan value, which is a positive development. However, there seems to be an issue with the computation of thresholds for these chunks.

To Reproduce Below is a code snippet that demonstrates the problem:

import numpy as np
import pandas as pd
import nannyml as nml

df = pd.DataFrame({'date':['2023-01-01','2023-01-01','2023-01-02','2023-01-02','2023-01-03','2023-01-03'],'f1':[1,2,3,4,5,6],'f2':[7,8,np.nan,np.nan,9,10]})
univariate_calculator = nml.UnivariateDriftCalculator(
        column_names=['f1','f2'],
        timestamp_column_name='date',
        continuous_methods=['kolmogorov_smirnov'],
        chunk_period='1d'
    )
univariate_calculator.fit(reference_data=df)

output for univariate_calculator.result.data['f1'] output for univariate_calculator.result.data['f2'] You can observe that the upper_threshold is not correctly computed for the f2 feature.

Additional context I'm using NannyML version 0.9.0. While investigating this issue, I found that replacing np.std with np.nanstd in the threshold calculation for StandardDeviationThreshold may resolve the problem.

nnansters commented 1 year ago

Very good catch @giodavoli, I'll take a look.

Thank you for the example and fix suggestion!

nnansters commented 11 months ago

Hey @giodavoli , sorry for taking a while...

I've quickly looked into this. I think we'll have to do the same trick with the function that handles the aggregation. The offset_from function determines what the baseline is we then add and subtract a number of standard deviations to and from.

After replacing the default np.mean by np.nanmean in the __init__ I obtained the following result:

  kolmogorov_smirnov                                       
                   value       upper_threshold lower_threshold  alert
0                0.5             0.5            None  False
1                NaN             0.5            None  False
2                0.5             0.5            None  False

As expected?

Also: since you did all of the heavy lifting here you deserve the credit too. If you'd like you can contribute the fix yourself. I'd be happy to help out with the process if you're not familiar with contributing!

giodavoli commented 11 months ago

Yes, for sure. I just opened a pull request https://github.com/NannyML/nannyml/pull/333

NannyML / nannyml

Incorrect Threshold Computation for KolmogorovSmirnovStatistic When Handling NaN Values. #327