Closed: giodavoli closed this issue 11 months ago.
Very good catch @giodavoli, I'll take a look.
Thank you for the example and fix suggestion!
Hey @giodavoli , sorry for taking a while...
I've quickly looked into this. I think we'll have to apply the same trick to the function that handles the aggregation. The `offset_from` function determines the baseline, to which we then add, and from which we subtract, a number of standard deviations.
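To make the mechanics concrete, here is a minimal sketch of that kind of threshold computation. It is not NannyML's actual implementation; `std_threshold` and `offset_fn` are hypothetical names, with `offset_fn` standing in for the role of `offset_from` described above:

```python
import numpy as np

def std_threshold(values, multiplier=3, offset_fn=np.nanmean):
    # Hypothetical sketch: baseline ("offset") plus/minus a number of
    # standard deviations. NaN-aware aggregations skip NaN entries instead
    # of propagating them, so an all-NaN chunk does not poison the thresholds.
    baseline = offset_fn(values)
    spread = np.nanstd(values)
    return baseline - multiplier * spread, baseline + multiplier * spread
```

For example, `std_threshold(np.array([1.0, np.nan, 3.0]))` ignores the NaN and yields `(-1.0, 5.0)`.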
After replacing the default `np.mean` with `np.nanmean` in the `__init__`, I obtained the following result:
```
  kolmogorov_smirnov
               value  upper_threshold  lower_threshold  alert
0                0.5              0.5             None  False
1                NaN              0.5             None  False
2                0.5              0.5             None  False
```
As expected?
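For reference, the behaviour difference the `np.mean` → `np.nanmean` swap relies on can be checked in isolation (values here are illustrative, mirroring the per-chunk metric column above):

```python
import numpy as np

# One chunk's metric is NaN (e.g. an all-NaN feature chunk).
chunk_values = np.array([0.5, np.nan, 0.5])

# np.mean propagates the NaN, so the threshold baseline would be NaN too.
print(np.mean(chunk_values))     # nan

# np.nanmean ignores NaN entries, leaving a usable baseline.
print(np.nanmean(chunk_values))  # 0.5
```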
Also: since you did all of the heavy lifting here you deserve the credit too. If you'd like you can contribute the fix yourself. I'd be happy to help out with the process if you're not familiar with contributing!
Yes, for sure. I just opened a pull request https://github.com/NannyML/nannyml/pull/333
**Describe the bug**
With the release of version 0.9.0, the `KolmogorovSmirnovStatistic` is now capable of handling features that contain all NaN values for certain chunks, returning a NaN value, which is a positive development. However, there seems to be an issue with the computation of thresholds for these chunks.

**To Reproduce**
Below is a code snippet that demonstrates the problem:
Output for `univariate_calculator.result.data['f1']`:

Output for `univariate_calculator.result.data['f2']`:
You can observe that the `upper_threshold` is not correctly computed for the f2 feature.

**Additional context**
I'm using NannyML version 0.9.0. While investigating this issue, I found that replacing `np.std` with `np.nanstd` in the threshold calculation for `StandardDeviationThreshold` may resolve the problem.
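The effect of that suggested swap can be seen directly (illustrative values, mirroring a metric column where one chunk is NaN):

```python
import numpy as np

values = np.array([0.5, np.nan, 0.5])

# np.std propagates NaN, so any threshold derived from it becomes NaN.
print(np.std(values))     # nan

# np.nanstd ignores the NaN entry and computes the spread of the rest.
print(np.nanstd(values))  # 0.0
```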