NannyML / nannyml

nannyml: post-deployment data science in python
https://www.nannyml.com/
Apache License 2.0
1.97k stars 139 forks

change assumed `treat_as_categorical` #397

Closed Duncan-Hunter closed 4 months ago

Duncan-Hunter commented 5 months ago

When using JS distance with the UnivariateDriftCalculator on a continuous column with a small number of unique values, the library currently decides to treat it as categorical, which I can sort of understand. However, if the user knows that a feature is continuous and wants it treated as such, there is no option. The problem I'm seeing is that a small change in values in the analysis period leads to a large drift score, because these floats aren't equal to the reference "categories".


nikml commented 5 months ago

Hello Duncan,

Thank you for taking the time to report this issue. We made this treatment because, in our testing, some numerical features with low numbers of unique values yielded suboptimal results when used with the continuous univariate drift methods. We saw that even if a variable is strictly continuous, when the number of unique values actually present in it is low, the categorical univariate drift methods described the observed drift more accurately. However, as with many things in data, this is situational and could be suboptimal in other cases. From your description, it looks like you may have such a situation with your dataset.

I wonder if you could share more about your dataset and how you used it (for example, which drift method yielded large drift scores), or put together a synthetic, reproducible example. That would help us see whether our criterion for treating some variables as categorical could be updated to accommodate your use case, or whether it fails there completely. After that, it would be easier to consider how to update the library. I doubt, though, that we would want to completely remove the current behavior as you suggest in #398.

Duncan-Hunter commented 5 months ago

Hi, thanks for getting back to me.

That's a good reason for doing it, and yes, the user should consider using a categorical method. In this case (a small number of unique values), does JS as a categorical method work well enough? Should the user be informed that other methods might be more appropriate?

I 100% think that there should be at least a warning during fitting that the feature is being treated as categorical.
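For what it's worth, here is a rough sketch of the kind of fit-time warning I have in mind. The helper name and the unique-value threshold are made up for illustration; they are not NannyML's actual API or criterion:

```python
import warnings

# Hypothetical cutoff; NannyML's real criterion may differ.
CATEGORICAL_FALLBACK_THRESHOLD = 10

def maybe_warn_treat_as_categorical(column_name, values):
    """Warn when a continuous column would be handled with categorical methods."""
    n_unique = len(set(values))
    if n_unique < CATEGORICAL_FALLBACK_THRESHOLD:
        warnings.warn(
            f"Column '{column_name}' has only {n_unique} unique values and will be "
            "treated as categorical; consider a categorical drift method or "
            "binning/rounding the column explicitly."
        )
        return True
    return False
```

That way the behavior stays, but the user at least knows it kicked in.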

A potential fix in this scenario is to round the incoming floats to their closest reference bin value. It's not ideal, but it can be done.
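Something like this is what I mean by rounding to the closest bin value. This is a standalone sketch with a hypothetical helper, not NannyML code:

```python
import numpy as np

def snap_to_reference(analysis_values, reference_values):
    """Snap each analysis value to the nearest unique value seen in reference.

    Hypothetical helper for illustration, not part of NannyML.
    """
    bins = np.sort(np.unique(reference_values))
    # Index of the first bin >= each value, clipped so idx-1 and idx are valid.
    idx = np.clip(np.searchsorted(bins, analysis_values), 1, len(bins) - 1)
    left, right = bins[idx - 1], bins[idx]
    # Step back one bin where the left neighbor is strictly closer.
    idx -= analysis_values - left < right - analysis_values
    return bins[idx]

print(snap_to_reference(np.array([5.01, 6.01, 7.01]), np.array([5.0, 6.0, 7.0])))
# → [5. 6. 7.]
```

With the shifted values snapped back onto the reference grid, the categorical comparison would see matching categories again.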

Here's a use case where I run into problems. I have a reference dataset with a small number of floating point values, and for the sake of argument, they've all been shifted by a tiny amount in analysis. The drift calculator then returns 1 for every chunk despite the change being very small. The change can be even smaller, of course, and still yield this result.

from nannyml.drift import UnivariateDriftCalculator
import numpy as np
import pandas as pd

# Reference: 10,000 draws from {5, 6, 7}, cast to float
reference_data = pd.DataFrame(data={
    "x": np.random.randint(low=5, high=8, size=10_000)})
reference_data["x"] = reference_data["x"].astype(float)

# Analysis: the same values shifted by 0.01, so none match the reference exactly
analysis_data = pd.DataFrame(data={
    "x": np.random.randint(low=5, high=8, size=6_000)})
analysis_data["x"] = np.clip(analysis_data["x"].astype(float) + 0.01, a_min=5, a_max=8)

calculator = UnivariateDriftCalculator(
    column_names=["x"],
    continuous_methods=['jensen_shannon'],
    chunk_size=1_000
)
calculator = calculator.fit(reference_data)
results = calculator.calculate(analysis_data)
print("Continuous column names: ", calculator.continuous_column_names)
print(calculator._column_to_models_mapping['x'][0]._treat_as_type)
results.filter(period='analysis').to_df(multilevel=True)

Output:

Continuous column names:  ['x']
cat
[screenshot: results dataframe showing a Jensen-Shannon drift value of 1 for every analysis chunk]

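To make the saturation concrete: once the two samples share no exact values, a categorical JS distance maxes out no matter how small the numeric shift is. Here's a self-contained illustration using a plain-Python JS distance (my own implementation for the example, not NannyML's):

```python
import math
from collections import Counter

def js_distance(a, b):
    """Jensen-Shannon distance (base-2) between two empirical categorical distributions."""
    ca, cb = Counter(a), Counter(b)
    na, nb = len(a), len(b)
    divergence = 0.0
    for value in set(ca) | set(cb):
        p, q = ca[value] / na, cb[value] / nb
        m = (p + q) / 2  # mixture distribution
        if p:
            divergence += 0.5 * p * math.log2(p / m)
        if q:
            divergence += 0.5 * q * math.log2(q / m)
    return math.sqrt(divergence)

reference = [5.0, 6.0, 7.0] * 100
analysis = [5.01, 6.01, 7.01] * 100  # shifted by 0.01: no exact matches

print(js_distance(reference, [5.0, 6.0, 7.0] * 50))  # 0.0 -- identical "categories"
print(js_distance(reference, analysis))              # ~1.0 -- disjoint supports saturate JS
```

So treating the floats as categories turns a 0.01 shift into maximal drift.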
nnansters commented 5 months ago

That's an interesting example, we'll take a peek into that.

Duncan-Hunter commented 4 months ago

#404