NannyML / nannyml

nannyml: post-deployment data science in python
https://www.nannyml.com/
Apache License 2.0
1.97k stars 139 forks

change assumed `treat_as_categorical` #397

Closed Duncan-Hunter closed 4 months ago

Duncan-Hunter commented 5 months ago

When using JS distance with the UnivariateDriftCalculator on a continuous column with a small number of unique values, the library currently decides to treat it as categorical, which I can sort of understand. However, if the user knows that a feature is continuous and wants it treated as such, there is no option. The problem I'm seeing is that a small change in values in the analysis period leads to a large drift score, because these floats aren't equal to the reference "categories".


nikml commented 5 months ago

Hello Duncan,

Thank you for taking the time to report this issue. We made this treatment because, in our testing, some numerical features with low numbers of unique values yielded suboptimal results when used with the continuous univariate drift methods. We saw that even if a variable is strictly continuous, when the number of unique values actually present in it is low, the categorical univariate drift methods described the observed drift more accurately. However, as with many things in data, this is situational and could be suboptimal in other cases. From your description, it looks like you may have such a situation with your dataset.

I wonder if you could share more about your dataset and how you used it (for example, which drift method yielded large drift scores), or put together a synthetic, reproducible example. That would help us see whether our criterion for treating some variables as categorical could be updated to accommodate your use case, or whether it fails there completely. After that, it would be easier to consider how to update the library. I doubt, though, that we would want to completely remove the current behavior as you suggest in #398.

Duncan-Hunter commented 5 months ago

Hi, thanks for getting back to me.

That's a good reason for doing it, and yes, the user should consider using a categorical method. In this case (a small number of unique values), does JS as a categorical method work well enough? Should the user be informed that other methods might be more appropriate?

I 100% think that there should be at least a warning during fitting that the feature is being treated as categorical.
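For what it's worth, here is a rough sketch of the kind of fit-time warning I have in mind. The helper name and the unique-value threshold are made up for illustration; they are not NannyML's actual API or criterion:

```python
import warnings

# Hypothetical cutoff; NannyML's real criterion may differ.
CATEGORICAL_FALLBACK_THRESHOLD = 10

def maybe_warn_treat_as_categorical(column_name, values):
    """Warn when a continuous column would be handled with categorical methods."""
    n_unique = len(set(values))
    if n_unique < CATEGORICAL_FALLBACK_THRESHOLD:
        warnings.warn(
            f"Column '{column_name}' has only {n_unique} unique values and will be "
            "treated as categorical; consider a categorical drift method or "
            "binning/rounding the column explicitly."
        )
        return True
    return False
```

That way the behavior stays, but the user at least knows it kicked in.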

A potential fix in this scenario is to round the incoming floats to their closest reference bin value. It's not ideal, but it can be done.
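Something like this is what I mean by rounding to the closest bin value. This is a standalone sketch with a hypothetical helper, not NannyML code:

```python
import numpy as np

def snap_to_reference(analysis_values, reference_values):
    """Snap each analysis value to the nearest unique value seen in reference.

    Hypothetical helper for illustration, not part of NannyML.
    """
    bins = np.sort(np.unique(reference_values))
    # Index of the first bin >= each value, clipped so idx-1 and idx are valid.
    idx = np.clip(np.searchsorted(bins, analysis_values), 1, len(bins) - 1)
    left, right = bins[idx - 1], bins[idx]
    # Step back one bin where the left neighbor is strictly closer.
    idx -= analysis_values - left < right - analysis_values
    return bins[idx]

print(snap_to_reference(np.array([5.01, 6.01, 7.01]), np.array([5.0, 6.0, 7.0])))
# → [5. 6. 7.]
```

With the shifted values snapped back onto the reference grid, the categorical comparison would see matching categories again.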

Here's a use case where I run into problems. I have a reference dataset with a small number of floating point values, and for the sake of argument, they've all been shifted by a tiny amount in analysis. The drift calculator then returns 1 for every chunk despite the change being very small. The change can be even smaller, of course, and still yield this result.

from nannyml.drift import UnivariateDriftCalculator
import numpy as np
import pandas as pd

# Reference: 10,000 draws from {5, 6, 7}, cast to float
reference_data = pd.DataFrame(data={
    "x": np.random.randint(low=5, high=8, size=10_000)})
reference_data["x"] = reference_data["x"].astype(float)

# Analysis: the same values shifted by 0.01, so none match the reference exactly
analysis_data = pd.DataFrame(data={
    "x": np.random.randint(low=5, high=8, size=6_000)})
analysis_data["x"] = np.clip(analysis_data["x"].astype(float) + 0.01, a_min=5, a_max=8)

calculator = UnivariateDriftCalculator(
    column_names=["x"],
    continuous_methods=['jensen_shannon'],
    chunk_size=1_000
)
calculator = calculator.fit(reference_data)
results = calculator.calculate(analysis_data)
print("Continuous column names: ", calculator.continuous_column_names)
print(calculator._column_to_models_mapping['x'][0]._treat_as_type)
results.filter(period='analysis').to_df(multilevel=True)

Output:

Continuous column names:  ['x']
cat
[screenshot: results dataframe showing a Jensen-Shannon drift value of 1 for every analysis chunk]

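To make the saturation concrete: once the two samples share no exact values, a categorical JS distance maxes out no matter how small the numeric shift is. Here's a self-contained illustration using a plain-Python JS distance (my own implementation for the example, not NannyML's):

```python
import math
from collections import Counter

def js_distance(a, b):
    """Jensen-Shannon distance (base-2) between two empirical categorical distributions."""
    ca, cb = Counter(a), Counter(b)
    na, nb = len(a), len(b)
    divergence = 0.0
    for value in set(ca) | set(cb):
        p, q = ca[value] / na, cb[value] / nb
        m = (p + q) / 2  # mixture distribution
        if p:
            divergence += 0.5 * p * math.log2(p / m)
        if q:
            divergence += 0.5 * q * math.log2(q / m)
    return math.sqrt(divergence)

reference = [5.0, 6.0, 7.0] * 100
analysis = [5.01, 6.01, 7.01] * 100  # shifted by 0.01: no exact matches

print(js_distance(reference, [5.0, 6.0, 7.0] * 50))  # 0.0 -- identical "categories"
print(js_distance(reference, analysis))              # ~1.0 -- disjoint supports saturate JS
```

So treating the floats as categories turns a 0.01 shift into maximal drift.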
nnansters commented 5 months ago

That's an interesting example, we'll take a peek into that.

Duncan-Hunter commented 4 months ago

#404