NannyML / nannyml

nannyml: post-deployment data science in python
https://www.nannyml.com/
Apache License 2.0
1.97k stars 140 forks source link

propose solution to remove hidden feature type conversion in Univariate Drift #398

Closed Duncan-Hunter closed 4 months ago

Duncan-Hunter commented 5 months ago

Converting a feature that is continuous to a categorical feature, even if there are few unique values, may lead to problems. I think it's better to not do this conversion without a warning, and better to not do it at all.

If a user wants a continuous column with few unique values, they can set it as a categorical column. Happy to discuss it, work through it, be convinced I'm wrong etc.

Related to #397 .

nnansters commented 5 months ago

Hey Duncan, quickly wanted to chime in here.

I agree that having this conversion happen more explicitly would be better. We're always walking a fine line between making stuff easy and making it explicit. This should be explicit.

One thing holding us back here however is our current implementation that runs this single calculator on a whole set of columns, for a whole list of univariate drift methods. The changes you proposed definitely work on the Method level, but we don't expose that interface directly. Method instance are created within the calculator. So the calculator would have to take in a parameter stating for each column if it is continuous or categorical. As we might deal with potentially hundreds of columns, this might get cumbersome.

We would like to split everything up so you can just create and use Method instances directly. That way it would be a lot easier to specify what columns should be calculated by each. This change is linked to a big refactor we're designing, but we don't have a timeline for it yet.

Maybe for now we can make the conversion a bit more explicit, or at least not hidden away, by issuing a warning, as you did in your PR?

What do you think?

Duncan-Hunter commented 5 months ago

I like the idea of opening up the Method instance. If a method is created that isn't possible / works poorly, a warning or error should be raised. The way I think I would want it as a user is for there to be a few levels of granularity.

"So the calculator would have to take in a parameter stating for each column if it is continuous or categorical." I think the solution here is to have a good way of finding a default dtype + method without user input, which you're already most of the way there.

For now, a warning is fine. It'd also be nice if the UnivariateDriftCalculator could update it's continuous/categorical columns post fitting, so iterate over columns, if its JS or Hellinger, check the _treat_as_type attribute and update accordingly. I'm on holiday from tomorrow and all of next week, so I might not update this PR soon.

In my own work now, I'll implement a check to see if this behaviour is there, then just round the floats to their nearest bin.

nnansters commented 4 months ago

Hey @Duncan-Hunter , opened up a "replacement" PR for this. If you're good with it, we can close this one.