Closed Duncan-Hunter closed 4 months ago
Hey Duncan, quickly wanted to chime in here.
I agree that having this conversion happen more explicitly would be better. We're always walking a fine line between making stuff easy and making it explicit. This should be explicit.
One thing holding us back here however is our current implementation that runs this single calculator on a whole set of columns, for a whole list of univariate drift methods. The changes you proposed definitely work on the Method
level, but we don't expose that interface directly. Method
instance are created within the calculator. So the calculator would have to take in a parameter stating for each column if it is continuous or categorical. As we might deal with potentially hundreds of columns, this might get cumbersome.
We would like to split everything up so you can just create and use Method
instances directly. That way it would be a lot easier to specify what columns should be calculated by each. This change is linked to a big refactor we're designing, but we don't have a timeline for it yet.
Maybe for now we can make the conversion a bit more explicit, or at least not hidden away, by issuing a warning, as you did in your PR?
What do you think?
I like the idea of opening up the Method
instance. If a method is created that isn't possible / works poorly, a warning or error should be raised. The way I think I would want it as a user is for there to be a few levels of granularity.
UnivariateDriftCalculator
, which you can fit on your dataset without specifying data types or methods. It figures out which datatypes you have (which you already do to an extent), and if it should be categorical or continuous, then applies the default method.Method
would give warnings/raise errors if the user has done something they shouldn't. If the user hasn't specified a column, it's detected automatically.{str: List[Method]}
that they've created. If they don't provide a method for a column, it's detected automatically."So the calculator would have to take in a parameter stating for each column if it is continuous or categorical." I think the solution here is to have a good way of finding a default dtype + method without user input, which you're already most of the way there.
For now, a warning is fine. It'd also be nice if the UnivariateDriftCalculator
could update it's continuous/categorical columns post fitting, so iterate over columns, if its JS or Hellinger, check the _treat_as_type
attribute and update accordingly. I'm on holiday from tomorrow and all of next week, so I might not update this PR soon.
In my own work now, I'll implement a check to see if this behaviour is there, then just round the floats to their nearest bin.
Hey @Duncan-Hunter , opened up a "replacement" PR for this. If you're good with it, we can close this one.
Converting a feature that is continuous to a categorical feature, even if there are few unique values, may lead to problems. I think it's better to not do this conversion without a warning, and better to not do it at all.
If a user wants a continuous column with few unique values, they can set it as a categorical column. Happy to discuss it, work through it, be convinced I'm wrong etc.
Related to #397 .