Closed nnansters closed 4 months ago
Attention: Patch coverage is 96.00000%
with 4 lines
in your changes missing coverage. Please review.
Project coverage is 77.01%. Comparing base (
8bffbcd
) to head (336a056
).
Files | Patch % | Lines |
---|---|---|
nannyml/drift/univariate/methods.py | 94.59% | 2 Missing and 2 partials :warning: |
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
Thanks for this! Yes that looks like a better way of doing things. Happy to close the other PR.
Thanks for this! Yes that looks like a better way of doing things. Happy to close the other PR.
Thanks once again for bringing this to our attention, and putting in the effort. Much appreciated!
This PR addresses the implicit data conversion occurring in the
hellinger
andjensen_shannon
univariate drift methods. Even when a column is explicitly marked as categorical, these methods will still use thedtype
to heuristically decide on treating them as continuous or categorical, thereby completely ignoring any explicitly given information.In this PR I've split up the implementation of the
JensenShannonDistance
andHellingerDistance
classes into two separate classes, one for continuous features and one for categorical features. They are then both added to the registry using the same name. This ensures that "routing" will still just work. I.e. the namejensen_shannon
orhellinger
will yield one of these implementations, depending on the providedFeatureType
.By splitting these implementations we can remove all of the logic related to determining the feature type, as it is now just a given. To be honest, this is how the implementation should've been from the beginning. Combining continuous and categorical calculation into a single class was a bad idea.
These Method instances are being created in the
UnivariateDriftCalculator
during fitting. For continuous features, the continuous version of the Method is instantiated and the same goes for categorical features. To allow further control over this, I've added atreat_as_continuous
parameter to theUnivariateDriftCalculator
. It allows you to pass a list of columns that should always be treated as continuous, similar to the already existingtreat_as_categorical
parameter.So the entire flow is now as follows:
treat_as_continuous
ortreat_as_categorical
parameters.MethodFactory.create
method.The logic for splitting up the column names into a list of continuous and categorical ones has grown even larger, so I've refactored that into a separate helper method.
This should aid in some of the points raised in #398. What do you think @Duncan-Hunter?