NannyML / nannyml

nannyml: post-deployment data science in python
https://www.nannyml.com/
Apache License 2.0
1.97k stars 139 forks source link

Explicit data type conversion for univariate drift #404

Closed nnansters closed 4 months ago

nnansters commented 4 months ago

This PR addresses the implicit data conversion occurring in the hellinger and jensen_shannon univariate drift methods. Even when a column is explicitly marked as categorical, these methods will still use the dtype to heuristically decide on treating them as continuous or categorical, thereby completely ignoring any explicitly given information.

In this PR I've split up the implementation of the JensenShannonDistance and HellingerDistance classes into two separate classes, one for continuous features and one for categorical features. They are then both added to the registry using the same name. This ensures that "routing" will still just work. I.e. the name jensen_shannon or hellinger will yield one of these implementations, depending on the provided FeatureType.

@MethodFactory.register(key='jensen_shannon', feature_type=FeatureType.CONTINUOUS)
class ContinuousJensenShannonDistance(Method):

By splitting these implementations we can remove all of the logic related to determining the feature type, as it is now just a given. To be honest, this is how the implementation should've been from the beginning. Combining continuous and categorical calculation into a single class was a bad idea.

These Method instances are being created in the UnivariateDriftCalculator during fitting. For continuous features, the continuous version of the Method is instantiated and the same goes for categorical features. To allow further control over this, I've added a treat_as_continuous parameter to the UnivariateDriftCalculator. It allows you to pass a list of columns that should always be treated as continuous, similar to the already existing treat_as_categorical parameter.

So the entire flow is now as follows:

  1. Create a calculator, explicitly marking some columns as continuous and some as categorical using the treat_as_continuous or treat_as_categorical parameters.
  2. During fitting (when we have data available), we'll heuristically determine the feature type of any features that were not explicitly set.
  3. We then create the lists of drift methods, by instantiating them using the MethodFactory.create method.

The logic for splitting up the column names into a list of continuous and categorical ones has grown even larger, so I've refactored that into a separate helper method.

This should aid in some of the points raised in #398. What do you think @Duncan-Hunter?

codecov[bot] commented 4 months ago

Codecov Report

Attention: Patch coverage is 96.00000% with 4 lines in your changes missing coverage. Please review.

Project coverage is 77.01%. Comparing base (8bffbcd) to head (336a056).

Files Patch % Lines
nannyml/drift/univariate/methods.py 94.59% 2 Missing and 2 partials :warning:
Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #404 +/- ## ========================================== + Coverage 76.98% 77.01% +0.03% ========================================== Files 108 108 Lines 9267 9297 +30 Branches 1656 1652 -4 ========================================== + Hits 7134 7160 +26 - Misses 1674 1676 +2 - Partials 459 461 +2 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

Duncan-Hunter commented 4 months ago

Thanks for this! Yes that looks like a better way of doing things. Happy to close the other PR.

nnansters commented 4 months ago

Thanks for this! Yes that looks like a better way of doing things. Happy to close the other PR.

Thanks once again for bringing this to our attention, and putting in the effort. Much appreciated!