NannyML / nannyml

nannyml: post-deployment data science in python
https://www.nannyml.com/
Apache License 2.0

No support for some Pandas Extension Dtypes #399

Open Duncan-Hunter opened 3 months ago

Duncan-Hunter commented 3 months ago

Describe the bug: Pandas has extension dtypes. When you fit a Univariate calculator, or presumably anything else that checks dtypes using _split_features_by_type, columns with these dtypes are dropped because Int64 is not in:

[
        'int_',
        'int8',
        'int16',
        'int32',
        'int64',
        'uint8',
        'uint16',
        'uint32',
        'uint64',
        'float_',
        'float16',
        'float32',
        'float64',
    ]

To Reproduce: using an environment with nannyml=0.10.7

import numpy as np
import pandas as pd

num_dtypes = [
    'int_',
    'int8',
    'int16',
    'int32',
    'int64',
    'uint8',
    'uint16',
    'uint32',
    'uint64',
    'float_',
    'float16',
    'float32',
    'float64',
    ]

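# A series using the pandas nullable-integer extension dtype
test = pd.Series([1, 2, 3, 4, 5], dtype='Int64')

# 'Int64' (capital I) does not match any of the numpy dtype names above
print("In num_dtypes: ", test.dtype in num_dtypes)
print("in ['Int64']: ", test.dtype in ['Int64'])
print("dtype: ", test.dtype)

# Cast to the underlying numpy scalar type (np.int64 here)
test = test.astype(test.dtype.type)

print("new dtype: ", test.dtype)
print("In num_dtypes: ", test.dtype in num_dtypes)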

Output:

In num_dtypes:  False
in ['Int64']:  True
dtype:  Int64
new dtype:  int64
In num_dtypes:  True

Expected behavior: There should be support for these dtypes, and columns shouldn't be dropped without the user knowing.

Additional context: I'm going to work around the issue by converting my columns to their underlying numpy types using pd.Series.dtype.type. But for a fix, I think you should use np.issubdtype(dtype.type, np.number).
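A minimal sketch of what the suggested check could look like, assuming a per-column helper. The name `is_numeric_column` and the sample DataFrame are hypothetical and not part of nannyml; only `np.issubdtype` and `dtype.type` come from the suggestion above.

```python
import numpy as np
import pandas as pd


def is_numeric_column(series: pd.Series) -> bool:
    """Hypothetical helper (not nannyml code): True if the column's
    underlying scalar type is a numpy numeric type."""
    # For extension dtypes such as Int64, dtype.type is the numpy scalar
    # type (np.int64), so nullable and plain numpy columns are treated alike.
    return np.issubdtype(series.dtype.type, np.number)


df = pd.DataFrame({
    "nullable_int": pd.Series([1, 2, None], dtype="Int64"),
    "numpy_int": pd.Series([1, 2, 3], dtype="int64"),
    "nullable_float": pd.Series([1.5, None, 3.0], dtype="Float64"),
    "text": pd.Series(["a", "b", "c"], dtype="object"),
})

# Collect the columns that pass the numeric check
numeric_cols = [col for col in df.columns if is_numeric_column(df[col])]
print(numeric_cols)
```

Unlike the astype workaround, this only inspects the dtype and leaves the data, including missing values, untouched.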

nnansters commented 3 months ago

Hey @Duncan-Hunter,

good catch, good suggestion. I'll take a look into the np.issubdtype function for a cleaner solution.

Worst case scenario we can always add the extension dtypes to the list above.
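A rough sketch of that fallback, assuming the check is done on dtype names. The constants and the helper `has_listed_numeric_dtype` are hypothetical and not nannyml code. It compares `str(dtype)` against the names instead of comparing the dtype object with each string, and leaves out the 'int_' and 'float_' aliases because `str(dtype)` never returns them.

```python
import pandas as pd

# numpy dtype names from the list in the issue (aliases omitted)
NUMPY_NUMERIC_NAMES = [
    'int8', 'int16', 'int32', 'int64',
    'uint8', 'uint16', 'uint32', 'uint64',
    'float16', 'float32', 'float64',
]

# pandas nullable extension dtype names (capitalised)
EXTENSION_NUMERIC_NAMES = [
    'Int8', 'Int16', 'Int32', 'Int64',
    'UInt8', 'UInt16', 'UInt32', 'UInt64',
    'Float32', 'Float64',
]

ALL_NUMERIC_NAMES = set(NUMPY_NUMERIC_NAMES) | set(EXTENSION_NUMERIC_NAMES)


def has_listed_numeric_dtype(series: pd.Series) -> bool:
    """Hypothetical helper: True if the column's dtype name is listed."""
    return str(series.dtype) in ALL_NUMERIC_NAMES


nullable = pd.Series([1, 2, None], dtype="Int64")
plain = pd.Series([1.0, 2.0], dtype="float32")
labels = pd.Series(["a", "b"], dtype="object")

print(has_listed_numeric_dtype(nullable))
print(has_listed_numeric_dtype(plain))
print(has_listed_numeric_dtype(labels))
```

The drawback of this approach is that the list has to be kept in sync by hand whenever pandas adds a numeric extension dtype.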

stale[bot] commented 1 month ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.