alteryx / woodwork

Woodwork is a Python library that provides robust methods for managing and communicating data typing information.
https://woodwork.alteryx.com
BSD 3-Clause "New" or "Revised" License

Series with PostalCode logical type can have `float` or `str` elements. #1577

Open sbadithe opened 1 year ago

sbadithe commented 1 year ago

A Series with the PostalCode logical type can end up holding either numeric (`int` or `float`) or `str` elements.

For example,

import pandas as pd
import woodwork as ww

ser = pd.Series([12345, 67890]).astype('category')
ser = ww.init_series(ser, logical_type='PostalCode')

In the above code block, the elements of the series are numeric (integers in this example), but in the following, they are strings:

ser = pd.Series(["12345", "67890"]).astype('category')
ser = ww.init_series(ser, logical_type='PostalCode')

Both are valid initializations. We should decide whether we want to support both data types for the PostalCode logical type.
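To illustrate why both initializations pass, here is a minimal pandas-only sketch (it does not call Woodwork). It assumes the relevant detail is that the `category` dtype looks identical for both series, while the dtype of the underlying categories differs. The variable names are illustrative only.

```python
import pandas as pd

# Two categorical series built from the same postal codes,
# once as integers and once as strings.
int_cat = pd.Series([12345, 67890]).astype("category")
str_cat = pd.Series(["12345", "67890"]).astype("category")

# The series-level dtype is "category" in both cases...
print(str(int_cat.dtype), str(str_cat.dtype))

# ...but the categories themselves keep the original element type.
print(int_cat.cat.categories.dtype)
print(str_cat.cat.categories.dtype)

# Element access returns whatever type the categories hold.
print(type(int_cat.iloc[0]), type(str_cat.iloc[0]))
```

A check that only looks at the series-level dtype cannot tell these two inputs apart, which is consistent with the behavior described above.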

This issue was discussed in https://github.com/alteryx/featuretools/pull/2365.

thehomebrewnerd commented 1 year ago

Just to add a little more: I think part of the inconsistent, confusing behavior comes from how non-categorical input is handled. If you take a series of numeric values that does not have the category dtype and initialize it with the PostalCode logical type, the numeric values get converted to strings:

>>> ser = pd.Series([12345, 67890])
>>> ser = ww.init_series(ser, logical_type='PostalCode')
>>> type(ser[0])
<class 'str'>

But if you start with the same values and cast the series to the category dtype before Woodwork initialization, you end up with numeric values instead of strings:

>>> ser = pd.Series([12345, 67890]).astype("category")
>>> ser = ww.init_series(ser, logical_type='PostalCode')
>>> type(ser[0])
<class 'numpy.int64'>

I believe Woodwork should provide consistent output in this case, so that no matter the input dtype, the elements have the same type after Woodwork initialization.
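One possible interim workaround, sketched with pandas only, is to normalize the elements to strings before handing the series to Woodwork. The helper name `postal_codes_as_str_category` is hypothetical and not part of Woodwork. The sketch assumes string elements in a `category` series are the desired target, and it does not handle missing values or float inputs such as `12345.0`.

```python
import pandas as pd


def postal_codes_as_str_category(ser: pd.Series) -> pd.Series:
    """Hypothetical helper: return a category series whose elements are str."""
    if isinstance(ser.dtype, pd.CategoricalDtype):
        # Convert each existing category label to str; codes stay unchanged.
        return ser.cat.rename_categories(str)
    # Non-categorical input: stringify the values, then cast to category.
    return ser.astype(str).astype("category")


numeric = pd.Series([12345, 67890])
categorical = numeric.astype("category")

a = postal_codes_as_str_category(numeric)
b = postal_codes_as_str_category(categorical)

# Both inputs now yield str elements.
print(type(a.iloc[0]), type(b.iloc[0]))
```

Passing the result to `ww.init_series(..., logical_type='PostalCode')` should then start from the same element type in both cases, though whether Woodwork should do this conversion itself is the open question in this issue.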