alteryx / autonormalize

python library for automated dataset normalization
https://blog.featurelabs.com/automatic-dataset-normalization-for-feature-engineering-in-python/
BSD 3-Clause "New" or "Revised" License

Variable types not preserved after call to normalize_entity() #10

Open j-grover opened 4 years ago

j-grover commented 4 years ago

Reproducible example:

import pandas as pd
import featuretools as ft

from featuretools.variable_types import IPAddress
from autonormalize import autonormalize as an

input_df = pd.DataFrame(
    {
        'ip_address': ['128.101.101.101', '1.120.0.0', '17.86.21.0', '23.1.23.255'],
        'length': [900, 60, 20, 30],
        'city': ['adl', 'syd', 'adl', 'syd'],
        'country': ['aus', 'aus', 'aus', 'aus'],
        'is_threat': [True, False, False, False]
    }
)

variable_types = {'ip_address': IPAddress}

es = ft.EntitySet()
es.entity_from_dataframe(entity_id='data',
                         dataframe=input_df,
                         index='index',
                         variable_types=variable_types,
                         make_index=True)

Column ip_address is set to dtype featuretools.variable_types.IPAddress:

print(es['data'].variables)

[<Variable: index (dtype = index)>, 
<Variable: length (dtype = numeric)>, 
<Variable: city (dtype = categorical)>, 
<Variable: country (dtype = categorical)>, 
<Variable: is_threat (dtype = boolean)>, 
<Variable: ip_address (dtype = ip)>]

After normalisation, ip_address resolves back to categorical:

normalized_es = an.normalize_entity(es)

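for entity in normalized_es.entity_dict:
    # Print the entity id so the output below shows which entity each list belongs to
    print(f'Entity: {entity}')
    print(normalized_es.entity_dict[entity].variables)

Entity: index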
[<Variable: index (dtype = index)>, 
<Variable: length (dtype = numeric)>, 
<Variable: city (dtype = id)>, 
<Variable: is_threat (dtype = boolean)>, 
<Variable: ip_address (dtype = categorical)>]
Entity: city
[<Variable: city (dtype = index)>, <Variable: country (dtype = categorical)>]

To get the desired features, the variable types need to be preserved so that the right primitives are applied when running DFS. Is this the intended behaviour, or do the variable types need to be set manually again after normalisation?
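If the types do have to be set again by hand, the bookkeeping could look roughly like the sketch below. It only works out which of the original custom types belong to which normalized entity. `split_variable_types` is a hypothetical helper (not part of autonormalize or Featuretools), and plain strings stand in for variable type classes such as IPAddress so the snippet runs without either library.

```python
# Hypothetical helper, not part of autonormalize or Featuretools.
def split_variable_types(variable_types, entity_columns):
    """Map each normalized entity to the custom types of the columns it holds.

    variable_types: dict of column name -> variable type set on the original entity.
    entity_columns: dict of entity id -> list of column names after normalization.
    """
    per_entity = {}
    for entity_id, columns in entity_columns.items():
        # Keep only the custom types whose column ended up in this entity.
        types = {col: vtype for col, vtype in variable_types.items() if col in columns}
        if types:
            per_entity[entity_id] = types
    return per_entity


# Strings stand in for classes such as featuretools.variable_types.IPAddress.
original_types = {"ip_address": "IPAddress"}

# Column layout taken from the normalized output shown above.
normalized_columns = {
    "index": ["index", "length", "city", "is_threat", "ip_address"],
    "city": ["city", "country"],
}

types_to_restore = split_variable_types(original_types, normalized_columns)
print(types_to_restore)
```

With a Featuretools 0.x EntitySet, each entry could then presumably be re-applied with something like normalized_es[entity_id].convert_variable_type(col, vtype); whether that method is available depends on the installed version.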

kmax12 commented 4 years ago

@j-grover is this an issue with autonormalize or Featuretools? If Featuretools, please post it as an issue in that repo: https://github.com/featuretools/featuretools/

j-grover commented 4 years ago

@kmax12

For reference: autonormalize.py

The normalization of an EntitySet follows this call graph: normalize_entity -> auto_entityset -> make_entityset

As I understand it, the variable types are not carried forward from normalize_entity to auto_entityset, so when the entities are created in make_entityset, no variable types are available:

if time_index in current.df.columns:
    entities[current.index[0]] = (current.df, current.index[0], time_index)
else:
    entities[current.index[0]] = (current.df, current.index[0])

Entities definition:

"""
entities (dict[str -> tuple(pd.DataFrame, str, str)]): Dictionary of
                    entities. Entries take the format
                    {entity id -> (dataframe, id column, (time_column), (variable_types))}.
                    Note that time_column and variable_types are optional.
"""
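One possible shape for a fix, following the tuple format in the docstring above, is to always build the four-slot tuple and fill the last slot with the types of the columns that ended up in each table. This is only a sketch under assumptions: `entity_tuple` is a hypothetical helper, `SimpleNamespace` objects stand in for the pandas DataFrames, strings stand in for variable type classes, and it assumes a `None` time column is accepted in the third slot.

```python
from types import SimpleNamespace


# Hypothetical helper, not the actual autonormalize implementation.
def entity_tuple(df, index, time_index=None, variable_types=None):
    """Build (dataframe, id column, time_column, variable_types) for one entity."""
    # Keep only the types of columns that are present in this table.
    kept = {
        col: vtype
        for col, vtype in (variable_types or {}).items()
        if col in df.columns
    }
    # Drop the time column if this table does not contain it.
    time_column = time_index if time_index in df.columns else None
    return (df, index, time_column, kept or None)


# Stand-ins for the two normalized tables from the example above.
data_table = SimpleNamespace(columns=["index", "length", "city", "is_threat", "ip_address"])
city_table = SimpleNamespace(columns=["city", "country"])

# A string stands in for featuretools.variable_types.IPAddress.
variable_types = {"ip_address": "IPAddress"}

entities = {
    "index": entity_tuple(data_table, "index", variable_types=variable_types),
    "city": entity_tuple(city_table, "city", variable_types=variable_types),
}

for entity_id, spec in entities.items():
    print(entity_id, spec[1:])
```

Filtering per table matters because a column such as ip_address exists in only one of the normalized entities; passing the full mapping to every entity would reference columns that table does not have.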

kmax12 commented 4 years ago

@j-grover thanks for clarification. I see the issue now.

you're right that we aren't carrying the variable types through. would you be interested in submitting a PR that does that?

j-grover commented 4 years ago

Yeah sure, I'll give it a go.

j-grover commented 4 years ago

@kmax12 I have a branch ready, but I believe I do not have push access.

kmax12 commented 4 years ago

@j-grover can you create a fork to make the pull request?

j-grover commented 4 years ago

Thanks, created PR.