alteryx / open_source_demos

A collection of demos showcasing automated feature engineering and machine learning in diverse use cases
BSD 3-Clause "New" or "Revised" License

create features on one dataset #11

Open billy-odera opened 5 years ago

billy-odera commented 5 years ago

I have tried to create automated features using only one dataset, but it doesn't work. Does this mean I can only use featuretools when I have two or more datasets? The code is below:

import featuretools as ft

# Create the entity set
es = ft.EntitySet(id = 'clients')

# Create an entity from the dataset (data is the DataFrame shown further down)
es = es.entity_from_dataframe(entity_id = 'app', dataframe = data, index ='customerid')

# Default primitives from featuretools
default_agg_primitives = ["sum", "std", "max", "skew", "min", "mean", "count", "percent_true", "num_unique", "mode"]
default_trans_primitives = ["day", "year", "month", "weekday", "haversine", "numwords", "characters"]

# DFS with specified primitives
feature_matrix, feature_names = ft.dfs(entityset = es, target_entity = 'app', trans_primitives = default_trans_primitives, agg_primitives=default_agg_primitives, max_depth = 2, features_only=False, verbose = True)

print('%d Total Features' % len(feature_names))

This returns the same number of features as there are columns in the dataframe. No new features are created.

kmax12 commented 5 years ago

@billy-odera can you provide an example of a feature you would expect to get created using just that one table?

billy-odera commented 5 years ago

@kmax12 This is the dataframe

customerid  age    outflows_amout  inflows_amount
1           28.00  0               355.00
2           72.00  1               240.00
3           22.00  6               nan

I would expect to get features such as the count, mean, and skew of outflows_amount, etc.

kmax12 commented 5 years ago

@billy-odera not sure I follow your example. If you want to calculate the mean outflows_amount per customer, you would want to create a second entity for your customers that has a relationship to the entity with multiple rows per customer, each with a different outflows_amount. Let me know if that's helpful, or please provide a complete example of what you want to generate so I can better help.
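To make the one-to-many idea above concrete, here is a minimal sketch in plain Python (standard library only, deliberately not the featuretools API) of what an aggregation such as count or mean computes. The `transactions` table, its `transaction_id` column, and the `aggregate_per_customer` helper are hypothetical and are not taken from the original poster's data.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical child table: several transactions per customer.
transactions = [
    {"transaction_id": 1, "customerid": 1, "outflows_amount": 10.0},
    {"transaction_id": 2, "customerid": 1, "outflows_amount": 30.0},
    {"transaction_id": 3, "customerid": 2, "outflows_amount": 5.0},
    {"transaction_id": 4, "customerid": 3, "outflows_amount": 6.0},
    {"transaction_id": 5, "customerid": 3, "outflows_amount": 8.0},
    {"transaction_id": 6, "customerid": 3, "outflows_amount": 10.0},
]


def aggregate_per_customer(rows, column):
    """Group child rows by customerid and summarise one column per customer."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["customerid"]].append(row[column])
    return {
        customer_id: {"count": len(values), "mean": mean(values), "max": max(values)}
        for customer_id, values in grouped.items()
    }


# One row of aggregated features per customer (the parent table).
customer_features = aggregate_per_customer(transactions, "outflows_amount")
print(customer_features[1])
```

The aggregation only has something to summarise because each customer owns several child rows. In featuretools terms, the customers table would be the target entity and the transactions table the related child entity.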

shellwang commented 5 years ago

Yes, I have encountered the same problem.

bukosabino commented 5 years ago

Hi @shellwang,

Can you give us more details about your goals?

As Max says, you need additional related tables to extract these kinds of features.

turkialjrees commented 5 years ago

I believe the issue here is understanding the fundamentals of automated feature engineering. A transformation acts on a single table (in Python terms, a table is just a pandas DataFrame) by creating new features out of one or more of the existing columns. Like many topics in machine learning, automated feature engineering is a complicated concept built on simple ideas. Using the concepts of entitysets, entities, and relationships, featuretools can perform deep feature synthesis to create new features. Deep feature synthesis in turn stacks feature primitives to build new features from multiple tables. There are two kinds of primitives: aggregations, which act across a one-to-many relationship between tables, and transformations, which are functions applied to one or more columns in a single table.
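As a rough illustration of the distinction, the sketch below uses plain Python (not featuretools) on the three-row table posted earlier in this thread. The column name is written as `outflows_amount`, and the `net_flow` feature and `add_net_flow` helper are hypothetical, chosen only to show a transformation that applies to numeric columns.

```python
import math
from collections import Counter

# The one-row-per-customer table from this thread.
customers = [
    {"customerid": 1, "age": 28.0, "outflows_amount": 0, "inflows_amount": 355.0},
    {"customerid": 2, "age": 72.0, "outflows_amount": 1, "inflows_amount": 240.0},
    {"customerid": 3, "age": 22.0, "outflows_amount": 6, "inflows_amount": float("nan")},
]


def add_net_flow(row):
    """Transformation: combine two columns of the same row into a new column."""
    new_row = dict(row)
    new_row["net_flow"] = row["inflows_amount"] - row["outflows_amount"]
    return new_row


# A transformation works on a single table: one new value per existing row.
transformed = [add_net_flow(row) for row in customers]

# An aggregation needs many rows per customer; here every group has exactly one.
rows_per_customer = Counter(row["customerid"] for row in customers)

print([row["net_flow"] for row in transformed])
print(dict(rows_per_customer))
```

With one row per customer, a per-customer count, mean, or skew has nothing to summarise. The transform primitives requested in the original code expect datetime, latitude/longitude, or text columns, none of which appear in this all-numeric table, which would explain why no new features came back.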

Read more, with a basic example, here: https://towardsdatascience.com/automated-feature-engineering-in-python-99baf11cc219