alteryx / featuretools

An open source python library for automated feature engineering
https://www.featuretools.com
BSD 3-Clause "New" or "Revised" License

how to add a dataframe whose rows are valid for a period of time with featuretools #2756

Open eddyfathi opened 2 weeks ago

eddyfathi commented 2 weeks ago

I am working on a dataset with multiple tables and using the featuretools library for feature engineering. One of the tables, which is NOT the target dataframe, has several columns; three of them are relevant here: ['rating', 'valid_from', 'valid_to']. I use valid_from as the time_index, but I am not sure how to incorporate the valid_to column. If this were the target dataframe I could have used valid_to as the cutoff times, but since it is not the target dataframe, I don't know how to set up the problem so there is no data leakage.

I also thought of using valid_to as the time_index, but then I am not sure how to incorporate the valid_from column.
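
For concreteness, here is a minimal sketch of the kind of table described above (the values and the secondary_id column name are made up for illustration; the key linking this table to the target dataframe is omitted):

```python
import pandas as pd

# Hypothetical secondary table: each row carries a rating that is only
# valid between valid_from and valid_to.
secondary_df = pd.DataFrame({
    "secondary_id": [1, 2, 3],
    "rating": [3.5, 4.0, 2.5],
    "valid_from": pd.to_datetime(["2023-01-01", "2023-06-01", "2023-03-15"]),
    "valid_to": pd.to_datetime(["2023-05-31", "2023-12-31", "2023-09-30"]),
})
```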

Adi6501 commented 6 days ago

Assuming `secondary_df` is your secondary dataframe, create (or reuse) an EntitySet and add the table with `valid_from` as its time_index. The snippets below use the featuretools 1.x API:

```python
import featuretools as ft

es = ft.EntitySet(id="your_entity_set")

# Add the secondary table with valid_from as the time_index
es = es.add_dataframe(
    dataframe_name="secondary_table",
    dataframe=secondary_df,   # your secondary dataframe
    index="secondary_id",     # primary key of the secondary table
    time_index="valid_from",  # use valid_from as the time_index
)
```

Make sure to filter by `valid_to` (next step) and add the relationship between this table and the target table. The parent dataframe and its index come first, then the child dataframe and its foreign-key column:

```python
es = es.add_relationship(
    "secondary_table", "secondary_id",  # parent: primary key of the secondary table
    "target_table", "target_id",        # child: foreign key in the target table
)
```

Filter the secondary table so that records whose `valid_to` falls before the cutoff are excluded. Doing this before feature engineering keeps expired records out of the feature calculations:

```python
def filter_valid_rows(df, cutoff_time):
    # keep only rows that are still valid at the cutoff
    return df[df["valid_to"] >= cutoff_time]

# Apply the filter to the raw dataframe and swap it back into the EntitySet.
# Here `cutoff` is a single, global cutoff time.
es.replace_dataframe("secondary_table", filter_valid_rows(secondary_df, cutoff))
```

Finally, use the filtered data in DFS, passing cutoff times for the target dataframe:

```python
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="target_table",
    cutoff_time=cutoff_times_df,  # DataFrame containing a cutoff time for each target instance
    features_only=False,
)
```
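
For reference, a hedged sketch of what `cutoff_times_df` might look like: one row per target instance, with an instance id column and a time column (column names here follow the `instance_id`/`time` convention; since the `valid_to` filter above used a single global cutoff, all cutoff times are the same in this sketch):

```python
import pandas as pd

# Hypothetical cutoff times: one row per target-table instance.
cutoff_times_df = pd.DataFrame({
    "instance_id": [101, 102, 103],
    "time": pd.to_datetime(["2023-07-01", "2023-07-01", "2023-07-01"]),
})
```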

Adi6501 commented 6 days ago

This should help you. If you have any questions, feel free to reach out.