alteryx / featuretools

An open source python library for automated feature engineering
https://www.featuretools.com
BSD 3-Clause "New" or "Revised" License

how to add a dataframe whose rows are valid for a period of time with featuretools #2756

Open eddyfathi opened 2 weeks ago

eddyfathi commented 2 weeks ago

I am working on a dataset with multiple tables and using the featuretools library for feature engineering. One of the tables, which is NOT the target dataframe, has several columns; three of them are relevant here: ['rating', 'valid_from', 'valid_to']. I use valid_from as the time_index, but I am not sure how to incorporate the valid_to column. If this were the target dataframe I could have used valid_to as the cutoff times, but since it is not the target dataframe, I don't know how to set up the problem so there is no data leakage.

I also thought of using valid_to as the time_index, but then I am not sure how to incorporate the valid_from column.
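
For concreteness, here is a minimal sketch of the kind of table described above (the values and the secondary_id column name are made up for illustration; the key linking this table to the target dataframe is omitted):

```python
import pandas as pd

# Hypothetical secondary table: each row carries a rating that is only
# valid between valid_from and valid_to.
secondary_df = pd.DataFrame({
    "secondary_id": [1, 2, 3],
    "rating": [3.5, 4.0, 2.5],
    "valid_from": pd.to_datetime(["2023-01-01", "2023-06-01", "2023-03-15"]),
    "valid_to": pd.to_datetime(["2023-05-31", "2023-12-31", "2023-09-30"]),
})
```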

Adi6501 commented 6 days ago

Assuming `secondary_df` is your secondary dataframe, create (or reuse) an EntitySet and add the table with `valid_from` as its time_index. The snippets below use the featuretools 1.x API:

```python
import featuretools as ft

es = ft.EntitySet(id="your_entity_set")

# Add the secondary table with valid_from as the time_index
es = es.add_dataframe(
    dataframe_name="secondary_table",
    dataframe=secondary_df,   # your secondary dataframe
    index="secondary_id",     # primary key of the secondary table
    time_index="valid_from",  # use valid_from as the time_index
)
```

Make sure to filter by `valid_to` (next step) and add the relationship between this table and the target table. The parent dataframe and its index come first, then the child dataframe and its foreign-key column:

```python
es = es.add_relationship(
    "secondary_table", "secondary_id",  # parent: primary key of the secondary table
    "target_table", "target_id",        # child: foreign key in the target table
)
```

Filter the secondary table so that records whose `valid_to` falls before the cutoff are excluded. Doing this before feature engineering keeps expired records out of the feature calculations:

```python
def filter_valid_rows(df, cutoff_time):
    # keep only rows that are still valid at the cutoff
    return df[df["valid_to"] >= cutoff_time]

# Apply the filter to the raw dataframe and swap it back into the EntitySet.
# Here `cutoff` is a single, global cutoff time.
es.replace_dataframe("secondary_table", filter_valid_rows(secondary_df, cutoff))
```

Finally, use the filtered data in DFS, passing cutoff times for the target dataframe:

```python
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="target_table",
    cutoff_time=cutoff_times_df,  # DataFrame containing a cutoff time for each target instance
    features_only=False,
)
```
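
For reference, a hedged sketch of what `cutoff_times_df` might look like: one row per target instance, with an instance id column and a time column (column names here follow the `instance_id`/`time` convention; since the `valid_to` filter above used a single global cutoff, all cutoff times are the same in this sketch):

```python
import pandas as pd

# Hypothetical cutoff times: one row per target-table instance.
cutoff_times_df = pd.DataFrame({
    "instance_id": [101, 102, 103],
    "time": pd.to_datetime(["2023-07-01", "2023-07-01", "2023-07-01"]),
})
```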

Adi6501 commented 6 days ago

This should help you. If you have any questions, feel free to reach out.