alteryx / featuretools

An open source python library for automated feature engineering
https://www.featuretools.com
BSD 3-Clause "New" or "Revised" License

Look-Ahead Bias in Generated Features #2731

Closed Nasser-Alkhulaifi closed 6 months ago

Nasser-Alkhulaifi commented 6 months ago

Hi,

I've noticed that some of the features generated by Featuretools exhibit look-ahead bias, which must be avoided in machine learning regression problems. Specifically, some features in X_train contain the exact values found in the same row of y_train, which leads to data leakage.

Example: In the attached screenshot, you can see that X_train (features) includes values that are present in the same row of y_train. This creates look-ahead bias. Such features (e.g., lags or rolling-window statistics) should be shifted so that only data available at forecasting time is used for prediction.

Questions:

  1. Why does this look-ahead bias exist in the generated features?
  2. Am I using the tool incorrectly?
  3. Is there a specific setting or method I am missing to avoid this issue?

Thank you.

FT

```python
import featuretools as ft
import pandas as pd
from featuretools.primitives import list_primitives

df = pd.read_csv(r"xxxxxxxC.csv")
df['DateTime'] = pd.to_datetime(df['DateTime'])

# Create an EntitySet
es = ft.EntitySet(id="data")

# Add the DataFrame to the EntitySet
es = es.add_dataframe(dataframe_name="df", dataframe=df, index="index", make_index=True, time_index="DateTime")

# List all available primitives
primitives = list_primitives()
agg_primitives = primitives[primitives['type'] == 'aggregation']['name'].tolist()
trans_primitives = primitives[primitives['type'] == 'transform']['name'].tolist()

# Run to create new features
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="df",
    agg_primitives=agg_primitives,  # Use all aggregation primitives
    trans_primitives=trans_primitives,  # Use all transformation primitives
)

feature_matrix
```

thehomebrewnerd commented 6 months ago

@Nasser-Alkhulaifi By default Featuretools is going to attempt to generate features from every column in the input dataframe that you provide. It has no way of knowing that it shouldn't generate features for a given column that is present in the data unless you instruct it to ignore the column.

There are multiple ways you can handle this based on your particular problem. For example, you can simply drop the column from your dataframe before creating the EntitySet. You could also use the ignore_columns argument when calling ft.dfs to tell Featuretools to not generate features from that column.
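The first option can be sketched with plain pandas. This is only an illustration on made-up data: the column names `temperature` and `target` are placeholders and do not come from the original CSV.

```python
import pandas as pd

# Hypothetical data: "target" is the column the model will predict.
df = pd.DataFrame({
    "DateTime": pd.date_range("2024-01-01", periods=4, freq="D"),
    "temperature": [20.1, 21.3, 19.8, 22.0],
    "target": [100.0, 110.0, 105.0, 120.0],
})

# Keep the target aside and drop it before building the EntitySet,
# so no generated feature can be derived from it.
y = df["target"]
features_input = df.drop(columns=["target"])

# Alternative (not executed here): leave the column in the EntitySet and pass
# ignore_columns={"df": ["target"]} to ft.dfs, which maps a dataframe name
# to the list of columns to skip.
```

Either way, the target never reaches the primitives, so it cannot reappear in the feature matrix.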

Nasser-Alkhulaifi commented 6 months ago

@thehomebrewnerd thank you for your quick response!

I appreciate your suggestions on how to exclude columns to avoid look-ahead bias. However, my concern is not about ignoring specific columns. My issue is related to the inherent look-ahead bias in the generated features and the lack of appropriate shifting to avoid this bias.

As you know, in time series forecasting, it is crucial to ensure that the features used for prediction do not include future information relative to the target variable. This means that features such as lags, rolling-window statistics, etc. need to be shifted appropriately so that only past data up to the prediction time is used!

From my observations, some of the features generated by Featuretools include exact values that correspond to the same row in the target variable (y_train). This introduces look-ahead bias and leads to data leakage, as the model gets access to future information that would not be available at prediction time!

For instance, consider lag1 as a feature: it must be shifted one step back so that it does not sit on the same row/index as the target variable y_train. The first row of any feature generated from the target variable should therefore be NaN, because the shifted value is not known at prediction time (t0)!
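The lag1 case described above can be illustrated with plain pandas. This is a sketch on a made-up series, not Featuretools output; the feature names `y_same_row`, `y_lag1`, and `y_roll3_mean` are hypothetical.

```python
import pandas as pd

# Hypothetical target series, already ordered by time.
y = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0], name="y")

features = pd.DataFrame({
    # Leaky: identical to the target on the same row.
    "y_same_row": y,
    # Safe lag: row t holds the value observed at t-1, so the first row is NaN.
    "y_lag1": y.shift(1),
    # Safe rolling mean: shift first, so the window covers t-3..t-1 only.
    "y_roll3_mean": y.shift(1).rolling(window=3).mean(),
})

print(features)
```

Shifting before applying the rolling window is what keeps the current row's target out of the statistic; rolling first and not shifting would include it.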

image

Does my point make sense? Is this clear to you and can it be added as a feature?

To put it simply, we need to avoid aligning any features that have information that won't be available at forecasting time on the same row/index as the target variable. I know I can work around this after the new dataframe of generated features is created, but I'm looking for a method or setting in Featuretools that ensures only past data is considered when creating features for time-series forecasting tasks.
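The after-the-fact workaround mentioned here might look roughly like the following. It is a sketch under assumptions: the small frame stands in for a generated feature matrix sorted by time, and the column name `target_cumsum` is invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in for a generated feature matrix, sorted by time.
feature_matrix = pd.DataFrame(
    {
        "target": [10.0, 11.0, 12.0, 13.0],
        "target_cumsum": [10.0, 21.0, 33.0, 46.0],
    },
    index=pd.date_range("2024-01-01", periods=4, freq="D"),
)

y = feature_matrix["target"]

# Shift every feature down one row so row t only carries values known at t-1.
X = feature_matrix.shift(1)

# Drop the first row, which has no history, and keep the target aligned.
X = X.iloc[1:]
y = y.iloc[1:]
```

A blanket shift like this is only appropriate for columns derived from values unknown at forecasting time; features that are known in advance, such as calendar fields, would not need it.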

Thank you!

thehomebrewnerd commented 6 months ago

@Nasser-Alkhulaifi Yes, your point makes sense. Featuretools has a set of primitives for creating features for time-series problems. Take a look at this guide for more information: https://featuretools.alteryx.com/en/stable/guides/time_series.html

Nasser-Alkhulaifi commented 6 months ago

Thank you @thehomebrewnerd

thehomebrewnerd commented 6 months ago

Closing this for now. Feel free to reopen if you encounter additional problems or find behavior in Featuretools that seems incorrect.