blue-yonder / tsfresh

Automatic extraction of relevant features from time series:
http://tsfresh.readthedocs.io
MIT License

Look-Ahead Bias in Generated Features #1074

Open Nasser-Alkhulaifi opened 4 months ago

Nasser-Alkhulaifi commented 4 months ago

Hi,

I've noticed that some of the generated features exhibit look-ahead bias, which must be avoided in machine learning regression problems. Specifically, some features in X_train contain the exact values found in the same row of y_train, which leads to data leakage.

Example: In the attached screenshot, X_train (features) includes values that also appear in the same row of y_train, which creates look-ahead bias. Such features (e.g., lags or rolling-window statistics) should be shifted so that only data available at forecasting time is used for prediction.
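To make the concern concrete, here is a minimal pure-Python sketch. It is not tsfresh code; `rolling_mean` and its `shift` argument are hypothetical names used only for illustration. It shows how a rolling feature that includes the current row can reproduce the target exactly, and how shifting by one step removes the leak.

```python
def rolling_mean(values, window, shift=0):
    """Rolling mean over `window` points, ending `shift` steps before each row.

    shift=0 includes the current row (leaky if the target is that same row);
    shift=1 stops one step earlier, so only past values are used.
    """
    out = []
    for t in range(len(values)):
        end = t + 1 - shift      # exclusive end index of the window
        start = end - window
        if start < 0:
            out.append(None)     # not enough history yet
        else:
            out.append(sum(values[start:end]) / window)
    return out


y = [10.0, 12.0, 11.0, 15.0, 14.0]          # target series

leaky = rolling_mean(y, window=1, shift=0)  # identical to y: pure leakage
safe = rolling_mean(y, window=1, shift=1)   # a plain lag-1 feature

print(leaky)
print(safe)
```

With `shift=0` and a window of one, the "feature" is just the target itself, which is the pattern described above. With `shift=1`, each row only sees the previous value.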

Questions:

Why does this look-ahead bias exist in the generated features? Am I using the tool incorrectly? Is there a specific setting or method I am missing to avoid this issue?

Thank you.

[Screenshot: X_train features shown next to y_train]

nils-braun commented 4 months ago

Hi @Nasser-Alkhulaifi - you are correct, look-ahead bias is not good. tsfresh comes with a toolkit for managing forecasting datasets (https://tsfresh.readthedocs.io/en/latest/text/forecasting.html), which allows you to define which data should be taken into account when calculating the features. I do not know how you used tsfresh, but if you use the methods documented in the link, you should not get any look-ahead bias (because tsfresh simply cannot see the more recent data).

Nasser-Alkhulaifi commented 3 months ago

Hi @nils-braun

Apologies for the delayed response and thanks for sharing this.

I understand the effectiveness of rolling windows in preventing look-ahead bias, but the need to manually specify parameters such as max_timeshift seems to contradict the goal of automated feature extraction. Requiring users to determine these parameters themselves adds complexity and user intervention, which may not align with the ease of use and automation that tsfresh aims to provide.

Do you see what I mean, or am I missing something here?

So I'm just wondering: is there a possibility to incorporate a more dynamic approach within tsfresh that determines these parameters automatically, thus keeping the ease of use and automation tsfresh aims to provide?

Thank you again for this great package!

nils-braun commented 3 months ago

Hi @Nasser-Alkhulaifi,

"do you see what I mean"

Yes, I think I understand, although this is different from your first post: it is about UX rather than look-ahead bias. That does not make it less important, and maybe my first answer was not relevant to your question. Still, I do think the defaults of the methods are chosen so that most users do not need to change them. Happy to learn more if you think this is not the case, but let me explain:

Maybe there is a misunderstanding in how to use the function, so let me give some details. The main method to roll time series is roll_time_series. Apart from the usual column parameters and some configuration for multiprocessing etc. (which is the same as for the extract function), this method has three rolling-related parameters:

Maybe I misunderstood, but in summary I do think that you can use roll_time_series without much expert knowledge in most cases: you do not need to change the default values, yet all options remain available if you want them. Happy to discuss this if you think you have a use case that does not fit here.
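As a rough illustration of the idea behind rolling, here is a pure-Python imitation. It is not tsfresh's implementation and it does not call tsfresh; `roll_windows`, `max_len` and `min_len` are hypothetical names. They play a role loosely similar to the `max_timeshift` and `min_timeshift` parameters of roll_time_series, but the exact semantics (including any off-by-one in window length) should be checked against the tsfresh documentation. Each window ends at one time step, features are computed from that window only, and the target is the next value, so the features cannot see the value they are meant to predict.

```python
def roll_windows(values, max_len=None, min_len=1):
    """Return (end_index, window) pairs, one window per time step.

    Each window ends at `end_index` (inclusive) and reaches back at most
    `max_len` points (None means all earlier history). Windows shorter
    than `min_len` are dropped.
    """
    windows = []
    for end in range(len(values)):
        start = 0 if max_len is None else max(0, end + 1 - max_len)
        window = values[start:end + 1]
        if len(window) >= min_len:
            windows.append((end, window))
    return windows


series = [3.0, 5.0, 4.0, 8.0, 6.0]

# Pair features from each window with the NEXT value as the target.
rows = []
for end, window in roll_windows(series, max_len=3, min_len=2):
    if end + 1 < len(series):  # the last window has no future value
        features = {"mean": sum(window) / len(window), "last": window[-1]}
        rows.append((features, series[end + 1]))

for features, target in rows:
    print(features, "->", target)
```

Leaving `max_len` at `None` and `min_len` at `1` mirrors the "just use the defaults" case in spirit: every window grows to include all history up to its end point, and no tuning is required.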

PS: If you are asking why we do not apply the roll_time_series method automatically: not everyone has a forecasting use case, so we wanted to keep this functionality separate.

Nasser-Alkhulaifi commented 3 months ago

Apologies for the delayed response @nils-braun and thank you for the detailed explanation - really appreciate it.

I'll have a look and get back to you if I have any further questions/comments.