Look-Ahead Bias in Generated Features

Nasser-Alkhulaifi commented 5 months ago

Hi,

I've noticed that some of the generated features exhibit look-ahead bias, which is critical and must be avoided in machine learning regression problems. Specifically, the features in X_train contain the exact values that appear in the same row of y_train, which leads to data leakage.

Example: In the attached screenshot, you can see that X_train (features) includes values that are present in the same row of y_train. This creates look-ahead bias. Such features (e.g., lags or rolling-window statistics) should be shifted so that only data available at forecasting time is used for prediction.
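
To illustrate the kind of shifting meant here, below is a minimal plain-Python sketch. The helper names `lag_feature` and `past_mean` are hypothetical and are not part of tsfresh; the point is only that the feature for row t is built from values strictly before t.

```python
def lag_feature(values, lag=1):
    """Value observed `lag` steps earlier; None where no history exists yet."""
    if lag < 1:
        raise ValueError("lag must be at least 1, otherwise the current target leaks in")
    n = len(values)
    return [None] * min(lag, n) + values[: max(n - lag, 0)]


def past_mean(values, window):
    """Mean of the `window` values strictly before each position (excludes position t)."""
    result = []
    for t in range(len(values)):
        past = values[max(0, t - window):t]  # stops at t - 1, so y[t] is never used
        result.append(sum(past) / window if len(past) == window else None)
    return result


y = [3.0, 5.0, 7.0, 9.0, 11.0]   # target series
lag_1 = lag_feature(y, lag=1)     # feature for row t is y[t - 1]
past_mean_2 = past_mean(y, 2)     # feature for row t is mean(y[t - 2], y[t - 1])
```

Rows where the feature is None have too little history and would normally be dropped before training.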

Questions:

Why does this look-ahead bias exist in the generated features? Am I using the tool incorrectly? Is there a specific setting or method I am missing to avoid this issue?

Thank you.

nils-braun commented 5 months ago

Hi @Nasser-Alkhulaifi - you are correct, look-ahead bias is not good. tsfresh comes with a toolkit for managing forecasting datasets (https://tsfresh.readthedocs.io/en/latest/text/forecasting.html), which allows you to define which data should be taken into account when calculating the features. I do not know how you used tsfresh, but if you use the methods documented in the link, you should not get any look-ahead bias (because tsfresh simply cannot see the more recent data).

Nasser-Alkhulaifi commented 5 months ago

Hi @nils-braun

Apologies for the delayed response and thanks for sharing this.

I understand the effectiveness of rolling windows in preventing look-ahead bias, but the need to manually specify parameters such as max_timeshift seems to contradict the goal of automated feature extraction. Requiring users to determine these parameters themselves introduces a level of complexity and user intervention that may not align with the ease of use and automation that tsfresh aims to provide.

Do you see what I mean, or am I missing something here?

So I'm just wondering: would it be possible to incorporate a more dynamic approach within tsfresh that determines these parameters automatically, and so keeps that ease of use and automation?

Thank you again for this great package!

nils-braun commented 5 months ago

Hi @Nasser-Alkhulaifi,

> do you see what I mean

Yes, I think I understand. This is a different point from your first post, because it is about UX and not about look-ahead bias (that does not make it less important, but it may mean my first answer was not relevant to your question). However, I do think the defaults of the methods are chosen so that most users do not need to change them. Happy to learn more if you think this is not the case, but let me explain:

Maybe there is a misunderstanding about how to use the function, so let me give some details. The main method to roll time series is roll_time_series. Apart from the usual column parameters and some configuration for multiprocessing etc. (which is the same as for the extract function), this method has three rolling-related parameters:

rolling_direction: its sign is only used for the nomenclature of the resulting ids, and its absolute value is the amount of shifting. For 90% of the use cases (maybe even more), the default value of 1 makes the most sense.
max_timeshift: this defines the maximum length of a resulting time series. Should we look 10 steps into the past or 100? In principle you could say that using anything other than None (the default) is mostly a runtime optimization, because longer time series take longer to compute and it does not hurt to look further back into the past. If you want to set this value (for runtime reasons), the actual value depends on your use case: for something that depends e.g. on the day of the year, it might make sense to take the last year into account. Or maybe the last two years? Only the physical process behind the data can tell you. But as I said: I personally see this mostly as a runtime optimization.
min_timeshift: similar to the above, but this time the minimum length. Probably even less important to change, because including those short time series or not does not change the runtime much. Keeping the very short time series will probably mean that the values predicted from them are nonsense (because they are very short). But what "too short" means again depends on the physical process behind the time series.
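
To make max_timeshift and min_timeshift concrete, here is a minimal plain-Python sketch of the rolling idea. It is not tsfresh's implementation: `roll_windows` is a hypothetical helper, rolling_direction is fixed to 1, and the exact conventions (a window ending at step t holds at most max_timeshift + 1 values, and windows with min_timeshift values or fewer are dropped) are assumptions made for illustration that should be checked against the roll_time_series documentation.

```python
def roll_windows(values, max_timeshift=None, min_timeshift=0):
    """Map each end index to a backwards-looking window that ends at that index."""
    windows = {}
    for end in range(len(values)):
        # None means "look back as far as the data goes".
        start = 0 if max_timeshift is None else max(0, end - max_timeshift)
        window = values[start:end + 1]
        # Assumed convention: drop windows that are too short to be meaningful.
        if len(window) > min_timeshift:
            windows[end] = window
    return windows


series = [10, 11, 12, 13, 14]

# Defaults: windows grow with the available history.
default_windows = roll_windows(series)

# Cap the look-back and drop the single-value window.
capped_windows = roll_windows(series, max_timeshift=2, min_timeshift=1)

# Pair each window with the NEXT value as the target, so the features
# computed on a window never include the value being predicted.
pairs = [
    (window, series[end + 1])
    for end, window in capped_windows.items()
    if end + 1 < len(series)
]
```

In this sketch the defaults already give look-ahead-free windows; changing max_timeshift only shortens the history each window sees, which matches the view above that it is mainly a runtime setting.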

Maybe I misunderstood, but in summary I do think that you can use roll_time_series without much expert knowledge in most cases (you do not need to change the default values, but all the options are still there if you want them). Happy to discuss this if you think you have a use case that does not fit here.

PS: if you are asking why we do not apply the roll_time_series method automatically: not everyone has a forecasting use-case, so we wanted to split this functionality.

Nasser-Alkhulaifi commented 4 months ago

Apologies for the delayed response @nils-braun and thank you for the detailed explanation - really appreciate it.

I'll have a look and get back to you if I have any further questions/comments.

blue-yonder / tsfresh

Look-Ahead Bias in Generated Features #1074