blue-yonder / tsfresh

Automatic extraction of relevant features from time series:
http://tsfresh.readthedocs.io
MIT License

Avoid leaking indices from training data sets as features; classification accuracy depends on order of input time series in data frame #162

Closed fkirc closed 7 years ago

fkirc commented 7 years ago

I am attempting to use tsfresh for a simple binary classification using a k-nearest-neighbor classifier and k-fold cross-validation. However, the classification accuracy depends on the order of the input time series, which should not be relevant at all.

The underlying problem is the set of features selected by select_features: value__index_mass_quantile__q_0.8, value__index_mass_quantile__q_0.7, value__index_mass_quantile__q_0.2, value__index_mass_quantile__q_0.3, and so on. All of them are directly proportional to the id in the training data set.

Now the k-nearest-neighbor classifier just has to decide whether these index "features" are above a certain threshold to make a correct classification.

I need to disable the consideration of the index for feature extraction. Using the index of the samples in my training data as input for feature extraction reduces my model to absurdity. All features should be based only on the time stamps and the associated values, not on the order of the samples in my input data.

How can I disable this incorrect behavior?

extracted_features = extract_features(time_series, column_id="id", column_sort="time", column_value="value")
impute(extracted_features)
features_filtered = select_features(extracted_features, y) # use features_filtered and y as input for k-fold validation

The time_series data frame is constructed in the same way as in the robot execution failures example:

     id  time  value
0     0     1    760
1     0    11    761
2     0   466    761
3     0   473    765
4     0   481    763
5     0   488    761
6     0   516    763
7     0   532    763
8     0   542    756
9     0   610    756
10    0   618    757
11    0   885    757
12    0  1189    757
13    0  1206    758
14    0  1263    758
15    0  1275    760
16    0  1295    768
17    1     1    760
18    1    11    761
19    1   466    761
20    1   473    765
21    1   481    763
22    1   488    761
23    1   516    763
..   ..   ...    ...
538  31   885    757
539  31  1189    757
540  31  1206    758
541  31  1263    758
542  31  1275    760
543  31  5000    768
544  32     1    760
545  32    11    761
546  32   466    761
547  32   473    765
548  32   481    763
549  32   488    761
550  32   516    763
551  32   532    763
552  32   542    756
553  32   610    756
554  32   618    757
555  32   885    757
556  32  1189    757
557  32  1206    758
558  32  1263    758
559  32  1275    760
560  32  5000    768

The same goes for the target labels y:

0     1
1     1
2     1
3     1
4     1
5     1
6     1
7     1
8     1
9     1
10    1
11    2
12    2
13    2
14    2
15    2
16    2
17    2
18    2
19    2
20    2
21    2
22    2
23    2
24    2
25    2
26    2
27    2
28    2
29    2
30    2
31    2
32    2
dtype: int64

MaxBenChrist commented 7 years ago

I don't fully understand the problem here.

Therefore all features should be based only on the time stamps and the associated values, but not on the order of the samples in my input data.

No feature calculator in tsfresh has access to the id column.

(There is a difference between id, which denotes the id of the device, and index, which holds the indices or time stamps of the time series.)

Where does your data originate from?

MaxBenChrist commented 7 years ago

You can also upload an IPython notebook + data so we can have a look at it.

fkirc commented 7 years ago

The garbage feature data originates from the indices of the time series. I have 32 time series with indices from 1 to 32. Each of these series consists of several values and time stamps, which are ignored by the value__index_mass_quantile__q_* features.

Only these indices from 1 to 32 are considered.

MaxBenChrist commented 7 years ago

Which version of tsfresh are you using?

Each of these series consists of several values and time stamps, which are ignored by the value__index_mass_quantile__q_* features

I checked the code but could not find a leak. The feature calculators should not have access to the id column if your parameters (column_id="id", column_sort="time", column_value="value") are set correctly.
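For context, here is a simplified sketch of what the index_mass_quantile calculator is meant to return for a single series, adapted from its description in the tsfresh documentation (this is not the library's exact code): the relative position inside the series at which a fraction q of the total absolute "mass" of the values is reached. Note that it only touches the values, never the id.

import numpy as np

def index_mass_quantile(values, q):
    # Cumulative share of the total absolute "mass" of the series.
    abs_values = np.abs(np.asarray(values, dtype=float))
    mass = np.cumsum(abs_values) / abs_values.sum()
    # Relative position (in (0, 1]) at which a fraction q of the mass is reached.
    return (np.argmax(mass >= q) + 1) / len(values)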

We can't help you without a minimal example.

fkirc commented 7 years ago

Here is the relevant code; I hope you can spot the leak. The extracted features are literally garbage, since they fully depend on the order of the input data samples parsed from the file. I use version 0.6.0, installed with pip.

features.py:

import pandas as pd
from tsfresh import extract_features, select_features
from tsfresh.utilities.dataframe_functions import impute

def get_features(file_record, labels):

    time_series, target_classes = construct_tsfresh_input(file_record, labels)
    extracted_features = extract_features(time_series, column_sort="time", column_value="value", column_id="id")
    impute(extracted_features)
    X = select_features(extracted_features, target_classes)
    Y = pd.Series(labels)
    return X, Y

def construct_tsfresh_input(file_record, labels):
    # Create mapping from string labels to integers
    label_mapping = {}
    label_id = 0
    for item in labels:
        if item not in label_mapping:
            label_id += 1
            label_mapping[item] = label_id

    # Build pandas DataFrame and Series as input for tsfresh
    id_to_target = {}
    df_rows = []
    cur_id = 0
    for trace in file_record:
        id_to_target[cur_id] = label_mapping[labels[cur_id]]
        for point in trace:
            time_stamp = point[1]
            value = point[0]
            df_rows.append([cur_id, time_stamp, value])
        cur_id += 1

    df = pd.DataFrame(df_rows, columns=['id', 'time', 'value'])
    y = pd.Series(id_to_target)

    return df, y

classifier.py:


from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

import features as ft  # the features.py module above
import input_parser    # the project's custom file parser

def main():

    file_record, labels = input_parser.parse_input() # custom file parse function
    X, Y = ft.get_features(file_record, labels)

    folds = 5
    kf = KFold(n_splits=folds)
    kf.get_n_splits(X)
    summed_accuracy = 0

    for train_index, test_index in kf.split(X):

        X_train, X_test = X.iloc[train_index,], X.iloc[test_index,]
        Y_train, Y_test = Y[train_index], Y[test_index]
        model = KNeighborsClassifier(n_neighbors=3)
        model.fit(X_train,Y_train)
        predictions = model.predict(X_test)

        summed_accuracy += accuracy_score(Y_test, predictions)
        print(confusion_matrix(Y_test,predictions))
        print(classification_report(Y_test,predictions))

    print("Total accuracy: " + str(summed_accuracy / folds))

main()

fkirc commented 7 years ago

These are the garbage features extracted by the above code for the 32 time series shown in the original post; as you can see, all of them depend on the id:

   value__index_mass_quantile__q_0.9  value__index_mass_quantile__q_0.4  \
id
0 0.941176 0.411765
1 1.941176 1.411765
2 2.941176 2.411765
3 3.941176 3.411765
4 4.941176 4.411765
5 5.941176 5.411765
6 6.941176 6.411765
7 7.941176 7.411765
8 8.941176 8.411765
9 9.941176 9.411765
10 10.941176 10.411765
11 11.941176 11.411765
12 12.941176 12.411765
13 13.941176 13.411765
14 14.941176 14.411765
15 15.941176 15.411765
16 16.941176 16.411765
17 17.941176 17.411765
18 18.941176 18.411765
19 19.941176 19.411765
20 20.941176 20.411765
21 21.941176 21.411765
22 22.941176 22.411765
23 23.941176 23.411765
24 24.941176 24.411765
25 25.941176 25.411765
26 26.941176 26.411765
27 27.941176 27.411765
28 28.941176 28.411765
29 29.941176 29.411765
30 30.941176 30.411765
31 31.941176 31.411765
32 32.941176 32.411765

   value__index_mass_quantile__q_0.1  value__index_mass_quantile__q_0.3  \
id
0 0.117647 0.352941
1 1.117647 1.352941
2 2.117647 2.352941
3 3.117647 3.352941
4 4.117647 4.352941
5 5.117647 5.352941
6 6.117647 6.352941
7 7.117647 7.352941
8 8.117647 8.352941
9 9.117647 9.352941
10 10.117647 10.352941
11 11.117647 11.352941
12 12.117647 12.352941
13 13.117647 13.352941
14 14.117647 14.352941
...

-- Extracted feature columns: Index(['value__index_mass_quantile__q_0.9', 'value__index_mass_quantile__q_0.4', 'value__index_mass_quantile__q_0.1', 'value__index_mass_quantile__q_0.3', 'value__index_mass_quantile__q_0.7', 'value__index_mass_quantile__q_0.8', 'value__index_mass_quantile__q_0.2', 'value__index_mass_quantile__q_0.6'], dtype='object')

MaxBenChrist commented 7 years ago

The code snippets are not enough, we need the data as well.

If you want us to help, you will have to provide a minimal notebook + data.

MaxBenChrist commented 7 years ago

Here is the relevant code; I hope you can spot the leak. The extracted features are literally garbage, since they fully depend on the order of the input data samples parsed from the file.

I am still not convinced that this originates from tsfresh.

MaxBenChrist commented 7 years ago

X_train, X_test = X.iloc[train_index,], X.iloc[test_index,]

Here you should use loc instead of iloc.

MaxBenChrist commented 7 years ago

Please reopen this issue when you have provided the minimal example

fkirc commented 7 years ago

Alright, I created an IPython example and uploaded it to a GitHub repository: https://github.com/fkirc/tsfresh-time-series-id-leaking-as-features

Apparently I am not able to reopen this issue.

fkirc commented 7 years ago

You can just clone the repo, launch Jupyter Notebook, and execute the first cell; it should then run the whole code.

MaxBenChrist commented 7 years ago

This is not a minimal example. I will have to look through your whole code.

fkirc commented 7 years ago

You can ignore the whole file-parsing code; it is just a part extracted from my project. The relevant parts are the k-fold validation and the usage of tsfresh itself.

MaxBenChrist commented 7 years ago

A KNeighborsClassifier on these features does not really make sense.

fkirc commented 7 years ago

Actually, you can also ignore the k-fold validation, since the problem is just about the extracted features, which are dependent on the ids.

fkirc commented 7 years ago

I always see these features containing __index_, and I am sure there must be an easy way to prevent this, since it is clearly wrong to use the order of the input time series as a feature (the index is nothing else than an ascending number assigned by my pandas DataFrame creation code for each time series).

Either it is a major issue, or I just overlook something more obvious.

I will try to drop all feature columns that are based on the time series indices before calling select_features.
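For reference, a minimal sketch of that workaround, assuming the extracted_features and y objects from the snippets above (the intermediate variable name is illustrative):

# Drop every column produced by the index_mass_quantile calculator
# before running the feature selection.
index_cols = [c for c in extracted_features.columns if "index_mass_quantile" in c]
features_without_index = extracted_features.drop(index_cols, axis=1)
features_filtered = select_features(features_without_index, y)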

fkirc commented 7 years ago

Do you have some useful example code on how to do multi-class classification based on the features selected by tsfresh?

Even the presence of these index-based features appears really weird. A classic machine learning pitfall of leaking unintended information into the extracted features.

My sample data set deliberately has only two target labels with exactly the same time series; nevertheless, I can achieve 100% classification accuracy with kNN if I put the input data in the right order, i.e., such that all indices labeled as class 1 are lower than the indices labeled as class 2.

#54 is an important issue, and I would be willing to contribute something if I am able to get the hang of how this is supposed to work.

MaxBenChrist commented 7 years ago

my comment about

X_train, X_test = X.loc[train_index,], X.loc[test_index,]

is wrong; the cross validation returns numerical indices, so iloc is the way to go.
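A quick toy illustration of why (hypothetical, not from the thread): KFold yields positional, 0-based index arrays regardless of the DataFrame's index labels, which is exactly what iloc expects.

import numpy as np
from sklearn.model_selection import KFold

X_toy = np.arange(10).reshape(5, 2)
for train_idx, test_idx in KFold(n_splits=5).split(X_toy):
    # train_idx and test_idx are positional index arrays such as [1 2 3 4] / [0],
    # not labels from a DataFrame index.
    print(train_idx, test_idx)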

The id is not leaking, so this is not a bug in tsfresh.

It is just the dynamics of your time series (you can, for example, permute the values of your id column or add a constant to it without changing the values of the feature matrix). It is a strange coincidence, but the feature matrix is calculated correctly.
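A minimal sketch of that sanity check, assuming the time_series frame and extracted_features matrix from the earlier snippets (the direct value comparison is illustrative):

import numpy as np

# Shift every id by a constant; the per-series values and time stamps are
# untouched, so the computed features should come out identical.
shifted = time_series.copy()
shifted["id"] = shifted["id"] + 100
shifted_features = extract_features(shifted, column_id="id", column_sort="time", column_value="value")
impute(shifted_features)

# Rows are ordered by id in both matrices, so the values can be compared directly.
assert np.allclose(shifted_features.values, extracted_features.values)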

I will close this issue.

MaxBenChrist commented 7 years ago

Do you have some useful example code on how to do multi-class classification based on the features selected by tsfresh?

Maybe look for some examples in sklearn. The extracted feature matrix has the right format to use with them (see http://scikit-learn.org/stable/modules/multiclass.html).
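For illustration, a minimal sketch (not from this thread) of feeding the selected feature matrix into a multi-class sklearn classifier, assuming the X and Y returned by the get_features function above:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Any sklearn classifier handles multi-class targets out of the box when
# given the (samples x features) matrix that tsfresh produces.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(clf, X, Y, cv=5)
print("Mean accuracy:", scores.mean())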

Even the presence of these index-based features appears really weird. A classic machine learning pitfall of leaking unintended information into the extracted features.

Nope. There is a difference between id and index. No feature has access to the id.

MaxBenChrist commented 7 years ago

#54 is an important issue, and I would be willing to contribute something if I get the hang of how this is supposed to work.

Can you explain how this issue is related to #54? You are using tsfresh for classification, right?

fkirc commented 7 years ago

Can you explain how this issue is related to #54? You are using tsfresh for classification, right?

Yes, it may be related if the forecasting task is to predict one outcome out of a fixed number of predefined outcomes.

Nope. There is a difference between id and index. No feature has access to the id. The id is not leaking, so this is not a bug in tsfresh.

It is just the dynamics of your time series (you can, for example, permute the values of your id column or add a constant to it without changing the values of the feature matrix). It is a strange coincidence, but the feature matrix is calculated correctly.

Really strange, but I can confirm it after adding an offset of 100 to all time series ids... Thank you for your help, I will try to improve on that.

(There is a difference between id, which denotes the id of the device, and index, which holds the indices or time stamps of the time series.)

So the id gets assigned to each time stamp/value pair to mark the time series (a time series is a sequence of time stamps and values in a pandas DataFrame marked by the same id, as in robot_execution_failures.py). But what is the device? The label assigned to a time series, which is used as ground truth for training and testing models?

MaxBenChrist commented 7 years ago

Id denotes the different devices or samples; see http://tsfresh.readthedocs.io/en/v0.6.0/text/quick_start.html

So every sample (= device) has one class label and can be associated with multiple time series.
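For illustration, a hypothetical minimal input in that format (the numbers are made up): two devices with ids 0 and 1, each contributing one time series and exactly one label:

import pandas as pd

# Long format: every row belongs to the device named in its id column.
df = pd.DataFrame({
    "id":    [0, 0, 0, 1, 1, 1],
    "time":  [1, 2, 3, 1, 2, 3],
    "value": [760, 761, 765, 758, 757, 760],
})

# One class label per id; this is the target that select_features compares against.
y = pd.Series({0: 1, 1: 2})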

MaxBenChrist commented 7 years ago

You can also have a look at our paper:

Christ, M., Kempa-Liehr, A.W. and Feindt, M. (2016). Distributed and parallel time series feature extraction for industrial big data applications. ArXiv e-print 1610.07717, https://arxiv.org/abs/1610.07717.

In Section 2 we explain this setup.

MaxBenChrist commented 7 years ago

This issue is probably related to #175

MaxBenChrist commented 7 years ago

This is related to #175

MaxBenChrist commented 7 years ago

Version 0.7.0 should solve this one as well, @fkirc.

fkirc commented 7 years ago

Thank you, I think this has fixed the issue.