Closed fkirc closed 7 years ago
I don't fully understand the problem here?
Therefore all features should be only based on the time stamps and the associated values, but not on the order of the samples in my input data.
No feature calculator in tsfresh has access to the id column.
(There is a difference between id, which denotes the id of the device, and index, which is the index or time stamp of the time series.)
where does your data originate from?
You can also upload an ipython notebook + data so we can have a look at it
The garbage feature data originates from the indices of the time series. I have 32 time series with indices from 1 to 32. Each of these series consists of several values and time stamps, which are ignored for the value__index_mass_quantile__q_* features.
Only these indices from 1 to 32 are considered.
which version of tsfresh are you using?
Each of these series consists of several values and time stamps, which are ignored for the value__index_mass_quantile__q_* features.
I checked the code but I could not find a leak. The feature calculators should not have access to the id column if your parameters (column_id="id", column_sort="time", column_value="value") are set correctly.
We can't help you without a minimal example.
Here is the relevant code, I hope you can spot the leak. The extracted features are literally garbage since they fully depend on the order of the input data samples parsed from the file. I use version 0.6.0, installed with pip.
features.py:

    import pandas as pd ...

    def get_features(file_record, labels):
        time_series, target_classes = construct_tsfresh_input(file_record, labels)
        extracted_features = extract_features(time_series, column_sort="time", column_value="value", column_id="id")
        impute(extracted_features)
        X = select_features(extracted_features, target_classes)
        Y = pd.Series(labels)
        return X, Y

    def construct_tsfresh_input(file_record, labels):
        # Create mapping from string labels to integers
        label_mapping = {}
        label_id = 0
        for item in labels:
            if label_id > 0 and item in label_mapping:
                continue
            else:
                label_id += 1
                label_mapping[item] = label_id
        # Build panda frame and series as input for tsfresh
        id_to_target = {}
        df_rows = []
        cur_id = 0
        for trace in file_record:
            id_to_target[cur_id] = label_mapping[labels[cur_id]]
            for point in trace:
                time_stamp = point[1]
                value = point[0]
                df_rows.append([cur_id, time_stamp, value])
            cur_id += 1
        df = pd.DataFrame(df_rows, columns=['id', 'time', 'value'])
        y = pd.Series(id_to_target)
        return df, y
classifier.py:

    def main():
        file_record, labels = input_parser.parse_input()  # custom file parse function
        X, Y = ft.get_features(file_record, labels)
        folds = 5
        kf = KFold(n_splits=folds)
        kf.get_n_splits(X)
        summed_accuracy = 0
        for train_index, test_index in kf.split(X):
            X_train, X_test = X.iloc[train_index,], X.iloc[test_index,]
            Y_train, Y_test = Y[train_index], Y[test_index]
            model = KNeighborsClassifier(n_neighbors=3)
            model.fit(X_train,Y_train)
            predictions = model.predict(X_test)
            summed_accuracy += accuracy_score(Y_test, predictions)
            print(confusion_matrix(Y_test,predictions))
            print(classification_report(Y_test,predictions))
        print("Total accuracy: " + str(summed_accuracy / folds))

    main()
These are the garbage features extracted by the above code for the 32 time series shown in the original post. As you can see, all of them depend on the id. Note that a single white space separates the columns.
value__index_mass_quantile__q_0.9 value__index_mass_quantile__q_0.4 \
id
0 0.941176 0.411765
1 1.941176 1.411765
2 2.941176 2.411765
3 3.941176 3.411765
4 4.941176 4.411765
5 5.941176 5.411765
6 6.941176 6.411765
7 7.941176 7.411765
8 8.941176 8.411765
9 9.941176 9.411765
10 10.941176 10.411765
11 11.941176 11.411765
12 12.941176 12.411765
13 13.941176 13.411765
14 14.941176 14.411765
15 15.941176 15.411765
16 16.941176 16.411765
17 17.941176 17.411765
18 18.941176 18.411765
19 19.941176 19.411765
20 20.941176 20.411765
21 21.941176 21.411765
22 22.941176 22.411765
23 23.941176 23.411765
24 24.941176 24.411765
25 25.941176 25.411765
26 26.941176 26.411765
27 27.941176 27.411765
28 28.941176 28.411765
29 29.941176 29.411765
30 30.941176 30.411765
31 31.941176 31.411765
32 32.941176 32.411765
value__index_mass_quantile__q_0.1 value__index_mass_quantile__q_0.3 \
id
0 0.117647 0.352941
1 1.117647 1.352941
2 2.117647 2.352941
3 3.117647 3.352941
4 4.117647 4.352941
5 5.117647 5.352941
6 6.117647 6.352941
7 7.117647 7.352941
8 8.117647 8.352941
9 9.117647 9.352941
10 10.117647 10.352941
11 11.117647 11.352941
12 12.117647 12.352941
13 13.117647 13.352941
14 14.117647 14.352941
.......
-- Extracted feature columns: Index(['value__index_mass_quantile__q_0.9', 'value__index_mass_quantile__q_0.4', 'value__index_mass_quantile__q_0.1', 'value__index_mass_quantile__q_0.3', 'value__index_mass_quantile__q_0.7', 'value__index_mass_quantile__q_0.8', 'value__index_mass_quantile__q_0.2', 'value__index_mass_quantile__q_0.6'], dtype='object')
The code snippets are not enough, we need the data as well.
If you want us to help, you will have to provide a minimal notebook + data.
Here is the relevant code, I hope you can spot the leak. The extracted features are literally garbage since they fully depend on the order of the input data samples parsed from the file.
I am still not convinced that this originates from tsfresh
X_train, X_test = X.iloc[train_index,], X.iloc[test_index,]
Here you should use loc instead of iloc.
Please reopen this issue when you have provided the minimal example
Alright I created an ipython example and uploaded it to a github repository: https://github.com/fkirc/tsfresh-time-series-id-leaking-as-features
Apparently I am not able to reopen this issue.
You can just clone the repo, launch jupyter notebook and execute the first cell, then it should execute the whole code
This is not a minimal example. I will have to look through your whole code.
You can ignore the whole file parsing code, this is just a part extracted from my project. The relevant parts are the k fold validation and the usage of tsfresh itself.
A KNeighborsClassifier on features does not really make sense.
Actually you can also ignore the k fold validation, since the problem is just about the extracted features, which are dependent on the ids.
I always see these features containing __index_. I am sure there must be an easy way to prevent this, since it is clearly wrong to use the order of the input time series as a feature (the index is nothing more than an ascending number assigned to each time series by my pandas dataframe creation code).
Either it is a major issue, or I just overlook something more obvious.
I will try to drop all feature columns that are based on the time series indices before calling select_features
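The column-dropping workaround mentioned above could look roughly like the sketch below. It only works on column names, so it runs without tsfresh or pandas; the helper name index_based_columns and the extra column value__mean are illustrative assumptions, not taken from the notebook in this thread.

```python
def index_based_columns(columns, marker="__index_"):
    """Return the feature names that contain the marker substring."""
    return [name for name in columns if marker in name]


# Illustrative column names; only the first two appear in this thread.
columns = [
    "value__index_mass_quantile__q_0.9",
    "value__index_mass_quantile__q_0.4",
    "value__mean",
]

to_drop = index_based_columns(columns)
kept = [name for name in columns if name not in to_drop]
print(kept)

# With a pandas DataFrame of extracted features, the same idea would be:
#   extracted_features = extracted_features.drop(
#       columns=index_based_columns(extracted_features.columns))
```

Dropping columns by name pattern only hides the symptom. It does not explain why the values track the order of the input.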
Do you have some useful example code on how to do multi class classification based on the features selected by tsfresh?
Even the presence of these index based features appears really weird. A classic machine learning pitfall of leaking unintended information to the extracted features.
My sample data set deliberately has only two target labels with exactly the same time series. Nevertheless, I can achieve 100% classification accuracy with knn if I put the input data in the right order, that is, so that all indices labeled as class 1 are lower than the indices labeled as class 2.
My comment about X_train, X_test = X.loc[train_index,], X.loc[test_index,] is wrong. The cross validation returns numerical indices, so iloc is the way to go.
The id is not leaking, so this is not a bug in tsfresh.
It is just the dynamics of your time series. (you can for example permute the values of your id column or add a constant to it without changing the values of the feature matrix). It is a strange coincidence but the feature matrix is calculated correctly
I will close this issue.
Do you have some useful example code on how to do multi class classification based on the features selected by tsfresh?
Maybe you look for some examples in sklearn. The extracted feature matrix has the right format to use them (see maybe http://scikit-learn.org/stable/modules/multiclass.html)
Even the presence of these index based features appears really weird. A classic machine learning pitfall of leaking unintended information to the extracted features.
Nope. There is a difference between id and index. No feature has access to the id.
#54 is an important issue, and I would be willing to contribute something if I get the hang of how this is supposed to work.
can you explain how this issue related to #54? you are using tsfresh for classification, right?
can you explain how this issue related to #54? you are using tsfresh for classification, right?
Yes, it may be related if the forecasting task is to predict one outcome out of fixed number of predefined outcomes.
Nope. There is a difference between id and index. No feature has access to the id. The id is not leaking, so this is not a bug in tsfresh.
It is just the dynamics of your time series. (you can for example permute the values of your id column or add a constant to it without changing the values of the feature matrix). It is a strange coincidence but the feature matrix is calculated correctly
Really strange, but I can confirm it after adding an offset of 100 to all time series id's... Thank you for your help, I will try to improve on that.
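The offset check described above can be reproduced without tsfresh. The sketch below rests on an assumption that this thread does not confirm: the table values are consistent with 17 samples per series and a feature computed from the running row label of the long-format frame instead of the position inside each series (16/17 ≈ 0.941176, 33/17 ≈ 1.941176). All function names and the constant-valued series are made up for illustration, and this is not tsfresh's implementation.

```python
def mass_quantile_position(values, q):
    """0-based position where the cumulative share of |values| first reaches q."""
    total = sum(abs(v) for v in values)
    running = 0.0
    for pos, v in enumerate(values):
        running += abs(v)
        if running / total >= q:
            return pos
    return len(values) - 1


def group_rows(rows):
    """Group (series_id, value) rows by id, remembering each row's global label."""
    groups = {}
    for row_label, (series_id, value) in enumerate(rows):
        groups.setdefault(series_id, []).append((row_label, value))
    return groups


def features_by_position(rows, q):
    # Intended behaviour: relative position inside each series.
    out = []
    for labelled in group_rows(rows).values():
        values = [v for _, v in labelled]
        out.append((mass_quantile_position(values, q) + 1) / len(values))
    return out


def features_by_row_label(rows, q):
    # Assumed faulty behaviour: uses the global row label instead.
    out = []
    for labelled in group_rows(rows).values():
        values = [v for _, v in labelled]
        row_label = labelled[mass_quantile_position(values, q)][0]
        out.append((row_label + 1) / len(values))
    return out


# Made-up data: three identical series with 17 samples each.
rows = [(sid, 1.0) for sid in range(3) for _ in range(17)]
shifted = [(sid + 100, v) for sid, v in rows]  # same rows, ids offset by 100

leaky = features_by_row_label(rows, 0.9)
leaky_shifted = features_by_row_label(shifted, 0.9)
positional = features_by_position(rows, 0.9)

print([round(f, 6) for f in leaky])       # [0.941176, 1.941176, 2.941176]
print(leaky_shifted == leaky)             # True: the id value itself is never used
print([round(f, 6) for f in positional])  # [0.941176, 0.941176, 0.941176]
```

Under this assumption both observations in the thread hold at once. The feature grows by one per series because of the row order, yet it is unaffected by permuting the id values or adding a constant to them.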
(There is a difference between id, which denotes the id of the device, and index, which is the index or time stamp of the time series.)
So the id gets assigned to each time stamp-value pair to mark time series (a time series is a sequence of time stamps and values in a pandas DataFrame marked by the same id, as in robot_execution_failures.py).
But what is the device? The label assigned to a time series which is used as ground truth for training and testing models?
Id denotes the different devices or samples, see http://tsfresh.readthedocs.io/en/v0.6.0/text/quick_start.html
So every sample (= device) has one class label and can be associated with multiple time series.
You can also have a look in our paper:
Christ, M., Kempa-Liehr, A.W. and Feindt, M. (2016). Distributed and parallel time series feature extraction for industrial big data applications. ArXiv e-print 1610.07717, https://arxiv.org/abs/1610.07717.
in section 2 we explain this setup
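As a rough illustration of that setup, the sketch below uses plain Python structures: one row per (id, time) pair, several value columns per row, and exactly one class label per id. The column names temperature and pressure, the label strings and all numbers are made up for this example.

```python
# Long format: the id marks the sample (device), time orders the rows,
# and every further column is one kind of time series for that sample.
rows = [
    {"id": 0, "time": 0, "temperature": 20.1, "pressure": 1.01},
    {"id": 0, "time": 1, "temperature": 20.4, "pressure": 1.02},
    {"id": 1, "time": 1, "temperature": 35.0, "pressure": 0.90},
    {"id": 1, "time": 0, "temperature": 34.2, "pressure": 0.91},
]

# One class label per sample id, not per row.
labels = {0: "ok", 1: "faulty"}


def series_for(rows, sample_id, column):
    """Collect one time series for one sample, ordered by time."""
    selected = [r for r in rows if r["id"] == sample_id]
    return [r[column] for r in sorted(selected, key=lambda r: r["time"])]


print(series_for(rows, 1, "temperature"))  # [34.2, 35.0]
print(series_for(rows, 0, "pressure"))     # [1.01, 1.02]
```

The row order in the list plays no role here, because each series is selected by id and ordered by time.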
This issue is probably related to #175
This is related to #175
version 0.7.0 should solve this one as well @fkirc
Thank you I think this has fixed the issue
I attempt to use tsfresh for a simple binary classification using a k-nearest-neighbor-classifier and k-fold-validation. However, the classification accuracy depends on the order of the input time series, which should not be relevant at all.
The underlying problem is the set of features selected by select_features: value__index_mass_quantile__q_0.8, value__index_mass_quantile__q_0.7, value__index_mass_quantile__q_0.2, value__index_mass_quantile__q_0.3 and so on. All of them are directly proportional to the id in the training data set. Now the k-nearest-neighbor classifier just has to decide whether these index "features" are above a certain threshold to make a correct classification.
I need to disable the consideration of the index for feature extraction. Using the index of the samples in my training data as input for feature extraction reduces my model to absurdity. All features should be only based on the time stamps and the associated values, but not on the order of the samples in my input data.
How can I disable this incorrect behavior?
The time_series data frame is constructed in the same way as the robots example:
The same goes for the target labels y: