blue-yonder / tsfresh

Automatic extraction of relevant features from time series:
http://tsfresh.readthedocs.io
MIT License

Function: extract_relevant_features: throws AssertionError: X and y must contain the same number of samples. #945

Open lthiess8 opened 2 years ago

lthiess8 commented 2 years ago

Hi, I get an assertion error when using the function extract_relevant_features(). When I print len(X) and len(y), I get the same values.

Thanks in advance!

36965
36965
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-5-59dfec12df74> in <module>
     24     print(len(df))
     25     print(len(target))
---> 26     extracted_relevant_features = extract_relevant_features(df, target, column_id='abgang', column_sort='time',  column_value = 'values', default_fc_parameters=EfficientFCParameters(), ml_task='classification')
     27     extracted_features = extract_features(df, column_id='abgang', column_sort='time',  column_value = 'values', default_fc_parameters=EfficientFCParameters(),n_jobs=8, disable_progressbar=True)
     28 

/anaconda/envs/azureml_py38/lib/python3.8/site-packages/tsfresh/convenience/relevant_extraction.py in extract_relevant_features(timeseries_container, y, X, default_fc_parameters, kind_to_fc_parameters, column_id, column_sort, column_kind, column_value, show_warnings, disable_progressbar, profile, profiling_filename, profiling_sorting, test_for_binary_target_binary_feature, test_for_binary_target_real_feature, test_for_real_target_binary_feature, test_for_real_target_real_feature, fdr_level, hypotheses_independent, n_jobs, distributor, chunksize, ml_task)
    198     )
    199 
--> 200     X_sel = select_features(
    201         X_ext,
    202         y,

/anaconda/envs/azureml_py38/lib/python3.8/site-packages/tsfresh/feature_selection/selection.py in select_features(X, y, test_for_binary_target_binary_feature, test_for_binary_target_real_feature, test_for_real_target_binary_feature, test_for_real_target_real_feature, fdr_level, hypotheses_independent, n_jobs, show_warnings, chunksize, ml_task, multiclass, n_significant)
    152     )
    153     assert len(y) > 1, "y must contain at least two samples."
--> 154     assert len(X) == len(y), "X and y must contain the same number of samples."
    155     assert (
    156         len(set(y)) > 1

AssertionError: X and y must contain the same number of samples.

CelieDs commented 2 years ago

Hello! I encountered the same issue. Did you manage to find a solution? Thanks in advance.

lthiess8 commented 2 years ago

Hello @CelieDs,

For some reason, the indices of X and y did not match. This notebook helped me find the solution: https://github.com/blue-yonder/tsfresh/blob/main/notebooks/advanced/05%20Timeseries%20Forecasting%20(multiple%20ids).ipynb

When I changed the code to the following, it worked for me:

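from tsfresh import select_features

# Index the labels by time and sort them.
target = df_melted.set_index("time").sort_index().label

# Keep only the entries that appear in both the target and the feature matrix.
target = target[target.index.isin(extracted_features.index)]
extracted_features = extracted_features[extracted_features.index.isin(target.index)]

# Run the feature selection on the aligned pair.
features_selected = select_features(extracted_features, target, ml_task='classification')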
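
For anyone else landing here, the following is a minimal sketch of the same idea using only pandas, so it runs without tsfresh. The data, the groupby stand-in for the extracted feature matrix, and the names features, common, features_aligned and target_aligned are all made up for illustration. The point it tries to show: the number of rows in the long-format frame is not the number of samples, and the target has to carry the same index as the feature matrix (one entry per id).

```python
import pandas as pd

# Hypothetical long-format input: several rows per id.
df = pd.DataFrame({
    "abgang": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "time":   [0, 1, 2, 0, 1, 2, 0, 1, 2],
    "values": [0.1, 0.2, 0.3, 1.0, 1.1, 1.2, 5.0, 5.1, 5.2],
})

# Stand-in for the extracted feature matrix: one row per id, indexed by id.
# (Assumption: a plain groupby replaces the real feature extraction here.)
features = df.groupby("abgang")["values"].agg(["mean", "max"])

# Hypothetical target: one label per id, but id 3 is missing
# and ids 4 and 5 have no time series at all.
target = pd.Series({1: 0, 2: 1, 4: 1, 5: 0}, name="label")

# Rows in the long frame, number of ids, rows in features, entries in target.
print(len(df), df["abgang"].nunique(), len(features), len(target))

# Keep only the ids present on both sides, in the same order.
common = features.index.intersection(target.index)
features_aligned = features.loc[common]
target_aligned = target.loc[common]

# After alignment both objects have the same length and the same index.
assert len(features_aligned) == len(target_aligned)
assert features_aligned.index.equals(target_aligned.index)
```

Comparing len(df) with len(target), as in the original post, can therefore show equal numbers even though the feature matrix built from df has a different number of rows than the target.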