blue-yonder / tsfresh

Automatic extraction of relevant features from time series:
http://tsfresh.readthedocs.io
MIT License

Problem with the function extract_features #1058

Open AlessioBolpagni98 opened 10 months ago

AlessioBolpagni98 commented 10 months ago

The problem: I have a script that runs every day, and in this script I use the tsfresh function extract_features(). Sometimes the script gets stuck inside the function, with the progress bar frozen at a certain percentage. The function doesn't raise any exception and the code stays blocked.
Packages (1).txt

nils-braun commented 10 months ago

Hi @AlessioBolpagni98! Is this deterministic, meaning: does it always get stuck with the same data? Do you see a certain pattern in which data it gets stuck? And which feature calculators are you using?

sidneyzhu commented 4 months ago

I encountered a similar issue. My raw dataframe has 1k ids, 27k rows, and 140 features. Full feature extraction completes fine with MultiprocessingDistributor(n_workers=12) on a 64 GB machine within 30 minutes, but it always hangs with ClusterDaskDistributor on 4 nodes with 64 GB workers, and I noticed that it hangs in the result-gathering step. After about 4 hours the extract_features job gets killed out of memory. My environment: python 3.10.12, tsfresh 0.20.2, dask 2024.7.0, pandas 2.2.2, OS: Ubuntu 22.04.1 LTS (Jammy Jellyfish)
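
For reference, a minimal sketch of the two setups being compared; the input path, column names, and scheduler address below are placeholders, not taken from the report:

import pandas as pd
from tsfresh import extract_features
from tsfresh.utilities.distribution import (
    ClusterDaskDistributor,
    MultiprocessingDistributor,
)

df = pd.read_parquet("timeseries.parquet")  # long dataframe: id, time, value columns

# Local multiprocessing: completes in about 30 minutes on a 64 GB machine
local = MultiprocessingDistributor(n_workers=12)
features = extract_features(df, column_id="id", column_sort="time", distributor=local)

# Dask cluster: hangs in the result-gathering step, later killed out of memory
remote = ClusterDaskDistributor(address="tcp://scheduler:8786")  # placeholder address
features = extract_features(df, column_id="id", column_sort="time", distributor=remote)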

sidneyzhu commented 4 months ago

@AlessioBolpagni98 have you fixed this issue?

AlessioBolpagni98 commented 4 months ago

My problem was that I was using the extract_features() function in an improper way: I was passing the same column for both the 'column_id' and 'column_sort' parameters.

This was my problematic function:

from tsfresh import extract_features
from tsfresh.utilities.dataframe_functions import impute


def get_features(df_BTC):
    """Extract features using tsfresh; return a dataframe with the features."""
    df_BTC = df_BTC.reset_index(drop=False)
    params = {
        "timeseries_container": df_BTC,
        "column_sort": "Date",
        "column_id": "Date",  # same column for id and sort: this was the bug
    }

    extracted_features = extract_features(**params)
    impute(extracted_features)  # inplace

    cols_zero = []  # remove features with zero standard deviation
    for col in extracted_features.columns:
        if extracted_features[col].std() == 0:
            cols_zero.append(col)
    extracted_features_pulito = extracted_features.drop(columns=cols_zero)
    extracted_features_pulito["Date"] = df_BTC["Date"]

    return extracted_features_pulito

AlessioBolpagni98 commented 4 months ago

To solve this, in my case all the rows must have the same ID, so I created an ID 'A' for all the rows.
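
With the same column used for column_id and column_sort, every row gets a unique id, so tsfresh treats each row as a separate time series of length 1 and schedules one chunk per row. A minimal sketch of the fix, assuming the dataframe from the snippet above:

from tsfresh import extract_features

df_BTC = df_BTC.reset_index(drop=False)
df_BTC["id"] = "A"  # constant id: the whole frame is one single time series

extracted_features = extract_features(
    df_BTC,
    column_id="id",      # constant id column instead of Date
    column_sort="Date",  # Date is now only used for ordering
)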

sidneyzhu commented 4 months ago

Thanks for your reply. In my case the code finishes with multiprocessing (n_jobs=8) in about 30 minutes, but it can't finish with 8 clustered workers on different machines.

sidneyzhu commented 4 months ago

I fixed it by switching to dask_feature_extraction_on_chunk(); the ClusterDaskDistributor still failed with a lot of communication errors.
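
For completeness, a sketch of the chunk-based path that worked here, following the pattern from the tsfresh documentation on large input data; the file path and column names are assumptions:

import dask.dataframe as dd
from tsfresh.convenience.bindings import dask_feature_extraction_on_chunk
from tsfresh.feature_extraction.settings import EfficientFCParameters

# Data in long ("molten") format: one row per (id, kind, time, value)
df = dd.read_parquet("timeseries_long.parquet")
df_grouped = df.groupby(["id", "kind"])

features = dask_feature_extraction_on_chunk(
    df_grouped,
    column_id="id",
    column_kind="kind",
    column_sort="time",
    column_value="value",
    default_fc_parameters=EfficientFCParameters(),
)

# The result is still a distributed dask dataframe in long format
# (columns: id, variable, value); pivot to one row per id:
features = features.categorize(columns=["variable"])
result = features.pivot_table(index="id", columns="variable", values="value")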