Open AlessioBolpagni98 opened 10 months ago
Hi @AlessioBolpagni98 ! Is this deterministic, i.e. does it always get stuck with the same data? Do you see a pattern in which data it gets stuck on? And which feature calculators are you using?
I encountered a similar issue. My raw dataframe has 1k IDs, 27k rows, and 140 features. A full feature extraction with MultiprocessingDistributor(n_workers=12) finishes within 30 minutes on a 64 GB machine, but it always hangs with ClusterDaskDistributor on 4 nodes with 64 GB workers, and I noticed that it hangs in the result-gathering step. After about 4 hours, the extract_features job is killed out of memory. My environment: Python 3.10.12, tsfresh 0.20.2, dask 2024.7.0, pandas 2.2.2, OS Ubuntu 22.04.1 LTS (Jammy Jellyfish).
@AlessioBolpagni98 have you fixed this issue?
My problem was that I was using the extract_features() function improperly: I was passing the same column for both the 'column_id' and 'column_sort' parameters.
This was my problematic function:
```python
from tsfresh import extract_features
from tsfresh.utilities.dataframe_functions import impute


def get_features(df_BTC):
    """Extract features using tsfresh; return a dataframe with the features."""
    df_BTC = df_BTC.reset_index(drop=False)
    params = {
        "timeseries_container": df_BTC,
        "column_sort": "Date",
        "column_id": "Date",  # BUG: same column used as both id and sort
    }
    extracted_features = extract_features(**params)
    impute(extracted_features)  # in place
    # Drop the features with zero standard deviation
    cols_zero = []
    for col in extracted_features.columns:
        if extracted_features[col].std() == 0:
            cols_zero.append(col)
    extracted_features_pulito = extracted_features.drop(columns=cols_zero)
    extracted_features_pulito["Date"] = df_BTC["Date"]
    return extracted_features_pulito
```
To solve this, in my case all the rows must have the same ID, so I created an ID 'A' for all the rows.
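The fix described above can be sketched as follows. This is a minimal example, not the author's actual script: the dataframe contents and the column name "id" are illustrative, and the extract_features() call itself is left commented so the sketch only shows how the parameters should be wired up.

```python
import pandas as pd

# Hypothetical single-asset price series (stands in for df_BTC above).
df = pd.DataFrame({
    "Date": pd.date_range("2024-01-01", periods=5, freq="D"),
    "Close": [42000.0, 42100.0, 41900.0, 42300.0, 42500.0],
})

# extract_features() needs column_id and column_sort to be DIFFERENT columns:
# column_id groups rows into separate time series, while column_sort orders
# rows inside each series. Reusing "Date" for both makes every row its own
# one-point series. With a single series, give every row the same ID instead.
df["id"] = "A"

params = {
    "column_id": "id",      # one group containing all rows
    "column_sort": "Date",  # chronological order within the group
}
# extracted = extract_features(df, **params)  # from tsfresh
```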
Thanks for your reply. In my case the code finishes with multiprocessing (n_jobs=8) in about 30 minutes, but it can't finish on a cluster of 8 workers on different machines.
I fixed it by switching to dask_feature_extraction_on_chunk(); the ClusterDaskDistributor still failed with a lot of communication errors.
The problem: I have a script that runs every day, and in this script I use the tsfresh function extract_features(). Sometimes the script remains stuck in this function, with the progress bar blocked at a certain percentage. The function doesn't raise any exception and the code stays blocked.
Packages (1).txt