blue-yonder / tsfresh

Automatic extraction of relevant features from time series:
http://tsfresh.readthedocs.io
MIT License

Feature extraction progress bar fills fast, but some part in the code takes a long time #703

Open TKlerx opened 4 years ago

TKlerx commented 4 years ago
  1. OS: Linux (16 cores)
  2. version: current master branch
  3. a rolled time series
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 627750 entries, 0 to 627749
    Data columns (total 9 columns):
    #   Column      Non-Null Count   Dtype  
    ---  ------      --------------   -----  
    0   ride_id     627750 non-null  uint8  
    1   tick        627750 non-null  uint16 
    2   imu_ax_FFT  627750 non-null  float32
    3   imu_ay_FFT  627750 non-null  float32
    4   imu_az_FFT  627750 non-null  float32
    5   imu_gx_FFT  627750 non-null  float32
    6   imu_gy_FFT  627750 non-null  float32
    7   imu_gz_FFT  627750 non-null  float32
    8   id          627750 non-null  object 
    dtypes: float32(6), object(1), uint16(1), uint8(1)
    memory usage: 68.2 MB

When I run the feature extraction on this rolled dataframe, it takes about 10 minutes until the progress bar is filled to 100% (using all 16 cores on the system). Then it takes a very long time (approx. 20-30 minutes, maybe even more) until the feature extraction completes, while utilizing just a single core.

If I read the code correctly, it must happen in this part of extraction.py (after _do_extraction_on_chunk):

    # Return a dataframe in the typical form (id as index and feature names as columns)
    result = pd.DataFrame(result)
    if "value" in result.columns:
        result["value"] = result["value"].astype(float)
    if len(result) != 0:
        result = result.pivot("id", "variable", "value")
        result.index = result.index.astype(df[column_id].dtype)

I did not do any profiling, but it must be the creation of the dataframe or the pivoting. Do you have any idea how to speed this up? If it is the pivot, could this be done on the workers?
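
For what it's worth, a quick check of this suspicion could look roughly like the sketch below (just an illustration on a small synthetic frame in the same long "id"/"variable"/"value" shape, not my real data):

    import time

    import pandas as pd

    # small synthetic long-format frame, same shape as the one extraction.py pivots
    # (1,000 ids x 50 features here, just to get a feeling for the two steps)
    records = [
        {"id": i, "variable": f"feature_{f}", "value": float(f)}
        for i in range(1000)
        for f in range(50)
    ]

    t0 = time.perf_counter()
    result = pd.DataFrame(records)
    t1 = time.perf_counter()
    wide = result.pivot(index="id", columns="variable", values="value")
    t2 = time.perf_counter()

    print(f"DataFrame construction: {t1 - t0:.3f}s, pivot: {t2 - t1:.3f}s")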

A second question (I don't want to open another issue for this): did you consider switching from native pandas to modin to parallelize e.g. concats and other pandas operations (https://github.com/modin-project/modin)? I know this would be a bigger change :)

nils-braun commented 4 years ago

These are all very valid points you raise. I think it is probably the pivoting. I do not know exactly how many ids you have, but I guess the resulting dataframe will have many columns. Pivoting is really slow - we have a super old issue to fix this (https://github.com/blue-yonder/tsfresh/issues/417), but so far no one has tackled it.

In principle we do not need to pivot at all - the data is already nicely partitioned. It should not be a problem to fix this. If you want, I can finally have a look into this (now that we know that it is a problem). If you already have a good idea, I am of course even more happy ;-) My idea would be: fill in mappings of "id -> data" for all the data in result without building a dataframe, then concat the data just once. That should be better for computation and maybe also for memory.
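
Very roughly, the idea would be something like this (just a sketch; the (id, variable, value) tuple shape is a simplification of what the chunk calculation actually returns):

    from collections import defaultdict

    import pandas as pd

    def combine_without_pivot(chunk_results):
        """Fill an 'id -> {feature: value}' mapping and build the wide frame once,
        instead of constructing a long frame and pivoting it."""
        data = defaultdict(dict)
        for chunk_id, variable, value in chunk_results:
            data[chunk_id][variable] = value
        # orient="index" puts the ids on the index and the feature names on the columns
        return pd.DataFrame.from_dict(data, orient="index")

    # toy example
    rows = [(1, "mean", 0.5), (1, "std", 0.1), (2, "mean", 0.7), (2, "std", 0.2)]
    print(combine_without_pivot(rows))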

Distributing the pivoting is unfortunately hard (not undoable, but hard) and it will involve a lot of shuffling (even the modin you quote has not implemented pivoting so far ;-)). How to distribute the work properly depends a lot on the data - something we do not know. But I think using custom logic instead of pivoting can speed this up, so we might not need that anymore.

modin looks like a very good project. Thanks for bringing it up - I did not know about it so far. However, this might be a more general question: so far we assumed (or have heard) that most of our "power users" already use dask or spark or the like. They will just use the "core logic" of tsfresh and build all the dataframe normalisation, result pivoting etc. with their distributed computation system, as they need to include that in their data pipeline anyway. Now I hear from more people who want to use tsfresh with pandas-like input and still scale out the feature extraction, so the problems are now the normalisation (see #702) or the result pivoting (see this issue). Our point was always: use pandas, as "default users" want pandas, whereas power users will do it manually anyway. But this decision was made 4 years ago :-D So it might be worth thinking about it again. What is your opinion @TKlerx, should we switch the input format to something more "non-default" such as modin, which would also mean that many users would need to convert their data before using it with tsfresh?

nils-braun commented 4 years ago

Just a short comment: it seems like the pivoting can be improved by a large factor. Still sorting out the details, but it was totally worth looking into this. Thanks again!

TKlerx commented 4 years ago

Cool, thanks for #705, I will look at it in the afternoon (most probably). I have approximately 300k ids from roll_time_series which are fed into extract_features.

Regarding modin: I wasn't sure either about switching to modin dataframes, because then either the input dataframes must already be modin dataframes, or tsfresh would have to convert the pandas dataframe to a modin dataframe for internal use. The question is whether the internal use of modin would bring a big speedup, or whether the conversion from pandas to modin would make the use with pandas much slower. I just wanted to mention modin as it seems quite handy and might make some things faster. On the other hand, it is another dependency on another library.

I will do some tests with modin to see how fast it is when we first convert a pandas dataframe to modin and then do e.g. a concat operation.
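
Such a test could look roughly like this (just a sketch; it assumes modin is installed with one of its engines, e.g. ray or dask, and the toy frames are of course much smaller than my real data):

    import time

    import numpy as np
    import pandas as pd
    import modin.pandas as mpd  # assumes modin[ray] or modin[dask] is installed

    # plain pandas frames as a starting point
    frames = [pd.DataFrame(np.random.rand(10_000, 8)) for _ in range(100)]

    t0 = time.perf_counter()
    pd.concat(frames, ignore_index=True)
    t1 = time.perf_counter()

    # conversion pandas -> modin plus the same concat
    modin_frames = [mpd.DataFrame(f) for f in frames]
    t2 = time.perf_counter()
    mpd.concat(modin_frames, ignore_index=True)
    t3 = time.perf_counter()

    print(f"pandas concat: {t1 - t0:.3f}s")
    print(f"pandas -> modin conversion: {t2 - t1:.3f}s, modin concat: {t3 - t2:.3f}s")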

nils-braun commented 4 years ago

#705 removed a bit of the computational effort (~half) from this part - but the issue is still not solved.

Additional open things to do:

  • we sort the index of the result for backward compatibility. One could introduce a new option to turn that off if you really need speed (and ordering is not important to you - although many index calculations work best with that).
  • currently we create a pandas dataframe out of a dict of dicts of numbers. Maybe it is faster to create numpy arrays or even pandas objects right away? However, one needs to make sure that (a) we do not lose the id information and (b) creating the array is not slower than using it.
  • if imputing is turned on, it could be preferable to delete columns already before creating the dataframe - although that might imply a lot of bookkeeping (and not all impute functions just remove columns).

TKlerx commented 4 years ago

Thanks a lot @nils-braun. I am really impressed by the effort you put in and the help. I identified (this may be trivial to you) that the main issue arises from the many ids (300k) I create with the roll_time_series method. As a chunk is created for every id and feature, this results in a lot of very small chunks (size 30 in my case) which have to be put together in the end (from the mentioned dict of dicts of numbers). I tried a different chunking manually, but don't have any results yet.

Regarding your comments:

#705 removed a bit of the computational effort (~half) from this part - but the issue is still not solved.

Yeah, thanks a lot. I already tested it 👍

Additional open things to do:

  • we sort the index of the result for backward compatibility. One could introduce a new option to turn that off if you really need speed (and ordering is not important to you - although many index calculations work best with that).

That would be great. I do not care about the order as I want to feed the result into a machine learning model with every row being one training sample.

  • currently we create a pandas dataframe out of a dict of dicts of numbers. Maybe it is faster to create numpy arrays or even pandas objects right away? However, one needs to make sure that (a) we do not lose the id information and (b) creating the array is not slower than using it.

Yeah, I guess this dict of dicts makes things very slow. Concerning (a), I don't know what the best solution would be. Concerning (b), we could create the array or pandas dataframe already on the worker (so it scales with the cores) - see the rough sketch after these points.

  • if imputing is turned on, it could be preferable to delete columns already before creating the dataframe - although that might imply a lot of bookkeeping (and not all impute functions just remove columns).

I do not use impute at all (at the moment), so I don't know 🤷‍♂️
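
To make point (b) a bit more concrete, here is a rough sketch of what I mean - just an illustration with a made-up worker_extract helper and a toy tuple format, not how tsfresh's workers are actually structured:

    from multiprocessing import Pool

    import pandas as pd

    def worker_extract(chunk):
        # hypothetical worker: turn one chunk's (id, {feature: value}) result
        # into a tiny one-row dataframe right away instead of returning dicts
        chunk_id, feature_values = chunk
        return pd.DataFrame(feature_values, index=[chunk_id])

    if __name__ == "__main__":
        # toy chunks standing in for the per-id feature results
        chunks = [(i, {"mean": float(i), "std": 0.1 * i}) for i in range(1000)]
        with Pool(4) as pool:
            partial_frames = pool.map(worker_extract, chunks)
        # the main process then only has to concatenate once
        result = pd.concat(partial_frames)
        print(result.head())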

TKlerx commented 4 years ago

I wrote an ugly workaround which (at least for my case) consumes way less memory and is way faster. This requires a bit more explanation. I wanted to compute features over a sliding window of 30 time ticks on a very long time series (approx. 300k entries), i.e. extract features for each of those 30-tick windows.

My initial code:

    from tsfresh import extract_features
    from tsfresh.utilities.dataframe_functions import roll_time_series

    def compute_tsfresh_rolling_features_old(df, column_id="ride_id", column_sort="tick", window_size=30):
        # roll into overlapping windows of exactly `window_size` ticks, then extract in one call
        df_rolled = roll_time_series(df, column_id=column_id, column_sort=column_sort,
                                     min_timeshift=window_size-1, max_timeshift=window_size-1)
        df_features = extract_features(df_rolled, column_id="id", column_sort=column_sort)
        return df_features

This took a very long time and went out of memory (even with 56GB; 112GB were enough, though).

Now the dirty workaround (I also have a parallel version, left out as it is not needed for understanding; an illustrative sketch of how such a variant could look follows after the code below). Computing the rolling dataframe is the same as before. Then I follow some sort of divide & conquer approach: compute the tsfresh features for every id in a separate call and concat everything in the end. Doing so does not seem to be slower (maybe even faster; I have to investigate further), but it runs even with 30GB RAM on my notebook. I guess this is because there are not so many dicts of dicts in memory at the same time (which may require a lot of memory).

    import pandas as pd

    from tsfresh import extract_features
    from tsfresh.utilities.dataframe_functions import roll_time_series

    def compute_tsfresh_rolling_features_sequential(df, column_id="ride_id", column_sort="tick", window_size=30):
        df_rolled = roll_time_series(df, column_id=column_id, column_sort=column_sort,
                                     min_timeshift=window_size-1, max_timeshift=window_size-1)
        # extract the features separately for every rolled id instead of all at once
        groups = df_rolled.groupby('id', sort=False)
        dfs = []
        for name, df_group in groups:
            dfs.append(group_function(df_group))
        # concatenate the small per-id results into one feature frame
        df_res = pd.concat(dfs, ignore_index=True, sort=False)
        return df_res

    def group_function(group):
        df_features = extract_features(group, column_id="id", column_sort="tick").reset_index()
        return df_features
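
For illustration, a parallel variant could look roughly like this (only a sketch, not my actual parallel version; it passes n_jobs=0 and disable_progressbar=True to extract_features to keep each per-group call single-threaded and quiet, since we parallelise over the groups instead):

    from multiprocessing import Pool

    import pandas as pd
    from tsfresh import extract_features

    def _features_for_group(df_group):
        # one single-threaded extract_features call per rolled id (window)
        return extract_features(df_group, column_id="id", column_sort="tick",
                                n_jobs=0, disable_progressbar=True).reset_index()

    def compute_tsfresh_rolling_features_parallel(df_rolled, processes=16):
        groups = [group for _, group in df_rolled.groupby("id", sort=False)]
        with Pool(processes) as pool:
            dfs = pool.map(_features_for_group, groups)
        return pd.concat(dfs, ignore_index=True, sort=False)
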
nils-braun commented 4 years ago

Very interesting! Thanks for the investigation.

I am just wondering why this is so different from before. Because the groups = df_rolled.groupby('id', sort=False) is also done internally (more or less), and even when the data is written out as a dataframe and concatenated, it still needs to be held in memory. So I guess (as you mentioned) it is really the dict of dicts which seems to be a lot worse for memory... Thinking about it, there could be another thing hiding here: when all data is processed at once, we need to keep the results (in a list of tuples), the input data as well as the current calculation in memory. That might be too much (or did it explicitly fail after the feature calculation had finished and the progress bar was done?)

That is definitely something to continue studying.

Just a side note: if you want to automate this "prepare data once, then do feature extraction on each chunk, then merge it" pattern, I would suggest using a workflow automation tool such as luigi. If you need an impression of how it could look, I have written a small note about this at the very end of this.
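
Very roughly, such a luigi pipeline could look like the sketch below (the file names, the parquet layout and the chunking scheme are all made up for the illustration, and it is not taken from the note mentioned above):

    import luigi
    import pandas as pd

    from tsfresh import extract_features

    class ExtractChunk(luigi.Task):
        """Extract features for one pre-rolled chunk stored on disk."""
        chunk_id = luigi.IntParameter()

        def output(self):
            return luigi.LocalTarget(f"features/chunk_{self.chunk_id}.parquet")

        def run(self):
            chunk = pd.read_parquet(f"rolled/chunk_{self.chunk_id}.parquet")
            features = extract_features(chunk, column_id="id", column_sort="tick")
            features.to_parquet(self.output().path)

    class MergeFeatures(luigi.Task):
        """Concatenate all per-chunk feature frames into one result."""
        n_chunks = luigi.IntParameter()

        def requires(self):
            return [ExtractChunk(chunk_id=i) for i in range(self.n_chunks)]

        def output(self):
            return luigi.LocalTarget("features/all_features.parquet")

        def run(self):
            parts = [pd.read_parquet(target.path) for target in self.input()]
            pd.concat(parts).to_parquet(self.output().path)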

TKlerx commented 4 years ago

Very interesting! Thanks for the investigation.

I am just wondering why this is so different from before. Because the groups = df_rolled.groupby('id', sort=False) is also done internally (more or less), and even when the data is written out as a dataframe and concatenated, it still needs to be held in memory. So I guess (as you mentioned) it is really the dict of dicts which seems to be a lot worse for memory... Thinking about it, there could be another thing hiding here: when all data is processed at once, we need to keep the results (in a list of tuples), the input data as well as the current calculation in memory. That might be too much (or did it explicitly fail after the feature calculation had finished and the progress bar was done?)

That is definitely something to continue studying.

That's a good question about the progress bar. Cannot tell at the moment, will investigate (hopefully) tomorrow.

Just a side note: if you want to automate this "prepare data once, then do feature extraction on each chunk, then merge it" pattern, I would suggest using a workflow automation tool such as luigi. If you need an impression of how it could look, I have written a small note about this at the very end of this.

A long time ago I had a look at luigi. Will have a look at it again. Thanks for the hint.

nils-braun commented 4 years ago

By the way: thank you for investigating so deeply and trying to solve the problem! Nice to see that people really want to push the boundaries!

TKlerx commented 4 years ago

So I checked the behavior with RAM and the progress bar. The progress bar for the feature extraction completes. With the old pivoting, it crashed with OOM (56GB RAM): [screenshot]

With the new logic, it does not crash, but seems to swap a bit (32 GB RAM).

For my workaround, I only tried the new version with the improved pivoting (which does not crash, either; 32 GB RAM).

So I still need to compare my workaround with the new pivoting. But the new pivoting is already way better than the old one. 👍

TKlerx commented 4 years ago

A small update on this: I was extracting a lot of features on different files, but with the same kind of data. At some point the new pivoting went OOM, so I went back to my workaround, which was still able to compute the features (same PC, same RAM, etc.).

EDIT: I will try to replace the dicts with something more memory-friendly. But it will take some time as I am quite busy at the moment.

beyondguo commented 1 day ago
  1. OS: Linux (16 cores)
  2. version: current master branch
  3. a rolled time series
<class 'pandas.core.frame.DataFrame'>
Int64Index: 627750 entries, 0 to 627749
Data columns (total 9 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   ride_id     627750 non-null  uint8  
 1   tick        627750 non-null  uint16 
 2   imu_ax_FFT  627750 non-null  float32
 3   imu_ay_FFT  627750 non-null  float32
 4   imu_az_FFT  627750 non-null  float32
 5   imu_gx_FFT  627750 non-null  float32
 6   imu_gy_FFT  627750 non-null  float32
 7   imu_gz_FFT  627750 non-null  float32
 8   id          627750 non-null  object 
dtypes: float32(6), object(1), uint16(1), uint8(1)
memory usage: 68.2 MB

When I run the feature extraction on this rolled dataframe, it takes about 10 minutes until the progress bar is filled to 100% (using all 16 cores on the system). Then it takes a very long time (approx. 20-30 minutes, maybe even more) until the feature extraction completes, while utilizing just a single core.


I encountered a similar problem, but my program is stuck before the feature extraction progress bar even starts: [screenshot]

As you can see, the Rolling step finished 30 minutes ago, but the Feature Extraction step hasn't started. What could be the problem? Also, I noticed that only one core is working during this period.