TKlerx opened this issue 4 years ago
These are all very valid points you raise. I think it is probably the pivoting. I do not know exactly how many ids you have, but I guess the resulting dataframe will have many columns. Pivoting is really slow - we have a super old issue to fix this (https://github.com/blue-yonder/tsfresh/issues/417), but so far no one has tackled it.
In principle we do not need to pivot at all - the data is already nicely partitioned, so it should not be a problem to fix this. If you want, I can finally have a look into this (now that we know it is a problem). If you already have a good idea, I am of course even more happy ;-) My idea would be: fill in mappings of "id -> data" for all the data in the result without building a dataframe, then build the final dataframe only once. That should be better for computation and maybe also for memory.
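Roughly what I have in mind, as a sketch only - I am assuming here that the intermediate results arrive as `(id, feature_name, value)` tuples, which may not match the internal layout exactly:

```python
import pandas as pd

def assemble_without_pivot(results):
    """Collect (id, feature_name, value) tuples into per-id rows and
    build the final dataframe in one go, without any pivoting."""
    # fill in mappings of "id -> {feature_name: value}" without creating
    # an intermediate dataframe per chunk
    rows = {}
    for chunk_id, feature_name, value in results:
        rows.setdefault(chunk_id, {})[feature_name] = value

    # build the dataframe exactly once from the collected rows
    return pd.DataFrame.from_dict(rows, orient="index")
```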
Distributing the pivoting is hard, unfortunately (not undoable, but hard), and it will involve a lot of shuffling (even the modin you quote has not implemented pivoting so far ;-)). How to distribute the work properly depends a lot on the data - something we do not know. But I think using custom logic instead of pivoting can speed this up - so we might not need the pivot anymore.
`modin` looks like a very good project. Thanks for bringing this up - I did not know about it so far.
However, this might be a more general question:
So far we assumed (or have heard) that most of our "power users" already use dask or spark or the like. They will just use the "core logic" of tsfresh and build all the dataframe normalisation, result pivoting etc. with their distributed computation system, as they need to include that in their data pipeline anyway.
Now I hear from more people who want to use tsfresh with pandas-like input and still scale out the feature extraction, so the problems are now the normalisation (see #702) or the result pivoting (see this issue). Our point was always: use pandas, as "default users" want pandas, whereas power users will do it manually anyway. But this decision was made 4 years ago :-D
So it might be worth thinking about it again.
What is your opinion @TKlerx, should we switch the input format to something more "non-default" such as modin, which would also mean that many users would need to convert their data before using it with tsfresh?
Just a short comment: it seems like the pivoting can be improved by a large factor. Still sorting out the details, but it was totally worth looking into this. Thanks again!
Cool, thanks for #705, will look at it in the afternoon (most probably). I have approximately 300k ids from `rolling_time_series` which are fed into `extract_features`.
Regarding `modin`:
I wasn't sure either whether to switch to `modin` dataframes, because then the input dataframes would have to be `modin` dataframes, or tsfresh would have to transform the pandas dataframe into a `modin` dataframe for internal use.
The question is whether the internal use of `modin` would bring a big speed-up, or whether a possible transformation from pandas to `modin` would make the use with pandas way slower.
I just wanted to mention `modin` as it seems quite handy and might make some things faster. On the other hand, it is another dependency on another library.
I will do some tests with `modin` to see how fast it is when we first transform a pandas dataframe to `modin` and then do e.g. a `concat` operation.
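Roughly the kind of micro-benchmark I have in mind (just a sketch, assuming `modin` is installed with a ray or dask engine; the conversion cost is included in the measurement on purpose):

```python
import time

import pandas as pd
import modin.pandas as mpd  # e.g. `pip install "modin[ray]"`

def time_concat(dfs):
    """Compare a plain pandas concat against converting to modin first."""
    start = time.perf_counter()
    pd.concat(dfs, ignore_index=True)
    pandas_time = time.perf_counter() - start

    start = time.perf_counter()
    # include the pandas -> modin conversion in the measured time
    mpd.concat([mpd.DataFrame(df) for df in dfs], ignore_index=True)
    modin_time = time.perf_counter() - start

    return pandas_time, modin_time
```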
Thanks a lot @nils-braun. I am really impressed by the effort you put in and by the help.
I identified (this may be trivial to you) that the main issue arises from the many ids (300k) I create with the `roll_time_series` method. As a chunk is created for every id and feature, this results in a lot of very small chunks (size 30 in my case) which have to be put together in the end (from the mentioned dict of dicts of numbers).
I tried a different chunking manually, but I don't have any results yet.
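One way to experiment with the chunking without touching the data itself would be the `chunksize` argument of `extract_features`; the value below is only a placeholder and needs tuning for the data at hand:

```python
from tsfresh import extract_features

df_features = extract_features(
    df_rolled,
    column_id="id",
    column_sort="tick",
    # bundle more of the tiny 30-row series into one work package per worker;
    # by default tsfresh guesses a chunksize, which may not suit 300k small ids
    chunksize=1000,
)
```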
Regarding your comments:
> #705 removed a bit of computational effort (~half) from this part - but the issue is still not solved.
Yeah, thanks a lot. I already tested it 👍
> Additional open things to do:
> - we sort the index of the result for backward compatibility. One could introduce a new option to turn that off if you really need fast speed (and ordering is not important to you - although many index calculations work best with that).
That would be great. I do not care about the order as I want to feed the result into a machine learning model with every row being one training sample.
> - Currently we create a pandas dataframe out of a dict of dicts of numbers. Maybe it is faster to create numpy arrays or even pandas objects right away? However, one needs to make sure that (a) we do not lose the id information and (b) creating the array is not slower than using it
Yeah, I guess this dict of dicts makes things very slow. Concerning (a), I don't know what the best solution would be. Concerning (b), we could create the array or pandas dataframe already on the worker (so it scales with the cores) - see the sketch after this list.
> - if imputing is turned on, it could be preferable to apply the deletion of columns already before creating the dataframe - although that might imply a lot of bookkeeping (and not all impute functions just remove columns)
I do not use impute at all (at the moment), so I don't know 🤷‍♂️
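To make (b) a bit more concrete, a rough sketch - again assuming the per-chunk results are `(id, feature_name, value)` tuples, which may not be the exact internal format:

```python
import pandas as pd

def chunk_to_series(chunk_results):
    """Build a pandas Series for one id already on the worker.

    chunk_results: list of (chunk_id, feature_name, value) tuples of a single id.
    """
    chunk_id = chunk_results[0][0]
    data = {feature_name: value for _, feature_name, value in chunk_results}
    # the id survives as the series name, so no information is lost (point (a))
    return pd.Series(data, name=chunk_id)

# back on the main process, a single concat plus a transpose gives the final
# dataframe with one row per id and one column per feature:
# df_features = pd.concat(list_of_series, axis=1).T
```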
I wrote an ugly workaround which (at least for my case) consumes way less memory and is way faster. This requires a bit more explanation: I wanted to extract features from a sliding window of 30 time ticks over a very long time series (approx. 300k entries), i.e. compute the features for each of those 30-tick windows.
My initial code:
```python
from tsfresh import extract_features
from tsfresh.utilities.dataframe_functions import roll_time_series

def compute_tsfresh_rolling_features_old(df, column_id="ride_id", column_sort="tick", window_size=30):
    # roll the long series into one sub-series per 30-tick window, then extract all features in one call
    df_rolled = roll_time_series(df, column_id=column_id, column_sort=column_sort,
                                 min_timeshift=window_size - 1, max_timeshift=window_size - 1)
    df_features = extract_features(df_rolled, column_id="id", column_sort=column_sort)
    return df_features
```
This took very long and ran out of memory (even with 56 GB; 112 GB were enough, though).
Now the dirty workaround (I also have a parallel version, left out as it is not needed for understanding). Computing the rolling dataframe is the same as before. Then I follow some sort of divide & conquer approach: compute the tsfresh features for every id in a separate call and concat everything in the end. Doing so does not seem to be slower (maybe even faster; I have to investigate further), but it runs even with 30 GB RAM on my notebook. I guess this is because there are not so many dicts of dicts in memory at the same time (which may require a lot of memory).
```python
import pandas as pd

from tsfresh import extract_features
from tsfresh.utilities.dataframe_functions import roll_time_series

def compute_tsfresh_rolling_features_sequential(df, column_id="ride_id", column_sort="tick", window_size=30):
    # rolling is the same as before
    df_rolled = roll_time_series(df, column_id=column_id, column_sort=column_sort,
                                 min_timeshift=window_size - 1, max_timeshift=window_size - 1)
    # divide & conquer: extract the features for every id separately ...
    groups = df_rolled.groupby('id', sort=False)
    dfs = []
    for name, df_group in groups:
        dfs.append(group_function(df_group))
    # ... and concat everything only once in the end
    df_res = pd.concat(dfs, ignore_index=True, sort=False)
    return df_res

def group_function(group):
    df_features = extract_features(group, column_id="id", column_sort="tick").reset_index()
    return df_features
```
Very interesting! Thanks for the investigation.
I am just wondering why this is so different from now, because the `groups = df_rolled.groupby('id', sort=False)` is also done internally (more or less), and even when the data is written out as a df and concatenated - it still needs to be held in memory. So I guess (as you mentioned), it is really the dict of dicts, which seems to be a lot worse in memory...
Thinking about it, there could be another thing hiding here: when all data is processed at once, we need to keep the results (in a list of tuples), the input data as well as the current calculation in memory. That might be too much (or did it explicitly fail after the feature calculation had finished and the progress bar was done?)
That is definitely something to continue studying.
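If you want to check this, a quick-and-rough way would be something like the following (assuming `psutil` is available; note that the peak only covers the main process, not the worker processes):

```python
import resource

import psutil
from tsfresh import extract_features

def rss_gb():
    """Current resident set size of this process in GB."""
    return psutil.Process().memory_info().rss / 1024 ** 3

print(f"before extraction: {rss_gb():.1f} GB")
# df_rolled as in the snippets above
df_features = extract_features(df_rolled, column_id="id", column_sort="tick")
print(f"after extraction:  {rss_gb():.1f} GB")
# peak of the main process since start (Linux reports this in kilobytes)
peak_gb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024 ** 2
print(f"peak (main process): {peak_gb:.1f} GB")
```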
Just a side note: if you want to automate this "prepare data once, then do feature extraction on each chunk, then merge it" pattern, I would suggest using a workflow automation tool such as `luigi`. If you need some impression of how it could look, I have written a small note at the very end of this about it.
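For illustration only, a very rough sketch of how such a pipeline could look with `luigi` (the `read_group_for_id` helper and the file layout are placeholders for however you load and store the rolled data of one id):

```python
import luigi
import pandas as pd
from tsfresh import extract_features


class FeaturesForId(luigi.Task):
    """Extract the features for a single rolled id and write them to disk."""
    window_id = luigi.Parameter()

    def output(self):
        return luigi.LocalTarget(f"features/{self.window_id}.parquet")

    def run(self):
        # read_group_for_id is a placeholder for loading the rolled data of one id
        group = read_group_for_id(self.window_id)
        df_features = extract_features(group, column_id="id", column_sort="tick")
        df_features.to_parquet(self.output().path)


class MergeFeatures(luigi.Task):
    """Concat all per-id feature files into the final dataframe."""
    window_ids = luigi.ListParameter()

    def requires(self):
        return [FeaturesForId(window_id=i) for i in self.window_ids]

    def output(self):
        return luigi.LocalTarget("features/all.parquet")

    def run(self):
        dfs = [pd.read_parquet(target.path) for target in self.input()]
        pd.concat(dfs).to_parquet(self.output().path)
```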
> Very interesting! Thanks for the investigation.
> I am just wondering why this is so different from now, because the `groups = df_rolled.groupby('id', sort=False)` is also done internally (more or less), and even when the data is written out as a df and concatenated - it still needs to be held in memory. So I guess (as you mentioned), it is really the dict of dicts, which seems to be a lot worse in memory... Thinking about it, there could be another thing hiding here: when all data is processed at once, we need to keep the results (in a list of tuples), the input data as well as the current calculation in memory. That might be too much (or did it explicitly fail after the feature calculation had finished and the progress bar was done?) That is definitely something to continue studying.
That's a good question about the progress bar. Cannot tell at the moment, will investigate (hopefully) tomorrow.
> Just a side note: if you want to automate this "prepare data once, then do feature extraction on each chunk, then merge it" pattern, I would suggest using a workflow automation tool such as `luigi`. If you need some impression of how it could look, I have written a small note at the very end of this about it.
A long time ago, I had a look at `luigi`. I will have a look at it again. Thanks for the hint.
By the way: thank you for investigating so deeply and trying to solve the problem! Nice to see that people really want to push the boundaries!
So I checked the behavior of the RAM and the progress bar. The progress bar for the feature extraction completes. With the old pivoting, it crashed with an OOM (56 GB RAM).
With the new logic, it does not crash, but seems to swap a bit (32 GB RAM).
For my workaround, I only tried the new version with the improved pivoting (which does not crash, either; 32 GB RAM).
So I still need to compare my workaround with the new pivoting. But the new pivoting is already way better than the old one. 👍
A small update on this: I was extracting a lot of features on different files, but with the same kind of data. At some point the new pivoting went OOM, so I went back to my workaround, which was still able to compute the features (same PC, same RAM, etc.).
EDIT: I will try to replace the dicts with something more memory-friendly. But it will take some time as I am quite busy at the moment.
- OS: Linux (16 cores)
- version: current master branch
- a rolled time series
```
<class 'pandas.core.frame.DataFrame'>
Int64Index: 627750 entries, 0 to 627749
Data columns (total 9 columns):
 #   Column      Non-Null Count   Dtype
---  ------      --------------   -----
 0   ride_id     627750 non-null  uint8
 1   tick        627750 non-null  uint16
 2   imu_ax_FFT  627750 non-null  float32
 3   imu_ay_FFT  627750 non-null  float32
 4   imu_az_FFT  627750 non-null  float32
 5   imu_gx_FFT  627750 non-null  float32
 6   imu_gy_FFT  627750 non-null  float32
 7   imu_gz_FFT  627750 non-null  float32
 8   id          627750 non-null  object
dtypes: float32(6), object(1), uint16(1), uint8(1)
memory usage: 68.2 MB
```
When I run the feature extraction on this rolled dataframe, it takes about 10 minutes until the progress bar is filled to 100% (using all 16 cores on the system). Then it takes a very long time (approx. 20-30 minutes, maybe even more) until the feature extraction completes, while utilizing just a single core.
I encountered a similar problem, but my program is stuck before the feature extraction progress bar even starts: the Rolling step has been finished for 30 minutes, but the Feature Extraction step hasn't started. What could be the problem? Also, I noticed that only one core is working during this period.
> When I run the feature extraction on this rolled dataframe, it takes about 10 minutes until the progress bar is filled to 100% (using all 16 cores on the system). Then it takes a very long time (approx. 20-30 minutes, maybe even more) until the feature extraction completes, while utilizing just a single core.
If I read the code correctly, it must happen in this part of extraction.py (after `_do_extraction_on_chunk`). I did not do any profiling, but it must be the creation of the dataframe or the pivoting. Do you have any idea how to speed this up? If it is the pivot, could this be done on the workers?
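A quick way to confirm where the time goes (just a sketch; `cProfile` is in the standard library and only profiles the main process, which is exactly the slow single-core part here):

```python
import cProfile
import pstats

from tsfresh import extract_features

# df_rolled is the rolled dataframe described above
profiler = cProfile.Profile()
profiler.enable()
df_features = extract_features(df_rolled, column_id="id", column_sort="tick")
profiler.disable()

# show the 20 functions with the largest cumulative time; dataframe creation
# and pivoting should show up here if they dominate the runtime
pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)
```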
A second question (I don't want to open another issue for this): did you consider switching from native pandas to modin to parallelize e.g. `concat`s and other pandas operations (https://github.com/modin-project/modin)? I know this would be a bigger change :)