DynamicsAndNeuralSystems / pycatch22

Python implementation of catch22
https://time-series-features.gitbook.io/catch22/python
GNU General Public License v3.0

Issue with doing multiple time series on Databricks #22

Open choward456 opened 10 months ago

choward456 commented 10 months ago

I am trying to compute features for a large number of time series, but I keep running into the error below. For example, when I run over 12,000 time series, the first 9,000 work fine, but at around 9,000 the loop breaks and kills the kernel. I tried running only the last 3,000, but the error still appears. Every time series in the last 3,000 works if I group by and apply my method one at a time; the issue only appears when I put it in a for loop, which runs a few of the time series before the error appears. I have also tried different cluster setups with varying sizes and numbers of workers, and the error still occurs. Any help would be greatly appreciated. Thanks!

features = []  # collects one feature frame per time series
series_ids = issues_df['time_series_idx'].unique()  # computed once, not on every iteration
for series_id in series_ids:
  reduced_df = issues_df[issues_df['time_series_idx'] == series_id]
  # catch_24 is my own method; it works by itself when I do one time series at a time
  features_df = reduced_df.groupby('run_id').apply(catch_24)
  features.append(features_df)

Fatal error: The Python kernel is unresponsive.
---------------------------------------------------------------------------
The Python process exited with exit code 139 (SIGSEGV: Segmentation fault).

The last 10 KB of the process's stderr and stdout can be found below. See driver logs for full logs.
---------------------------------------------------------------------------
Last messages on stderr:
y", line 1016 in _bootstrap_inner
  File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007fb746ffe640 (most recent call first):
  File "/usr/lib/python3.10/multiprocessing/pool.py", line 114 in worker
  File "/usr/lib/python3.10/threading.py", line 953 in run
  File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

fkiraly commented 10 months ago

You could try sktime's version, which handles parallelization outside pycatch22? The backend is selected using set_config and the parallelization backend param.

https://www.sktime.net/en/latest/api_reference/auto_generated/sktime.transformations.panel.catch22.Catch22.html
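
Whichever backend is used, it may help to first narrow down which series triggers the segmentation fault. Below is a minimal sketch, assuming the crash comes from native code running on one particular input: each series is processed in a fresh Python interpreter, so a hard crash kills only that child process and its return code can be recorded. The names `CHILD_SCRIPT`, `features_in_subprocess` and `find_failing_series` are hypothetical. The child computes stand-in statistics and simulates a crash on an empty series; for the real case, the child body would call the actual feature function instead.

```python
import json
import subprocess
import sys

# Hypothetical stand-in for the real feature computation. For the real case,
# the child body would call the actual feature function on `values` instead.
CHILD_SCRIPT = """
import json, os, statistics, sys
values = json.load(sys.stdin)
if not values:
    os._exit(139)  # simulate a hard crash on a "bad" (here: empty) series
print(json.dumps({"mean": statistics.fmean(values),
                  "stdev": statistics.pstdev(values)}))
"""


def features_in_subprocess(values, script=CHILD_SCRIPT, timeout=60):
    """Run the feature script on one series in a fresh interpreter.

    Returns (features, returncode); features is None if the child failed.
    On POSIX, a child killed by a signal reports a negative return code
    (for example -11 for SIGSEGV).
    """
    proc = subprocess.run(
        [sys.executable, "-c", script],
        input=json.dumps(list(values)),
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    if proc.returncode != 0:
        return None, proc.returncode
    return json.loads(proc.stdout), 0


def find_failing_series(series_by_id):
    """Return {series_id: returncode} for every series whose child died."""
    failures = {}
    for series_id, values in series_by_id.items():
        _, code = features_in_subprocess(values)
        if code != 0:
            failures[series_id] = code
    return failures
```

Starting one interpreter per series is slow, so this is meant as a diagnostic pass over the suspect subset (for example the last 3,000 series) rather than a replacement for the main loop. If no single series fails in isolation, that would suggest the crash depends on state accumulated across iterations rather than on a particular input.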