Prime dilation slower on some cases

baraline commented 1 year ago

In some cases, using prime_dilation=True is slower than prime_dilation=False for RDST Ensemble. This can happen for example on the Rock dataset.

[ ] Investigate the issue (it only happens for Ensemble)

baraline commented 1 year ago

When using the benchmark script (3cv) on the Rock dataset, we notice a very high standard deviation for RDST Ensemble Prime:

| n_timestamps | RDST Prime | RDST Ensemble Prime | RDST | RDST Ensemble | Rocket | MultiRocket | DrCIF | TDE | STC | HC2 | RDST Prime_std | RDST Ensemble Prime_std | RDST_std | RDST Ensemble_std | Rocket_std | MultiRocket_std | DrCIF_std | TDE_std | STC_std | HC2_std |
| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| 474 | 1.0630292184650898 | 25.052939302287996 | 1.4401322677731514 | 4.770994765684009 | 9.273215881548822 | 23.353670642711222 | 49.786722905002534 | 77.79590922128409 | 80.43357326462865 | 7884.829313875176 | 0.004903493449091911 | 16.0698274159804 | 0.01821163296699524 | 3.447608422487974 | 0.021983223967254162 | 0.48081249091774225 | 7.353978189639747 | 18.233609943650663 | 0.43430908396840096 | 3.822393278591335 |
| 948 | 2.377154231071472 | 25.898940067738295 | 3.0750416861847043 | 3.0481695402413607 | 13.261304871179163 | 24.991533936932683 | 55.76158287934959 | 108.96176979038864 | 101.37790459487587 | 7960.439127580263 | 0.006226222962141037 | 16.065012263134122 | 0.06886849086731672 | 0.03800296410918236 | 0.03066807147115469 | 0.5928529351949692 | 7.1618244629353285 | 12.518209223635495 | 3.9045204231515527 | 0.2534934086725116 |
| 1422 | 3.4417717000469565 | 27.04298056382686 | 5.201300728134811 | 8.40116765908897 | 18.263901693746448 | 25.862254047766328 | 64.07761400006711 | 119.52752438280731 | 102.72296567447484 | 7970.518767527305 | 0.011878960765898228 | 15.908938153646886 | 0.006814070977270603 | 3.253817331045866 | 0.9779286533594131 | 0.002123715355992317 | 7.804655771702528 | 24.37082715984434 | 5.647393397986889 | 11.92994621861726 |
| 1896 | 4.6401264341548085 | 28.24125813692808 | 8.04301328677684 | 11.572580952197313 | 21.337463438510895 | 27.698388851247728 | 67.51543310005218 | 150.7656864784658 | 111.24626373499632 | 8078.189195295796 | 0.08419013861566782 | 15.739565890282393 | 0.008268513716757298 | 4.111701520159841 | 0.04452386498451233 | 0.7664412679150701 | 9.207885961048305 | 18.048286149278283 | 0.7535054516047239 | 3.2518416047096252 |
| 2370 | 6.071481054648757 | 27.885636082850397 | 11.334959764964879 | 13.973823537118733 | 25.491070554591715 | 29.93132807407528 | 77.2411882840097 | 177.13270619604737 | 122.8379875915125 | 8132.091950537637 | 0.0660868901759386 | 15.293029426597059 | 0.4737072614952922 | 2.480648464523256 | 0.006583829410374165 | 0.9237984782084823 | 9.473532510921359 | 0.10296206641942263 | 0.4684769967570901 | 11.067898195236921 |
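
To illustrate why a standard deviation of this size over only three folds suggests that individual folds behave very differently, here is a small sketch. The fold timings are hypothetical, not measured values, and the population standard deviation is an assumption (the benchmark script may compute it differently).

```python
from statistics import mean, pstdev

# Hypothetical per-fold timings in seconds (NOT taken from the benchmark).
steady_folds = [4.0, 4.1, 3.9]
one_slow_fold = [4.0, 4.1, 40.0]  # same run, but one fold pays a large extra cost

for label, folds in [("steady", steady_folds), ("one slow fold", one_slow_fold)]:
    # With three samples, a single slow fold dominates both mean and std.
    print(f"{label}: mean={mean(folds):.2f}, std={pstdev(folds):.2f}")
```

With so few folds, a single slow fold is enough to push the standard deviation to the same order of magnitude as the mean.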

A possible cause would be an issue with numba not caching the functions correctly, and having to recompile some of them before each new step of validation for RDST Ensemble Prime. This may also be true for RDST Ensemble.

The problem also appears in the UCR cross-validation run: only the first dataset had a high standard deviation for timing, despite a first run on a synthetic dataset beforehand.
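
One way to look at the recompilation hypothesis in isolation is to time repeated calls of a small jitted function: with numba, the first call for a given argument type includes compilation, and later calls in the same process should not. The sketch below is not taken from convst; `sum_of_squares` and `time_calls` are hypothetical names, and the no-op fallback decorator is only there so the snippet still runs where numba is not installed (the timings are then meaningless).

```python
from timeit import default_timer as timer

try:
    from numba import njit
except ImportError:
    # Assumption: fallback so the sketch runs without numba (no compilation).
    def njit(func):
        return func


@njit
def sum_of_squares(n):
    # Simple loop that numba can compile to native code.
    total = 0.0
    for i in range(n):
        total += i * i
    return total


def time_calls(func, arg, n_calls=3):
    """Return the duration (in seconds) of each successive call to func(arg)."""
    durations = []
    for _ in range(n_calls):
        t0 = timer()
        func(arg)
        durations.append(timer() - t0)
    return durations


durations = time_calls(sum_of_squares, 1000)
# With numba installed, the first entry includes the compilation time.
print(durations)
```

If later calls also show compile-sized delays, or if every new worker process pays the first-call cost again, that would point at compilation rather than at the algorithm itself. numba offers a `cache=True` option on its decorators to persist compiled code on disk; whether that is relevant here is an assumption.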

baraline commented 1 year ago

Code to reproduce the issue:

```python
from convst.classifiers import R_DST_Ensemble
from convst.utils.dataset_utils import load_sktime_dataset_split
from sktime.classification.kernel_based import RocketClassifier
from timeit import default_timer as timer
import pandas as pd


def time_pipe(pipeline, X_train, y_train, X_test):
    """Return the wall-clock time (in seconds) of one fit + predict cycle."""
    t0 = timer()
    pipeline.fit(X_train, y_train)
    pipeline.predict(X_test)
    t1 = timer()
    return t1 - t0


X_train, X_test, y_train, y_test, _ = load_sktime_dataset_split('GunPoint')
rdst = R_DST_Ensemble(n_shapelets_per_estimator=1, n_jobs=-1)
rkt = RocketClassifier(rocket_transform='minirocket', n_jobs=-1)
df = pd.DataFrame()

# Refit and time both classifiers 10 times on the same train/test split.
for i in range(10):
    df.loc[i, 'rdst'] = time_pipe(rdst, X_train, y_train, X_test)
    df.loc[i, 'rkt'] = time_pipe(rkt, X_train, y_train, X_test)

print(df)
```

This results in the following timings (in seconds):

```
        rdst       rkt
0  49.011466  0.335395
1   0.110011  0.351882
2   0.090319  0.370477
3   0.106159  0.346266
4  10.088280  0.368425
5   0.120686  0.353858
6   0.147703  0.352415
7   0.127688  0.335258
8   0.127785  0.354415
9   8.584705  0.341209
```

If Rocket is not called, the results are similar. The problem would then come from some numba functions that need to be compiled again after some runs.
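
To make the slow runs easier to spot, here is a minimal sketch that flags runs far above the median. The `rdst` timings are copied from the results above; the helper name `flag_slow_runs` and the factor of 10 are arbitrary choices for illustration, not part of convst.

```python
from statistics import median

# R_DST_Ensemble timings in seconds, copied from the results above.
rdst_times = [49.011466, 0.110011, 0.090319, 0.106159, 10.088280,
              0.120686, 0.147703, 0.127688, 0.127785, 8.584705]


def flag_slow_runs(times, factor=10.0):
    """Return the indices of runs slower than `factor` times the median."""
    threshold = factor * median(times)
    return [i for i, t in enumerate(times) if t > threshold]


slow_runs = flag_slow_runs(rdst_times)
print(slow_runs)  # → [0, 4, 9]
```

The median is used rather than the mean because the slow runs themselves would otherwise inflate the threshold. The flagged runs are not limited to the first call, which fits the idea that something is recompiled again later on.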

baraline commented 1 year ago

Fixed by #36

baraline / convst

Prime dilation slower on some cases #34