[BUG] Problems with the size of data

rruizdeaustri commented 4 months ago

Describe the bug

Hi,

I want to use ROCKET algorithms to classify gravitational waves. The shape of my data is (400000, 2, 2048), where 2 is the number of channels and 2048 is the length of each time series. Fitting fails with the error shown under "Actual results".

Thank you !

Roberto

Steps/Code to reproduce the bug

import os
import sys
import numpy as np
import h5py
import time
from pathlib import Path

import tensorflow as tf
import matplotlib.pyplot as plt

from utils.configfiles import get_config
from utils.datasets import load_data_into_numpy, InjectionSNR
from utils.metrics import auc_snr_eval
import json

from aeon.classification.deep_learning import LITETimeClassifier
from aeon.classification.hybrid import HIVECOTEV1, HIVECOTEV2
from aeon.classification.convolution_based import Arsenal

from sklearn.metrics import roc_auc_score

# -----------------------------------------------------------------------------
# MAIN CODE
# -----------------------------------------------------------------------------

model = 'multirocket'

if __name__ == '__main__':

    # -------------------------------------------------------------------------
    # Preliminaries
    # -------------------------------------------------------------------------
    script_start = time.time()  # start time, reported at the end of the script
    print(tf.config.list_physical_devices('GPU'))

    # Example usage with your configuration settings
    config = get_config()
    xtrain, ytrain = load_data_into_numpy(config['data']['training'])
    xtest, ytest = load_data_into_numpy(config['data']['testing'])

    injections_snr = InjectionSNR()

    if model == 'LITETime': 
     clf = LITETimeClassifier(batch_size=32, n_classifiers=5, n_epochs=50, file_path='checkpoints/',
                              save_best_model=True, best_file_name="best_model", verbose=True)
    elif model == 'hivecote':
     clf = HIVECOTEV2(time_limit_in_minutes=0.2, verbose=1)   
    elif model == 'multirocket':
     clf = Arsenal(rocket_transform="multirocket")
    else:
     print('wrong model')   
     sys.exit()   

    # Check unique values in data
    #unique_values, counts = np.unique(ytrain, return_counts=True)
    #print(f"Unique values in predictions: {unique_values}")
    #print(f"Counts of unique values: {counts}")
    #sys.exit()
    print(f"xtrain shape: {xtrain.shape}, type: {xtrain.dtype}")
    clf.fit(xtrain, ytrain)

    #Compute AUC versus SNR and plot
    ypred = clf.predict(xtest)

    # Check unique values in predictions
    unique_values, counts = np.unique(ypred, return_counts=True)
    print(f"Unique values in predictions: {unique_values}")
    print(f"Counts of unique values: {counts}")
    #print(ypred.shape, ypred, ytest)
    sys.exit()
    # Assume non-signal data are those with a true label of 0
    non_signal_indices = np.where(ytest == 0)[0]

    # Function receives the scores and computes the bin AUCs
    auc_snr = auc_snr_eval(injections_snr, ypred, ytest, non_signal_indices)

    print(clf.score(xtest, ytest))

    # -------------------------------------------------------------------------
    # Save results as a JSON file
    # -------------------------------------------------------------------------
    print('Saving auc versus snr results to JSON file...', end=' ', flush=True)
    with open('results/metrics/auc_over_snr_aeon.json', 'w') as json_file:
        json.dump(auc_snr, json_file, sort_keys=True, indent=2)

    print('Done!') 

    # Extract data from results
    snr_bins = [(float(a), float(b)) for (a, b) in auc_snr['snr_bins']]
    auc_ratios = np.array(auc_snr['auc']).astype(float)
    grid = [np.mean(_) for _ in snr_bins]

    # Initialize a color cycle for plotting
    #colors = plt.cm.jet(np.linspace(0, 1, 1))

    # Plot the data with a different color for each curve
    plt.plot(grid, auc_ratios, marker='o', ms=2, mew=0.5, linestyle='-', label='aeon')

    # Initialize the legend list
    legend_labels = []

    # Add the model name to the legend
    legend_labels.append('CNN')

    # Configure the plot and add a legend
    plt.xlabel('SNR')
    plt.ylabel('AUC')
    plt.legend(legend_labels, loc='best')
    plt.grid(True)

    # Construct path to save this plot
    plots_dir = './plots'
    Path(plots_dir).mkdir(exist_ok=True)
    file_path = os.path.join(plots_dir, 'auc_snr_combined.pdf')

    # Save the plot as a PDF
    print('Saving plot as PDF...', end=' ', flush=True)
    plt.savefig(file_path, bbox_inches='tight', pad_inches=0)
    print('Done!', flush=True)
    #plt.show()

    auc = roc_auc_score(ytest, ypred)

    print('Test set AUC: {:.2f}%'.format(100.*auc))

    AUC = []
    AUC.append(auc)

    np.savetxt('results/metrics/auc_aeon.txt', AUC)

    print(80 * '-' + '\n\n' + 'Testing complete!')

    # -------------------------------------------------------------------------
    # Postliminaries
    # -------------------------------------------------------------------------

    print('')
    print(f'This took {time.time() - script_start:.1f} seconds!')
    print('')

Expected results

The classifier should fit and predict without errors.

Actual results

 Traceback (most recent call last):
  File "/lustre/home/ific/rruiz/projects/gws/aeon/main.py", line 83, in <module>
    clf.fit(xtrain, ytrain)
  File "/lustre/home/ific/rruiz/.conda/envs/tf/lib/python3.10/site-packages/aeon/classification/base.py", line 129, in fit
    self._fit(X, y)
  File "/lustre/home/ific/rruiz/.conda/envs/tf/lib/python3.10/site-packages/aeon/classification/convolution_based/_arsenal.py", line 171, in _fit
    self._fit_arsenal(X, y)
  File "/lustre/home/ific/rruiz/.conda/envs/tf/lib/python3.10/site-packages/aeon/classification/convolution_based/_arsenal.py", line 335, in _fit_arsenal
    fit = Parallel(n_jobs=self._n_jobs, prefer="threads")(
  File "/lustre/home/ific/rruiz/.conda/envs/tf/lib/python3.10/site-packages/joblib/parallel.py", line 1918, in __call__
    return output if self.return_generator else list(output)
  File "/lustre/home/ific/rruiz/.conda/envs/tf/lib/python3.10/site-packages/joblib/parallel.py", line 1847, in _get_sequential_output
    res = func(*args, **kwargs)
  File "/lustre/home/ific/rruiz/.conda/envs/tf/lib/python3.10/site-packages/aeon/classification/convolution_based/_arsenal.py", line 367, in _fit_ensemble_estimator
    transformed_x = rocket.fit_transform(X)
  File "/lustre/home/ific/rruiz/.conda/envs/tf/lib/python3.10/site-packages/aeon/transformations/collection/base.py", line 161, in fit_transform
    Xt = self._fit_transform(X=X_inner, y=y_inner)
  File "/lustre/home/ific/rruiz/.conda/envs/tf/lib/python3.10/site-packages/aeon/transformations/collection/base.py", line 326, in _fit_transform
    return self._transform(X, y)
  File "/lustre/home/ific/rruiz/.conda/envs/tf/lib/python3.10/site-packages/aeon/transformations/collection/convolution_based/_multirocket_multivariate.py", line 168, in _transform
    X = _transform(
  File "/lustre/home/ific/rruiz/.conda/envs/tf/lib/python3.10/site-packages/numba/core/dispatcher.py", line 703, in _explain_matching_error
    raise TypeError(msg)
TypeError: No matching definition for argument type(s) array(float32, 3d, C), array(float32, 3d, C), Tuple(array(int32, 1d, C), array(int32, 1d, C), array(int32, 1d, C), array(int32, 1d, C), array(float32, 1d, C)), Tuple(array(int32, 1d, C), array(int32, 1d, C), array(int32, 1d, C), array(int32, 1d, C), array(float32, 1d, C)), int64

Versions

0.8.1

baraline commented 4 months ago

I suspect that this might be because of the strict definition of the signature of the transform function in multirocket, which only accepts float64 arrays. In the fit method, X is converted to float64, but not in transform.

What is the datatype of your input? If it's float32, would converting it to float64 work? (The size of the data might become an issue if you don't have enough RAM, but for testing purposes you can reduce it.)

If this is the cause of the bug, we would need to discuss why float64 has been made mandatory in the function signature, and if we can relax it to allow other types.
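
A rough sketch of what that test could look like, using only NumPy. The helper `nbytes_for` and the random stand-in for `xtrain` are assumptions made for illustration; the shape (400000, 2, 2048) is the one reported above. It estimates the size of the full input in both precisions without allocating it, then casts only a small slice to float64 for a quick test.

```python
import numpy as np


def nbytes_for(shape, dtype):
    """Bytes needed for a dense array of this shape and dtype, without allocating it."""
    return int(np.prod(shape, dtype=np.int64)) * np.dtype(dtype).itemsize


full_shape = (400_000, 2, 2048)  # shape reported in the issue
gb32 = nbytes_for(full_shape, np.float32) / 1e9
gb64 = nbytes_for(full_shape, np.float64) / 1e9
print(f"float32: {gb32:.1f} GB, float64: {gb64:.1f} GB")
# float32: 6.6 GB, float64: 13.1 GB

# Stand-in for the real xtrain (assumption: the real data is float32).
rng = np.random.default_rng(0)
xtrain = rng.random((1000, 2, 2048), dtype=np.float32)

# Cast only a small slice to float64 so the test stays cheap.
x_small = xtrain[:200].astype(np.float64)
print(x_small.dtype, x_small.shape)
```

If `clf.fit(x_small, ytrain[:200])` then runs, the dtype is the likely cause rather than the size of the data.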

TonyBagnall commented 4 months ago

Thanks for the bug report. From the trace, this comes from fit called on Arsenal. This works:

from aeon.classification.convolution_based import Arsenal
import numpy as np
shape = (40, 2, 2000)
X = np.random.rand(*shape).astype(np.float32)
y = np.random.randint(0, 2, size=40)
afc = Arsenal()
afc.fit(X, y)

What is the data type of your xtrain?

TonyBagnall commented 4 months ago

I would also recommend putting a time limit on HC2 if you want to run it on a problem of that size.

TonyBagnall commented 4 months ago

Ah, ignore that: as @baraline pointed out on Slack, I missed that you had set it to multirocket. This does indeed crash, definitely a bug.

from aeon.classification.convolution_based import Arsenal
from aeon.transformations.collection.convolution_based import MultiRocket
from aeon.classification.hybrid import HIVECOTEV2
import numpy as np
shape = (40, 2, 200)
X = np.random.rand(*shape).astype(np.float32)
print(X.shape)
y = np.random.randint(0, 2, size=40)
afc = Arsenal(rocket_transform="multirocket")
afc.fit(X, y)
print("Finished fit for arsenal")
print(afc.predict(X))

Wait, it's more complex. This crashes with multivariate series:

TypeError: No matching definition for argument type(s) array(float32, 3d, C), array(float32, 3d, C)

but not with univariate series, e.g. shape = (40, 1, 200).

TonyBagnall commented 4 months ago

For some bizarre reason we have both MultiRocketMultivariate and MultiRocket, and the problem lies with the former (don't ask why we have these weird versions, it's legacy!).

from aeon.transformations.collection.convolution_based import MultiRocketMultivariate
mr = MultiRocketMultivariate()
mr.fit(X)
Xt = mr.transform(X)

gives the same type error. The problem occurs in the numba internal method _transform (confusingly not the one implementing the abstract class).

it has this numba signature

@njit(
    "float32[:,:](float64[:,:,:],float64[:,:,:],"
    "Tuple((int32[:],int32[:],int32[:],int32[:],float32[:])),"
    "Tuple((int32[:],int32[:],int32[:],int32[:],float32[:])),int32)",
    fastmath=True,
    parallel=True,
    cache=True,
)
def _transform(X, X1, parameters, parameters1, n_features_per_kernel=4):
    num_examples, num_channels, input_length = X.shape

the univariate version has this

@njit(
    "float32[:,:](float64[:,:],float64[:,:],Tuple((int32[:],int32[:],float32[:])),"
    "Tuple((int32[:],int32[:],float32[:])),int32)",
    fastmath=True,
    parallel=True,
    cache=True,
)
def _transform(X, X1, parameters, parameters1, n_features_per_kernel):

rruizdeaustri commented 3 months ago

Hi,

The issue disappeared with the trick of converting the data to float64, but after some time the code stopped with a memory error. The input shape was (400000, 2, 2048), which is probably too much to hold in RAM, and worse if the numbers are 64-bit. Is there a way of using batches during training to avoid this?

Thanks !

Rbt

baraline commented 3 months ago

Hey, I think the right way of handling this on our side would be to make those functions support both float64 and float32 inputs. We'll discuss the best approach and work on a fix. In the meantime, I see two options if you want to use your full dataset, both of which unfortunately involve some tinkering:

1. Edit the sources to change float64 to float32 in the signature of the _transform function. This will fix the problem locally, and hopefully avoids the memory error.
2. Otherwise, if you can fit the multirocket transformer on the whole data, you can then transform the data and save it in batches to avoid the memory error in transform.
- If fit throws a memory error, you could fit on only part of the input and then do the batch transform. I'm not sure about the impact of not fitting on the full data, but as rocket kernels are mostly random, it should be somewhat “fine”.
- To learn a classifier from this batch-transformed data, if memory is still an issue in the transformed format, you would need a sklearn classifier with the update capability (partial_fit); otherwise, you're fine to use a RidgeClassifierCV as in the individual rocket classifiers.

The second option would of course cover only one rocket transformer. To mimic Arsenal's behavior, you would need to do this n_estimators times and combine the predictions of all of them using the ensemble scheme used in Arsenal (i.e. this function).
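
The batch transform from the second option could be sketched roughly as follows. `iter_transformed_batches` is a hypothetical helper and `MeanStdTransformer` is a stand-in with a `transform` method so that the example runs with NumPy alone; in practice an already-fitted multirocket transformer would be passed instead. Only one batch is ever held in float64, and the features are kept as float32.

```python
import numpy as np


def iter_transformed_batches(transformer, X, batch_size=1024):
    """Yield (start_index, features) for an already-fitted transformer.

    Each batch is cast to float64 just before the call, so a float64 copy
    of the whole input never exists in memory.
    """
    for start in range(0, len(X), batch_size):
        batch = np.asarray(X[start:start + batch_size], dtype=np.float64)
        yield start, np.asarray(transformer.transform(batch), dtype=np.float32)


class MeanStdTransformer:
    """Hypothetical stand-in: per-channel mean and std as features."""

    def transform(self, X):
        return np.concatenate([X.mean(axis=2), X.std(axis=2)], axis=1)


rng = np.random.default_rng(0)
X_demo = rng.random((10, 2, 64), dtype=np.float32)

parts = [Xt for _, Xt in iter_transformed_batches(MeanStdTransformer(), X_demo, batch_size=4)]
features = np.concatenate(parts, axis=0)
print(features.shape, features.dtype)
```

Instead of concatenating, each yielded batch could be written to disk or passed to a classifier's `partial_fit`, which is the case where the transformed data does not fit in memory either.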

TonyBagnall commented 3 months ago

Personally I would just train it on a subset. Ultimately, Rocket classifiers are pipelines which generate very large feature spaces. The flip side is that you probably don't really need that much data to train, since it is after all mostly random. Predict can of course be done in batches.
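
One way to pick such a subset while keeping every class represented is sketched below. `stratified_subset_indices` is a hypothetical helper written for this example, not part of aeon, and the labels are synthetic.

```python
import numpy as np


def stratified_subset_indices(y, n_per_class, seed=0):
    """Pick up to n_per_class random indices from each class, without replacement."""
    rng = np.random.default_rng(seed)
    picked = []
    for label in np.unique(y):
        idx = np.flatnonzero(y == label)
        take = min(n_per_class, idx.size)
        picked.append(rng.choice(idx, size=take, replace=False))
    # Sorted indices keep reads sequential if X lives in an HDF5 file or memmap.
    return np.sort(np.concatenate(picked))


# Synthetic, imbalanced labels standing in for ytrain.
y_demo = np.array([0] * 900 + [1] * 100)
subset = stratified_subset_indices(y_demo, n_per_class=50)
print(subset.size, np.bincount(y_demo[subset]))
```

The classifier would then be fitted on `xtrain[subset]` and `ytrain[subset]`, and `clf.predict` called on consecutive slices of `xtest`.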

TonyBagnall commented 3 months ago

In terms of the code, I think for now we should just do what rocket does and cast to 32 bits with X.astype(np.float32). I think it may have been @dguijo who did that? The whole module needs reworking, to be honest.

TonyBagnall commented 3 months ago

@rruizdeaustri this should be fixed by #1612, at least in terms of 32-bit/64-bit. We plan to redesign the whole rocket package, but it will always be memory intensive (see #1126). I don't think you can avoid creating an (n_cases, n_kernels) array if you use it as the authors proposed. I suggest reducing either the train size or the number of kernels.
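
To get a feel for that trade-off, the size of the transformed array can be estimated before running anything. This is back-of-the-envelope arithmetic only: `n_features` below is a made-up placeholder, since the real number of columns depends on the transformer and its number of kernels, and both helper functions are hypothetical.

```python
def feature_matrix_gb(n_cases, n_features, itemsize=4):
    """Size in GB of a dense (n_cases, n_features) array (float32 by default)."""
    return n_cases * n_features * itemsize / 1e9


def max_cases_for_budget(budget_gb, n_features, itemsize=4):
    """Largest n_cases whose feature matrix fits within budget_gb."""
    return int(budget_gb * 1e9 // (n_features * itemsize))


n_features = 10_000  # placeholder, NOT the actual multirocket feature count

print(feature_matrix_gb(400_000, n_features))  # 16.0
print(max_cases_for_budget(4.0, n_features))   # 100000
```

Halving either the number of cases or the number of features halves the estimate, which is why reducing the train size and reducing the number of kernels are interchangeable from a memory point of view.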

ok to close this issue?

rruizdeaustri commented 3 months ago

Yes and thanks a lot for the quick feedback!
