Closed rruizdeaustri closed 3 months ago
I suspect this might be caused by the strict signature of the transform function in MultiRocket, which only accepts float64 arrays. In the fit method, X is converted to float64, but not in transform.
What is the datatype of your input? If it's float32, would converting it to float64 work? (The size of the data might become an issue if you don't have enough RAM, though; but for testing purposes you can reduce it.)
If this is the cause of the bug, we would need to discuss why float64 has been made mandatory in the function signature, and if we can relax it to allow other types.
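A minimal way to check the dtype hypothesis and see what the cast costs, using pure numpy with a shrunken stand-in for the reporter's data (the array shape here is illustrative, not the real dataset):

```python
import numpy as np

# Shrunken stand-in for the reporter's data; the real shape was (400000, 2, 2048)
X = np.random.rand(40, 2, 200).astype(np.float32)
X64 = X.astype(np.float64)  # the cast suggested above

print(X.dtype, X64.dtype)      # float32 float64
print(X64.nbytes // X.nbytes)  # 2 -- memory usage doubles
```

The doubled footprint is why the cast is only a stopgap on a dataset this large.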
Thanks for the bug report. From the trace, this comes from fit called on Arsenal. This works:
from aeon.classification.convolution_based import Arsenal
import numpy as np
shape = (40, 2, 2000)
X = np.random.rand(*shape).astype(np.float32)
y = np.random.randint(0, 2, size=40)
afc = Arsenal()
afc.fit(X, y)
What is the data type of your xtrain?
I would also recommend putting a time limit on HC2 if you want to run it on a problem of that size.
Ah, ignore that: as @baraline pointed out on Slack, I missed that you had set it to multirocket. This does indeed crash; definitely a bug.
from aeon.classification.convolution_based import Arsenal
import numpy as np
shape = (40, 2, 200)
X = np.random.rand(*shape).astype(np.float32)
print(X.shape)
y = np.random.randint(0, 2, size=40)
afc = Arsenal(rocket_transform="multirocket")
afc.fit(X, y)
print("Finished fit for arsenal")
print(afc.predict(X))
Wait, it's more complex. This crashes with multivariate series:
TypeError: No matching definition for argument type(s) array(float32, 3d, C), array(float32, 3d, C)
but not with univariate data, shape = (40, 1, 200).
For some bizarre reason we have both MultiRocketMultivariate and MultiRocket, so the problem lies with the former (don't ask why we have these weird versions, it's legacy!).
from aeon.transformations.collection.convolution_based import MultiRocketMultivariate
mr = MultiRocketMultivariate()
mr.fit(X)
Xt = mr.transform(X)
gives the same type error. The problem occurs in the numba internal method _transform (confusingly, not the one implementing the abstract class). It has this numba signature:
@njit(
"float32[:,:](float64[:,:,:],float64[:,:,:],"
"Tuple((int32[:],int32[:],int32[:],int32[:],float32[:])),"
"Tuple((int32[:],int32[:],int32[:],int32[:],float32[:])),int32)",
fastmath=True,
parallel=True,
cache=True,
)
def _transform(X, X1, parameters, parameters1, n_features_per_kernel=4):
num_examples, num_channels, input_length = X.shape
The univariate version has this:
@njit(
"float32[:,:](float64[:,:],float64[:,:],Tuple((int32[:],int32[:],float32[:])),"
"Tuple((int32[:],int32[:],float32[:])),int32)",
fastmath=True,
parallel=True,
cache=True,
)
def _transform(X, X1, parameters, parameters1, n_features_per_kernel):
Hi,
The issue disappeared with the trick of converting the data to float64, but after some time the code stopped with a memory error. The input data shape was (400000, 2, 2048), probably too much to handle in RAM, and worse if the numbers are 64 bits. Is there no way of using batches in the training to avoid this?
Thanks !
Rbt
Hey, I think the right way of handling this on our side would be to make those functions support both float64 and float32 inputs; we'll discuss the best approach and work on a fix. In the meantime, I see two options if you want to use your full dataset, which unfortunately will involve some tinkering:
Edit the sources to change the float64 in the _transform signature to float32. This will fix the problem locally and hopefully avoid the memory error.
Otherwise, if you can fit the MultiRocket transformer on the whole data, you can then transform the data and save it in batches to avoid the memory error here. To learn a classifier from this batch-transformed data, if memory is still an issue in the transformed format, you would need a sklearn classifier with partial-fit (incremental) capability; otherwise, you're fine to use a RidgeClassifierCV, as in the individual rocket classifiers.
The second option would of course cover only one rocket transformer; to mimic Arsenal's behaviour, you would need to do this n_estimators times and combine the predictions of all of them using the ensemble scheme used in Arsenal (i.e. this function).
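A sketch of the second option: transform in fixed-size batches, stack the features, then train a RidgeClassifierCV. The transform function here is a hypothetical stand-in (simple per-channel summaries); in practice you would call the fitted MultiRocket transformer's transform on each chunk, and could np.save each chunk to disk instead of keeping it in memory:

```python
import numpy as np
from sklearn.linear_model import RidgeClassifierCV

rng = np.random.default_rng(0)
X = rng.random((200, 2, 100)).astype(np.float32)  # toy data, not the real set
y = rng.integers(0, 2, size=200)

def transform(batch):
    # Stand-in for a fitted MultiRocket transformer's .transform();
    # here just per-channel mean/std/max summaries for illustration.
    return np.concatenate(
        [batch.mean(axis=2), batch.std(axis=2), batch.max(axis=2)], axis=1
    )

# Transform in batches to bound peak memory; each chunk could be
# written to disk (e.g. np.save) rather than held in this list.
batch_size = 50
chunks = [transform(X[i:i + batch_size]) for i in range(0, len(X), batch_size)]
Xt = np.vstack(chunks)

clf = RidgeClassifierCV(alphas=np.logspace(-3, 3, 10))
clf.fit(Xt, y)
print(Xt.shape)  # (200, 6)
```

RidgeClassifierCV fits in one pass over the stacked feature matrix, so only the transform step needs batching.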
Personally, I would just train it on a subset; ultimately, Rocket classifiers are pipelines which generate very large feature spaces. The flip side is that you probably don't really need that much data to train, since the transform is, after all, mostly random. Predict can of course be done in batches.
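Batched predict can be a small wrapper around any fitted estimator. A generic sketch (predict_in_batches and the DummyClassifier stand-in are illustrative, not aeon API):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

def predict_in_batches(clf, X, batch_size=1000):
    # Run predict chunk by chunk so the transform inside the pipeline
    # never materialises the full feature matrix at once.
    parts = [clf.predict(X[i:i + batch_size])
             for i in range(0, len(X), batch_size)]
    return np.concatenate(parts)

# Toy fitted classifier standing in for a trained rocket pipeline
X = np.random.rand(100, 5)
y = np.random.randint(0, 2, size=100)
clf = DummyClassifier(strategy="most_frequent").fit(X, y)

print(predict_in_batches(clf, X, batch_size=32).shape)  # (100,)
```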
In terms of the code, I think for now we should just do what Rocket does and cast to 32 bits with X.astype(np.float32). I think it may have been @dguijo who did that? The whole module needs reworking tbh.
@rruizdeaustri this should be fixed by #1612, at least in terms of 32 bit/64 bit. We plan to redesign the whole rocket package, but it will always be memory intensive (see #1126). I don't think you can avoid creating an (n_cases, n_kernels) array if you use it as the authors proposed. I suggest either reducing the train size or the number of kernels.
OK to close this issue?
Yes and thanks a lot for the quick feedback!
Describe the bug
Hi,
I want to use rocket algorithms to classify gravitational waves. The size of my data is (400000, 2, 2048), where 2 is the number of channels and 2048 is the length of each time series. It does not work.
Thank you !
Roberto
Steps/Code to reproduce the bug
Expected results
Just that the classifier works.
Actual results
Versions
0.8.1