aeon-toolkit / aeon

A toolkit for machine learning from time series
https://aeon-toolkit.org/
BSD 3-Clause "New" or "Revised" License

[BUG] Arsenal predict not working properly?! #1696

Open tim-bsm opened 1 week ago

tim-bsm commented 1 week ago

Describe the issue

I was using sktime for Arsenal, but because of the split (for the reasons stated by Tony in this issue) and because I wanted to test whether aeon is faster than sktime, I decided to try aeon. As aeon is a fork of sktime, the basics are the same, so switching seemed easy. Well, that's what I thought. It turns out that predict in the Arsenal implementation now expects a different input type, and I can't get it to work.

I fit Arsenal in a separate file and save the fitted model with the joblib.dump function. The generated file is then loaded back in the main script, where I try to predict the class of my measurement. This always gives me the following error:

TypeError: No matching definition for argument type(s) array(float64, 1d, C), array(float64, 1d, C), Tuple(array(int32, 1d, C), array(int32, 1d, C), array(float32, 1d, C)), Tuple(array(int32, 1d, C), array(int32, 1d, C), array(float32, 1d, C)), int64

I'm trying to predict it like that:

acf = joblib.load("arsenal.joblib")
data = np.array(['0.0', '0.0', '0.0', ..., '0.0'], dtype=np.float64)
acf.predict(data)

The list contains 200 strings, all being '0.0'.
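As a side note on input shapes (an editorial sketch based on the collection layout aeon's estimators generally expect, not code from this thread): a single series has to be wrapped into a collection before calling predict, otherwise numba-compiled transforms can fail with a dispatch error like the one above.

```python
import numpy as np

# A single series of 200 points, as in the report above.
series = np.zeros(200, dtype=np.float64)

# aeon classifiers operate on collections of series, not single series:
# 2D input is (n_cases, n_timepoints), 3D is (n_cases, n_channels, n_timepoints).
X_2d = series.reshape(1, -1)     # one case, 200 time points
X_3d = series.reshape(1, 1, -1)  # one univariate case

print(X_2d.shape)  # (1, 200)
print(X_3d.shape)  # (1, 1, 200)
```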

After that I went to the examples directory in the aeon git repo and tried the example there. While reading it, I noticed that the line in "In [2]:" should be motions_test, motions_test_labels = load_basic_motions(split="test") instead of motions_test, motions_test_labels = load_basic_motions(split="train"), unless I'm completely misunderstanding the example. But that's just a side note. Using the example, I no longer got the error mentioned above, but a new one instead:

Terminating: fork() called from a process already using Gnu OpenMP, this is unsafe.

I assume this is a direct consequence of me using Python's multiprocessing module to run the prediction in parallel with another module?

Is this a bug, or am I doing something completely wrong? As I said, the example works fine in sktime.

Suggest a potential alternative/fix

No response

Additional context

No response

TonyBagnall commented 1 week ago

Hi, and welcome to aeon. This is related to the use of joblib, but I'm not really sure what's going on. As a sanity check, the code below all works fine. Do you have a complete code example so we can give it a go? Most likely someone else here will have a much better idea. It could be numba related; typing issues often are.

Thanks for the typo spot; we would appreciate a PR to fix it.

from aeon.classification.convolution_based import Arsenal
import numpy as np

X = np.random.random((10,20))
X2 = np.random.random((10,20))
y = np.array([0,1,0,1,0,1,0,1,0,1])
X3 = np.random.random((10,1,20))
X4 = np.random.random((10,3,20))
afc = Arsenal()
afc.fit(X,y)
print(afc.predict(X2))
afc.fit(X3,y)
print(afc.predict(X2))
afc.fit(X4,y)
print(afc.predict(X4))
tim-bsm commented 1 week ago

Hey @TonyBagnall thanks for the quick response. Here's a quick example. This only shows my problem with the types, as this should be the first one to solve.

from aeon.classification.convolution_based import Arsenal
import numpy as np
import joblib

def save_arsenal():
  print("saving")
  X = np.random.random((10, 20))
  y = np.array([0, 1, 2, 3, 3, 1, 0, 0, 2, 1])
  afc = Arsenal(n_estimators=15, n_jobs=-1, num_kernels=5000, random_state=42, rocket_transform='multirocket')
  afc.fit(X, y)
  joblib.dump(afc, "test.joblib")

def get_arsenal():
  print("getting")
  afc: Arsenal = joblib.load("test.joblib")

  # working
  X2 = np.random.random((10, 20))
  print(afc.predict(X2))

  # create new data
  tmp_meas = ['0.0']
  for _ in range(199):
    tmp_meas.append('0.0')
  print(tmp_meas)
  data = np.array([tmp_meas], dtype=np.float64) # this must be wrong?!
  print(data)

  # not working but does with sktime
  print(afc.predict(data))

save_arsenal()
get_arsenal()
MatthewMiddlehurst commented 1 week ago

Running this now. Will take a look and get back to you.

MatthewMiddlehurst commented 1 week ago

So, it is crashing on print(afc.predict(data)) for me. But this makes sense, since it was trained on a dataset with series length 20 while data has series length 200. That's not the error being raised, however, so there's another issue to work through 🙂.

I think you have found an edge case where the transform does not accept a single series, as this works:

  for _ in range(19):
    tmp_meas.append('0.0')
  data = np.array([tmp_meas, tmp_meas], dtype=np.float64)

I think if we revert the changes made to the class in https://github.com/aeon-toolkit/aeon/pull/838 it should work @TonyBagnall?
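Until that edge case is fixed, a user-side workaround might be to pad the collection to two cases, as Matthew's snippet does. A minimal sketch; predict_single is a hypothetical helper, not aeon API, and clf stands for any fitted classifier:

```python
import numpy as np

def predict_single(clf, series):
    # Workaround sketch (not aeon API): duplicate the single series so the
    # transform always sees at least two cases, then keep only the first
    # prediction.
    X = np.vstack([series, series])  # shape (2, n_timepoints)
    return clf.predict(X)[0]
```

Whether this remains necessary depends on how the edge case is resolved upstream.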

tim-bsm commented 1 week ago

To be fair, I just adapted @TonyBagnall's code to quickly produce my error. But yeah, normally the dataset will be around 200, although not always exactly the same amount of data (±20?). But since this difference of ±20 works with sktime, it should work with aeon as well, as they share the same base, right?

So if I understood you correctly, this is a bug which can be fixed on your side, right? If so, can it be fixed fairly simply, or will it take long until it is released?

MatthewMiddlehurst commented 1 week ago

Yeah, we can fix this on our side. Though it may take a while to make a release, since we have a lot of other changes in progress that we don't want to rush through.

The estimator does not accept unequal-length data, so any change in series length from the original should not be accepted. It's a bit strange that it gets as far as the transform where your code is crashing; I will create an issue to throw an appropriate error when unequal-length data is passed where it is not allowed. It does not look like the version I originally wrote in the other package has changed to accept unequal length, so I assume it's a bug there as well.
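The missing validation described here could look roughly like the following. This is an illustrative sketch, not the actual aeon code; check_length is a hypothetical helper, and the stored training length would come from the fitted estimator:

```python
import numpy as np

def check_length(X, n_timepoints_fit):
    """Illustrative guard (not aeon API): raise a clear error when the
    test series length differs from the training series length."""
    if X.shape[-1] != n_timepoints_fit:
        raise ValueError(
            f"series length {X.shape[-1]} differs from training length "
            f"{n_timepoints_fit}; this estimator does not support "
            "unequal-length series"
        )

# A collection of length-200 series against a model trained on length 20
# would then fail with a readable message instead of a numba TypeError.
try:
    check_length(np.zeros((2, 200)), 20)
except ValueError as e:
    print(e)
```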

Regarding your second error, I can run the notebook without any issues on Windows. I don't know if this is a conflict with something else; I would need an example really.

MatthewMiddlehurst commented 1 week ago

That said, it runs the below without issues (I think). Maybe we can configure it to accept unequal length relatively easily. We really need to revamp this code 🙂.

  for _ in range(199):
    tmp_meas.append('0.0')
  data = np.array([tmp_meas, tmp_meas], dtype=np.float64)

Purely an implementation question though; I'm not sure the rocket transformation will output any useful features when comparing series of length 20 vs 200.

tim-bsm commented 1 week ago

For the second part with multiprocessing, the following crashes (I adapted the temporary fix from @MatthewMiddlehurst):

from aeon.classification.convolution_based import Arsenal
import numpy as np
import joblib
import multiprocessing as mp

def save_arsenal():
  print("saving")
  X = np.random.random((10, 20))
  y = np.array([0, 1, 2, 3, 3, 1, 0, 0, 2, 1])
  afc = Arsenal(n_estimators=15, n_jobs=-1, num_kernels=5000, random_state=42, rocket_transform='multirocket')
  afc.fit(X, y)
  joblib.dump(afc, "test.joblib")

def get_arsenal():
  print("getting")
  afc: Arsenal = joblib.load("test.joblib")

  # working
  X2 = np.random.random((10, 20))
  print(afc.predict(X2))

  # create new data
  tmp_meas = ['0.0']
  for _ in range(199):
    tmp_meas.append('0.0')
  print(tmp_meas)
  data = np.array([tmp_meas, tmp_meas], dtype=np.float64) # this must be wrong?!
  print(data)

  # not working but does with sktime
  print(afc.predict(data))

if __name__ == '__main__':
  save_arsenal()

  # this works (even with range(199))
  # get_arsenal()

  # this does not
  mp.Process(target=get_arsenal, args=()).start()
tim-bsm commented 1 week ago

And another part of the second problem was a typo, I think. @TonyBagnall asked me to open a PR, but I'm fairly new to GitHub and haven't really opened a real PR so far, so I'm a bit lost on this one, to be honest.

Regarding your last comment about the 199, @MatthewMiddlehurst: it does work (it can be run with my latest code snippet). I agree that the output is most likely going to be unusable. But as I said, my real values are somewhere around 200 and normally only vary by ±4, sometimes by about ±20. So, as far as I have noticed, Arsenal is stable enough and the outcome is good.

MatthewMiddlehurst commented 1 week ago

The above code snippet actually runs without an exception for me. Not very familiar with the built-in Python multiprocessing module though, to be honest. I'm on Python 3.10, Windows 10.

Don't worry about the typo.

tim-bsm commented 1 week ago

Hmm, strange. For me it gives the error Terminating: fork() called from a process already using Gnu OpenMP, this is unsafe. on Ubuntu with Python 3.12. Can anybody test this on Linux?

TonyBagnall commented 1 week ago

Hi, no worries about making the contribution, but if you want to figure out how to do it, we are more than happy to help. There is a guide here: https://www.aeon-toolkit.org/en/stable/contributing.html, which starts with https://www.aeon-toolkit.org/en/stable/developer_guide/dev_installation.html. Feel free to join our Slack; there are lots of people there happy to help. We have all been in the position of getting started; it can be confusing at first, but you will get used to it soon. I'll look at the rest later.

CodeLionX commented 1 week ago

I can reproduce the error Terminating: fork() called from a process already using Gnu OpenMP, this is unsafe on Linux, Python 3.12, aeon 0.9.0:

```shell
python -c "from aeon import show_versions; show_versions()"
```

```
System:
    python: 3.12.4 | packaged by Anaconda, Inc. | (main, Jun 18 2024, 15:12:24) [GCC 11.2.0]
executable: /home/sebastian/.conda/envs/aeon-tmp/bin/python
   machine: Linux-5.15.0-112-generic-x86_64-with-glibc2.35

Python dependencies:
         pip: 24.0
  setuptools: 69.5.1
scikit-learn: 1.4.2
        aeon: 0.9.0
 statsmodels: None
       numpy: 1.26.4
       scipy: 1.12.0
      pandas: 2.0.3
  matplotlib: None
      joblib: 1.4.2
       numba: 0.59.1
    pmdarima: None
     tsfresh: None
```

OpenMP being incompatible with fork()-based multiprocessing is a known issue. For more details also see https://stackoverflow.com/questions/49049388/understanding-openmp-shortcomings-regarding-fork

I am not familiar with the implementation of Arsenal, but if it uses OpenMP, then this behavior with fork()-based multiprocessing is expected.

You can work around this issue by using a different multiprocessing start method, such as spawn (default on Windows) or forkserver:

if __name__ == '__main__':
  save_arsenal()

  # this works (even with range(199))
  # get_arsenal()

  # this does not
-  mp.Process(target=get_arsenal, args=()).start()
+  ctx = mp.get_context("forkserver")
+  ctx.Process(target=get_arsenal, args=()).start()
tim-bsm commented 1 week ago

Alright, thank you @CodeLionX for letting me know about this. I'll try that tomorrow when I'm back at work and will let you guys know, if it's working.

Just out of curiosity to the others, if they might know - as @CodeLionX said, he does not know about the implementation of Arsenal: Is Arsenal implemented with OpenMP and if so, is there any advantage or reason for that?

Btw, thank you guys for your effort and your quick replies so far :)

MatthewMiddlehurst commented 1 week ago

The implementation uses joblib.Parallel for multithreading (i.e. https://github.com/aeon-toolkit/aeon/blob/main/aeon/classification/convolution_based/_arsenal.py#L309); there may also be some threading via numba in the transform, but I'm not 100% sure on that.

The same goes for most of our classifiers, really.
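For readers unfamiliar with the pattern, the ensemble dispatch follows the standard joblib.Parallel idiom. A simplified sketch, not the actual Arsenal code; fit_one is a stand-in for fitting one ensemble member:

```python
from joblib import Parallel, delayed

def fit_one(seed):
    # Stand-in for fitting a single ensemble member; here it just
    # returns a deterministic value so the dispatch is easy to follow.
    return seed * seed

# One task per estimator; n_jobs controls the worker count, and results
# come back in submission order.
results = Parallel(n_jobs=2)(delayed(fit_one)(s) for s in range(4))
print(results)  # [0, 1, 4, 9]
```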

TonyBagnall commented 1 week ago
So, it is crashing on print(afc.predict(data)) for me. But this makes sense, since it was trained on a dataset with series length 20 while data has series length 200. That's not the error being raised, however, so there's another issue to work through 🙂.

I think you have found an edge case where the transform does not accept a single series, as this works:

  for _ in range(19):
    tmp_meas.append('0.0')
  data = np.array([tmp_meas, tmp_meas], dtype=np.float64)

I think if we revert the changes made to the class in #838 it should work @TonyBagnall?

Could you create a separate issue for this, @MatthewMiddlehurst? It's not the core of the problem here, and accepting a different-length 2D numpy input from train to test is not really supported. I'm not really sure it should be, since it would need the capability to handle unequal length.


TonyBagnall commented 1 week ago

Alright, thank you @CodeLionX for letting me know about this. I'll try that tomorrow when I'm back at work and will let you guys know, if it's working.

Just out of curiosity to the others, if they might know - as @CodeLionX said, he does not know about the implementation of Arsenal: Is Arsenal implemented with OpenMP and if so, is there any advantage or reason for that?

Btw, thank you guys for your effort and your quick replies so far :)

The problem is, I imagine, in the multirocket transformer and numba interacting with things strangely.

TonyBagnall commented 1 week ago

So just to clarify: ignoring the single-instance predict for now, this is OK on Windows, but does not work on Linux? Could this be an issue with all threaded classifiers? I can try tomorrow.

from aeon.classification.convolution_based import Arsenal
import numpy as np
import joblib

def save_arsenal():
  print("saving")
  X = np.random.random((10, 20))
  y = np.array([0, 1, 2, 3, 3, 1, 0, 0, 2, 1])
  afc = Arsenal(n_estimators=15, n_jobs=-1, num_kernels=5000, random_state=42, rocket_transform='multirocket')
  afc.fit(X, y)
  joblib.dump(afc, "test.joblib")

def get_arsenal():
  print("getting")
  afc: Arsenal = joblib.load("test.joblib")

  data = np.random.random((5, 20))
  print(data)
  d = afc.predict(data)
  print(d)

save_arsenal()
get_arsenal()
CodeLionX commented 1 week ago

So just to clarify: ignoring the single-instance predict for now, this is OK on Windows, but does not work on Linux? Could this be an issue with all threaded classifiers? I can try tomorrow.

from aeon.classification.convolution_based import Arsenal
import numpy as np
import joblib

def save_arsenal():
  print("saving")
  X = np.random.random((10, 20))
  y = np.array([0, 1, 2, 3, 3, 1, 0, 0, 2, 1])
  afc = Arsenal(n_estimators=15, n_jobs=-1, num_kernels=5000, random_state=42, rocket_transform='multirocket')
  afc.fit(X, y)
  joblib.dump(afc, "test.joblib")

def get_arsenal():
  print("getting")
  afc: Arsenal = joblib.load("test.joblib")

  data = np.random.random((5, 20))
  print(data)
  d = afc.predict(data)
  print(d)

save_arsenal()
get_arsenal()

Works on my machine (Linux) 🤷🏼

```
saving
getting
[[4.12579970e-01 9.43962492e-01 4.50538839e-01 5.10200269e-01 6.17581249e-01 4.29840622e-01 4.72601500e-01 7.36845702e-01 2.54153572e-01 6.56264568e-01 5.05757277e-01 7.03395391e-01 8.79386466e-01 1.53694733e-01 1.83096689e-01 2.38425862e-01 1.06455586e-01 8.81482196e-01 2.54574305e-01 6.34024989e-01]
 [1.43721642e-01 5.78407747e-01 2.48709225e-01 4.06285339e-01 1.99512839e-01 8.34101393e-01 4.45591048e-01 5.40979505e-01 2.80944406e-01 9.85203389e-01 3.27930333e-01 1.10382293e-01 3.04353997e-01 9.31727161e-01 8.90602119e-01 8.04148287e-01 4.05208332e-01 5.61208805e-01 6.85359832e-01 4.61386323e-01]
 [9.32059296e-01 9.95112689e-01 9.42499646e-01 9.18160027e-01 9.17201937e-01 2.20693115e-01 8.54205515e-01 2.47529375e-01 3.72458403e-01 7.06726359e-01 9.61085329e-01 5.53309263e-05 2.17180248e-01 8.28294063e-01 8.66669904e-01 3.39003289e-01 9.30323336e-01 1.78459067e-01 2.66497503e-01 5.48359740e-01]
 [5.22741447e-01 7.39107382e-01 6.61867014e-01 2.28299613e-01 4.81490920e-01 7.15845610e-02 4.48867231e-01 1.45556605e-02 9.60642561e-01 5.85185808e-01 9.40618925e-01 7.77600943e-01 8.69418403e-01 8.75651112e-01 3.92136107e-01 4.20679491e-01 5.47434726e-01 2.76389936e-01 4.95750020e-01 2.63786814e-01]
 [5.06840887e-01 1.81527691e-01 6.77100693e-01 8.94554339e-03 1.99894960e-01 9.16988947e-02 7.09162424e-01 9.83784239e-01 2.78857075e-01 9.31721404e-01 4.45364379e-01 4.33336009e-01 6.59150849e-01 7.22308270e-01 4.38304524e-01 3.03662190e-01 5.69611432e-01 5.92996430e-01 6.02911177e-01 6.60528646e-01]]
[1 3 2 1 1]
```
tim-bsm commented 1 week ago

So just to clarify: ignoring the single-instance predict for now, this is OK on Windows, but does not work on Linux? Could this be an issue with all threaded classifiers? I can try tomorrow.

from aeon.classification.convolution_based import Arsenal
import numpy as np
import joblib

def save_arsenal():
  print("saving")
  X = np.random.random((10, 20))
  y = np.array([0, 1, 2, 3, 3, 1, 0, 0, 2, 1])
  afc = Arsenal(n_estimators=15, n_jobs=-1, num_kernels=5000, random_state=42, rocket_transform='multirocket')
  afc.fit(X, y)
  joblib.dump(afc, "test.joblib")

def get_arsenal():
  print("getting")
  afc: Arsenal = joblib.load("test.joblib")

  data = np.random.random((5, 20))
  print(data)
  d = afc.predict(data)
  print(d)

save_arsenal()
get_arsenal()

As this example is almost the same as the following (my code posted a few messages ago), it works fine for me.

def get_arsenal():
  print("getting")
  afc: Arsenal = joblib.load("test.joblib")

  # working -> almost same as this
  X2 = np.random.random((10, 20))
  print(afc.predict(X2))

The second part of @MatthewMiddlehurst's answer worked for me and did the trick. So, as he said, it seems that a 2D numpy array with only one entry is somehow not accepted. For me, I'll just use this as a workaround until it might be fixed on your side.

So, it is crashing on print(afc.predict(data)) for me. But this makes sense, since it was trained on a dataset with series length 20 while data has series length 200. That's not the error being raised, however, so there's another issue to work through 🙂.

I think you have found an edge case where the transform does not accept a single series, as this works:

  for _ in range(19):
    tmp_meas.append('0.0')
  data = np.array([tmp_meas, tmp_meas], dtype=np.float64)

I think if we revert the changes made to the class in #838 it should work @TonyBagnall?

tim-bsm commented 1 week ago

You can work around this issue by using a different multiprocessing start method, such as spawn (default on Windows) or forkserver:

if __name__ == '__main__':
  save_arsenal()

  # this works (even with range(199))
  # get_arsenal()

  # this does not
-  mp.Process(target=get_arsenal, args=()).start()
+  ctx = mp.get_context("forkserver")
+  ctx.Process(target=get_arsenal, args=()).start()

So I just tried it. For this case, I agree, it worked. Thank you @CodeLionX. But in my case, I need the process to be stored in a list, as I have multiple processes, and for that the given method does not work. Simple code below:

from aeon.classification.convolution_based import Arsenal
import numpy as np
import joblib
import multiprocessing as mp

def save_arsenal():
  print("saving")
  X = np.random.random((10, 20))
  y = np.array([0, 1, 2, 3, 3, 1, 0, 0, 2, 1])
  afc = Arsenal(n_estimators=15, n_jobs=-1, num_kernels=5000, random_state=42, rocket_transform='multirocket')
  afc.fit(X, y)
  joblib.dump(afc, "test.joblib")

def get_arsenal():
  print("getting")
  afc: Arsenal = joblib.load("test.joblib")

  # working
  X2 = np.random.random((10, 20))
  print(afc.predict(X2))

  # create new data
  tmp_meas = ['0.0']
  for _ in range(199):
    tmp_meas.append('0.0')
  print(tmp_meas)
  data = np.array([tmp_meas, tmp_meas], dtype=np.float64) # this must be wrong?!
  print(data)

  # not working but does with sktime
  print(afc.predict(data))

if __name__ == '__main__':
  save_arsenal()

  # this works (even with range(199))
  # get_arsenal()

  ctx = mp.get_context("forkserver")  

  # this does as well
  # ctx.Process(target=get_arsenal, args=()).start()

  # this does not
  p = ctx.Process(target=get_arsenal, args=())
  p.start()

Setting the start method like so

if __name__ == '__main__':
  mp.set_start_method("forkserver")

  save_arsenal()
  ...

as stated in some StackOverflow posts, gives me the error RuntimeError: context has already been set.
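One way around that RuntimeError is set_start_method's force flag, which replaces an already-set context instead of raising. A minimal sketch; alternatively, keep using mp.get_context(...) and create every Process and Queue from that same context object:

```python
import multiprocessing as mp

# set_start_method may only be called once per process and raises
# "RuntimeError: context has already been set" on a second call.
# force=True replaces the existing choice instead of raising.
mp.set_start_method("forkserver", force=True)
print(mp.get_start_method())  # forkserver
```

Note that forkserver is only available on Unix; on Windows, spawn is the default and only option.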

CodeLionX commented 1 week ago

Are you sure? There should be no difference in

ctx.Process(target=get_arsenal, args=()).start()

and

p = ctx.Process(target=get_arsenal, args=())
p.start()

Both work for me with the "forkserver" and "spawn" start methods.

tim-bsm commented 1 week ago

Are you sure? There should be no difference in

ctx.Process(target=get_arsenal, args=()).start()

and

p = ctx.Process(target=get_arsenal, args=())
p.start()

Both work for me with the "forkserver" and "spawn" start methods.

Yeah, I thought the same. That's why I was a bit confused.

It turns out that it only occurs in my real project and not in the little example I'm posting here, though I thought I had tried it there as well before asking you guys. My bad, sorry. I tracked it down, and I think it has something to do with me using shared queues between the processes. At least now, as I switched the queue from mp.Queue() to ctx.Queue(), I'm getting a new error ^^: AttributeError: Can't pickle local object 'CDLL.__init__.<locals>._FuncPtr'

tim-bsm commented 1 week ago

I can't get the queue to work. Does anyone have experience with this and might be able to help me?

from aeon.classification.convolution_based import Arsenal
import numpy as np
import joblib
import multiprocessing as mp

def save_arsenal():
  print("saving")
  X = np.random.random((10, 20))
  y = np.array([0, 1, 2, 3, 3, 1, 0, 0, 2, 1])
  afc = Arsenal(n_estimators=15, n_jobs=-1, num_kernels=5000, random_state=42, rocket_transform='multirocket')
  afc.fit(X, y)
  joblib.dump(afc, "test.joblib")

def get_arsenal(q: mp.Queue):
  q.put("test")
  print("getting")
  afc: Arsenal = joblib.load("test.joblib")

  # working
  X2 = np.random.random((10, 20))
  print(afc.predict(X2))

  # create new data
  tmp_meas = ['0.0']
  for _ in range(199):
    tmp_meas.append('0.0')
  print(tmp_meas)
  data = np.array([tmp_meas, tmp_meas], dtype=np.float64) # this must be wrong?!
  print(data)

  # not working but does with sktime
  print(afc.predict(data))

if __name__ == '__main__':
  save_arsenal()

  # this works (even with range(199))
  # get_arsenal()

  ctx = mp.get_context("forkserver")
  q = mp.Queue() # -> RuntimeError: A SemLock created in a fork context is being shared with a process in a spawn context. This is not supported. Please use the same context to create multiprocessing objects and Process.
  q = ctx.Queue() # -> FileNotFoundError: [Errno 2] No such file or directory
  p = ctx.Process(target=get_arsenal, args=(q, ))
  p.start()
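Editorial sketch of what should avoid the SemLock error: create the Queue and the Process from the same context object, since mixing an mp.Queue() from the default fork context with a spawn/forkserver Process is exactly what the RuntimeError above complains about. The spawn start method is used here because it worked for CodeLionX and may also sidestep the FileNotFoundError seen with ctx.Queue() under forkserver; whether it does in this particular environment is untested:

```python
import multiprocessing as mp

def worker(q):
    # Runs in the child process; sends a value back to the parent.
    q.put("test")

if __name__ == "__main__":
    # Create the Queue and the Process from the SAME context object.
    ctx = mp.get_context("spawn")
    q = ctx.Queue()
    p = ctx.Process(target=worker, args=(q,))
    p.start()
    print(q.get())
    p.join()
```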