TDAmeritrade / stumpy

STUMPY is a powerful and scalable Python library for modern time series analysis
https://stumpy.readthedocs.io/en/latest/
Other
3.68k stars 322 forks source link

Improving Multi-processor Performance #959

Open seanlaw opened 9 months ago

seanlaw commented 9 months ago

This new paper titled "Exploring Multiprocessor Approaches to Time Series Analysis" claims to significantly improve the performance of matrix profile calculations. We should consider looking into this.

Additionally, we should consider if it may be practical to refactor some parts of our code a la cache oblivious algorithms

JaKasb commented 9 months ago

I have 2 open questions:

If the answer to both questions is a "No", I assume that this paper is not useful for stumpy.

seanlaw commented 9 months ago

@JaKasb These are good questions. I don't think numba has Transactional Memory (TM) support. However (I only came across this paper yesterday and only skimmed it), with some creativity/thinking, it might be possible to mimic hardware TM with software TM, though this seems to be an on-going and active area of research. What I am particularly interested in is understanding how a tiling scheme might help speed up the GPU matrix profile computation (and possibly the CPU computation). We've looked at this in the past but it got messy. If none of this is usable then at least we've captured the research, considered it, and discussed it at a minimum.

Certainly, I'm not looking for 3x speedup but even a modest 20% speedup might be worth exploring if there isn't too much added complexity by, say, switching over to tiles for example. @JaKasb Do you see anything else that we can do to improve the speed of our CPU or GPU matrix profile calculations? Any low hanging fruits?

JaKasb commented 9 months ago

All that advanced tiling stuff goes over my head. After the STOMP paper, the papers on speedups and optimizations papers became hard to understand for me. I fully agree with you that such tiling optimizations get messy, and add a lot of code complexity.

Readable code also has intrinsic worth.

In my opinion, the tradeoff is not worth the effort. I mean sure, you can improve the loops for cache locality and whatnot, but it is neccesary to modify all loops and nested variables. Ultimately one would practically have to rewrite stumpy from scratch.

Furthermore, for higher speed one can use approximate matrixprofile algorithms or use multiple GPUs. For my use cases stumpy is already fast enough.