@MC-Dave Thank you for your question. We've looked at AOT before but ultimately abandoned it as it caused the package import time to increase dramatically. Having said that, I'm really surprised that it's taking a minute to compile a function. On my machine, it might take 2-3 seconds sometimes but never longer. Are you sure that it's not something else, like actual compute time?

Regardless, I have explored the `numba` documentation and posted a question to their Gitter channel for some guidance. Note that cached functions can only be used on the same machine where they were compiled and are not portable to other hardware architectures. At least, that's how I understand it.
It looks like it might be possible to implement something like:
```python
# stump.py

def stump(..., cache=False):
    .
    .
    .
    if cache:
        _stump.enable_caching()
    P, I = _stump(
        T_A,
        T_B,
        m,
        M_T,
        μ_Q,
        Σ_T_inverse,
        σ_Q_inverse,
        M_T_m_1,
        μ_Q_m_1,
        T_A_subseq_isfinite,
        T_B_subseq_isfinite,
        T_A_subseq_isconstant,
        T_B_subseq_isconstant,
        diags,
        ignore_trivial,
    )
    .
    .
    .
```
The cached function will be saved in a directory found in one of the following places:
- In-tree cache. Put the cache next to the corresponding source file under a `__pycache__` directory following how `.pyc` files are stored.
- User-wide cache. Put the cache in the user's application directory using `appdirs.user_cache_dir` from the Appdirs package.
- IPython cache. Put the cache in an IPython specific application directory. Stores are made under `numba_cache` in the directory returned by `IPython.paths.get_ipython_cache_dir()`.
This is the same place that `numba` will look for the cached files. Note that this path can be overridden by specifying the `NUMBA_CACHE_DIR` environment variable.
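For illustration, here's a minimal sketch of overriding the cache location via that environment variable (the path below is made up; since `numba` reads this setting when it loads, it's safest to set it before importing `numba` or any package that uses it):

```python
import os

# Hypothetical cache location; set before numba (or stumpy) is imported
os.environ["NUMBA_CACHE_DIR"] = "/tmp/my_numba_cache"

from numba import njit  # numba will now write cache files under the path above


@njit(cache=True)
def add_one(x):
    return x + 1.0


add_one(1.0)  # first call compiles and writes the cache; later runs reuse it
```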
Here's a simple implementation example:
```python
#!/usr/bin/env python

import numpy as np
from numba import njit


@njit
def inner_func(a):
    for i in range(len(a)):
        a[i] = a[i] + i


def wrapper(a, cache=False):
    if cache:
        inner_func.enable_caching()
    inner_func(a)


if __name__ == "__main__":
    a = np.random.rand(1_000_000)
    wrapper(a, cache=True)
```
The cache files will be stored in a `__pycache__` subdirectory within the same directory as this script, and it will be "faster" if you run it again (even without `cache=True`). Though, one needs to take care and clear the cache whenever the `njit` functions are modified.
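If you want to verify that the caching actually kicked in, `numba` writes an index file (`.nbi`) plus compiled data files (`.nbc`) for each cached function. A small sketch, assuming the in-tree cache location is used:

```python
import pathlib

# List numba's cache files (.nbi index and .nbc data) next to this script
for f in sorted(pathlib.Path("__pycache__").glob("*.nb[ic]")):
    print(f.name)
```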
@seanlaw Thank you for all this info, very helpful. I was not aware that one could enable caching on a wrapped numba function in that way.
I haven't done extensive profiling on stumpy/numba compilation. It does appear that when I import the stumpy package in a python program, it hangs for a while before continuing. When I remove the package import there is no delay in execution.
I will try to experiment with `.enable_caching()` and stumpy. Do you know which stumpy functions one would need to call `.enable_caching()` on to cache the bulk of the compiled functions needed for stumpy?
`stumpy.match` comes to mind, though I imagine there are more.
Thanks again for your help!
> Thank you for all this info, very helpful. I was not aware that one could enable caching on a wrapped numba function in that way.

Neither did I. I learned it by asking in the `numba` Gitter channel! They have amazing and knowledgeable contributors there.
> When I remove the package import there is no delay in execution.
In case it matters, can you make sure that you are using the most up-to-date version of STUMPY?
> I will try to experiment with `.enable_caching()` and stumpy. Do you know which stumpy functions one would need to call `.enable_caching()` on to cache the bulk of the compiled functions needed for stumpy?
There's no easy answer but it would likely be any function that is decorated with the `@njit` decorator. One would need to scan through the code base to get a better estimate, but `numba` is used heavily in STUMPY to speed up all of our computations.
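As a rough sketch (not something STUMPY provides), one could walk a module and enable caching on every `numba` dispatcher it finds, rather than enumerating the functions by hand. The `Dispatcher` import path below is for recent `numba` versions and may differ in older releases:

```python
import stumpy
from numba.core.dispatcher import Dispatcher


def enable_module_caching(module):
    """Call .enable_caching() on every njit-compiled function in `module`."""
    for name in dir(module):
        obj = getattr(module, name)
        if isinstance(obj, Dispatcher):  # njit functions are Dispatcher objects
            obj.enable_caching()


enable_module_caching(stumpy.core)  # e.g., stumpy.core holds many njit helpers
```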
@seanlaw I was on stumpy version `1.10.0`. There was definitely a ~45 second compile time on `import stumpy` when first importing in a new interpreter session (e.g., running a python program via `python program.py`, or starting an interactive session via `python`).

After updating to `1.11.1` the import time is ~2-3 seconds. Sometimes it's as simple as updating the package; apologies for not doing that prior to raising the issue.
From my brief understanding of the numba caching system, it seems that it may be worthwhile to add the `cache=True` argument to some of stumpy's compiled functions, as it could further reduce that 2-3 second overhead. It may cause more side-effects than desired, but if it worked "out-of-the-box" for most users, I'd think it would be worth adding.
Thank you again for your help!
> After updating to `1.11.1` the import time is ~2-3 seconds. Sometimes it's as simple as updating the package; apologies for not doing that prior to raising the issue.
No worries! I looked back in our commits and remembered that I had to fix this after releasing `1.10.0`, as we had added function signatures to all of our `njit` functions for consistency. However, it turned out that doing so caused `numba` to compile all of the functions at import time and, hence, the increase from 2-3 seconds to 45 seconds. Once this was reverted, everything was good from an import standpoint.
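To illustrate the difference with a small standalone sketch (not STUMPY code): passing an explicit signature to `@njit` makes `numba` compile at decoration time, i.e., at import, whereas a bare `@njit` defers compilation to the first call:

```python
from numba import njit


@njit("f8(f8)")  # explicit signature: compiled eagerly, when the module is imported
def eager(x):
    return 2.0 * x


@njit  # no signature: compiled lazily, on the first call
def lazy(x):
    return 2.0 * x
```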
> From my brief understanding of the numba caching system, it seems that it may be worthwhile to add the `cache=True` argument to some of stumpy's compiled functions, as it could further reduce that 2-3 second overhead. It may cause more side-effects than desired, but if it worked "out-of-the-box" for most users, I'd think it would be worth adding.
After conversing with the `numba` devs, it isn't recommended to turn on `cache=True` by default unless users know what they are doing. There are a lot of nasty side effects that ultimately make my job much harder from a user support standpoint (i.e., debugging why something isn't behaving as expected due to the presence of a precompiled function that wasn't cleared). Also, note that caching only works on the same hardware architecture (i.e., the cached functions are not portable to other hardware architectures and so you'd ALWAYS need to recompile). In the case where you are staying on the same hardware, caching is great. I'd like to keep this ticket open as a longer-term feature request and also allow other users to chime in and give a 👍 for adding this feature. For now, I think doing it manually with `.enable_caching()` is your best bet.
@seanlaw One last thought on enabling caching for stumpy by default: would the aforementioned side-effects be mitigated by clearing the numba cache as part of the `setup.py`/package install process?

From your side, as the developer of stumpy, you'd want that caching to be disabled because you are editing the core stumpy compiled functions and would thus need to regularly clear the cache. However, for the end-user (those installing the package, like myself), would it not be sufficient to clear the numba cache on package install/upgrade? That way, by default, the functions are only compiled once for a given install/update on a given hardware stack.
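Something along those lines might look like the following (purely a hypothetical sketch of the idea, not an actual STUMPY install hook). Since `numba`'s on-disk cache consists of `.nbi`/`.nbc` files, an install/upgrade step could simply delete them:

```python
import pathlib


def clear_numba_cache(package_dir):
    """Delete numba's .nbi/.nbc cache files under package_dir (hypothetical hook)."""
    removed = 0
    for pattern in ("*.nbi", "*.nbc"):
        for f in pathlib.Path(package_dir).rglob(pattern):
            f.unlink()
            removed += 1
    return removed
```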
I imagine it is not that simple, or at the least, other side-effects are incurred when AOT compiling.
@MC-Dave Unfortunately, having little to no experience with AOT, I have no idea what the side effects may or may not be and, for 99.9% of STUMPY use cases, JIT should be "good enough". To some extent, I'd want this to be handled by `numba`.
@seanlaw One use case where jit caching is very desirable is when stumpy is being used across multiple processes/threads and/or when multiprocessing is used.

jit without caching is only beneficial if the jit compilation is done in a long-running process: the initial jit compilation overhead is incurred at the beginning and the performance gains are reaped thereafter. However, in a use case where, say, `stumpy.stump` is being called by multiple short-lived processes with no jit caching, jit actually results in a significant performance loss, because the entire jit compilation stage is practically useless; each process compiles, runs once, and terminates.

Consider that executing `stumpy.stump` in a single process can take around 13 seconds, probably around 11 of which are the jit compilation. If stumpy is modified and all the njit decorators are commented out, executing `stumpy.stump` takes around 2 seconds.

Even when jit caching is enabled and the cache files have been created, the first execution of `stumpy.stump` takes between 1 and 2 seconds (depending on the current workload) and any subsequent executions of `stumpy.stump` in that same process take in the region of 0.009974628977943212 seconds.
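To make the compile-versus-compute split concrete, here is a rough way to measure it (illustrative only; the absolute numbers depend on the machine and on whether cache files already exist):

```python
import time

import numpy as np
import stumpy

T = np.random.rand(10_000)
m = 50

start = time.perf_counter()
stumpy.stump(T, m)  # first call: pays the JIT compilation cost
print(f"first call:  {time.perf_counter() - start:.2f}s")

start = time.perf_counter()
stumpy.stump(T, m)  # second call in the same process: compiled code is reused
print(f"second call: {time.perf_counter() - start:.2f}s")
```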
Having read this issue, but also having a requirement for a version that allows for numba caching, perhaps it could be added based on the presence of the `NUMBA_CACHE_DIR` environment variable. All of Sean's comments above about how difficult it can be to debug caching issues, etc., are very valid, but perhaps when a user knows what they are doing and understands the implications and the requirements that fall on them, stumpy should have the option to allow them to implement caching.

@seanlaw I am happy to do a PR of a version that allows for caching based on the presence of the `NUMBA_CACHE_DIR` environment variable and/or another or different one, say `STUMPY_CACHE_ENABLED`. My version is now working (caveat #777) and defaults to `cache=False`; the user would have to explicitly set the environment variable for caching to be enabled, and there could be some standard cache warning boilerplate in the docs to advise them that they are responsible for managing/pruning the cache, etc.
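To be explicit about what I have in mind, a stripped-down sketch (`STUMPY_CACHE_ENABLED` is my invented variable name here, not an existing STUMPY setting):

```python
import os

from numba import njit

# Opt-in: caching stays off unless the user explicitly sets the variable
STUMPY_CACHE = os.environ.get("STUMPY_CACHE_ENABLED", "0") == "1"


@njit(cache=STUMPY_CACHE)
def _example_kernel(a):
    return a + 1.0
```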
It would greatly help me if it could be added, otherwise I have to maintain my own fork of stumpy going forward, which is always a drag ;)
Just some thoughts.
> One use case where jit caching is very desirable is when stumpy is being used across multiple processes/threads and/or when multiprocessing is used.
@earthgecko I think using `multiprocessing` is your issue and is somewhat of an anti-pattern relative to how `stumpy` was designed. Natively, unlike `numpy`, which typically uses a single thread, `stumpy` already uses all available threads on your local machine to compute a matrix profile and, in my opinion, multiprocessing would not actually speed up any of your calculations, as any additional time series would need to wait for resources to become available (OR waste time cycling through each matrix profile for each time series and inefficiently moving data in and out of memory).
If your short-lived processes are calling `stumpy.stump` infrequently, then perhaps a different architecture/design is desirable (see one potential suggestion below).
> However, in a use case where, say, `stumpy.stump` is being called by multiple short-lived processes with no jit caching, jit actually results in a significant performance loss, because the entire jit compilation stage is practically useless; each process compiles, runs once, and terminates.
In this particular case, I would personally change the architecture and set up a simple STUMPY RESTful endpoint that accepts a time series and window size as input, computes the matrix profile, and returns the matrix profile following an HTTP POST request. Then, have each short lived process hit the REST endpoint and pass over the inputs in order to compute the matrix profile. Without knowing your situation and without getting into a philosophical debate (I acknowledge that this is somewhat opinionated!), this would be the more appropriate way to ensure modularity and scalability.
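As a concrete illustration of that suggestion, here is a minimal sketch using Flask (any web framework would do; the endpoint name and payload format are made up):

```python
import numpy as np
import stumpy
from flask import Flask, jsonify, request

app = Flask(__name__)


@app.route("/matrix_profile", methods=["POST"])
def matrix_profile():
    payload = request.get_json()
    T = np.asarray(payload["T"], dtype=np.float64)
    m = int(payload["m"])
    mp = stumpy.stump(T, m)
    # Column 0 is the matrix profile, column 1 the matrix profile indices
    return jsonify({"P": mp[:, 0].tolist(), "I": mp[:, 1].tolist()})


if __name__ == "__main__":
    # One long-lived process pays the JIT compilation cost exactly once
    app.run()
```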
At the end of the day, we just have to be willing to accept that "STUMPY can't be everything for everyone" and our goal is to be great for the majority/dominant use case (i.e., a long-running process). STUMPY is 100% volunteer-driven and we have very limited time/resources, so we need to be selective in what functionality we choose to add, especially if it will add unnecessary strain to our support and lead to burnout. We value our users but I hope that makes sense.
Hi @seanlaw
Thanks for the thoughts, I do not disagree; you just opened it up for other users to chime in :) These things can be somewhat opinionated, as is always the case. There are indeed many ways to skin a cat, or a stumpy, or an isolation forest, or a spectral residual; lots of cats :) Spawn a cat with x and analyse t, type of thing. I was just hoping to get away with an existing jit cache cat rather than having to make or get another cat specifically from stumpy :) Not sure why I am using the cats analogy as I have 6 dogs and no cats :)
What do you call a cat with no feet?
@earthgecko I do appreciate you chiming in and providing your use case. I will definitely find some time to see if we can find a creative but reasonable solution.
First, let's see if we can figure out the root of the problem you were having in the other issue that you mentioned.
Closing this for now as the solution appears to be sufficient but please feel free to reopen/comment if otherwise
Background
Is there a way to set the `cache=True` kwarg on the various jit/njit decorators used throughout the stumpy codebase?

Curious because there is a significant overhead to invoking a python program which references stumpy. When running small jobs (e.g., a program that takes <5 minutes to run), the compilation time on something like a compute cluster node (with ~2 cores or less) can take up to a minute or longer at times.

In such cases, it would be beneficial to be able to compile stumpy's various numba-compiled functions beforehand, so that future program executions don't need to re-compile every time.