That's deeply weird. If you can't provide a reproducing example, can you at least describe more about what the environment that triggers slow-down looks like?
Nothing in librosa has changed relative to the resampy updates.
The only thing that comes to mind is that resampy 0.3.1 has a bit more going on at compilation time than 0.3.0 (different dtype targets are explicitly listed), but a 20x slowdown in unrelated code makes zero sense to me.
We observed the slowdown in CircleCI while running our tests; the full instructions are in this CircleCI config file. I realize it may be difficult to reproduce, but basically all jobs that install the torch-speech dependency were getting really slow, so anything like:
pip install transformers[sklearn,torch,testing,sentencepiece,torch-speech]
Our best guess is that it somehow changed things in Numba/NumPy that then reverted to a slow version, but nothing more concrete than that.
Our best guess is that it somehow changed things in Numba/NumPy that then reverted to a slow version, but nothing more concrete than that.
Strange. The dependencies have not changed from 0.3.0 to 0.3.1, so I don't see why that would happen.
The only thing that changed was how we handle multi-channel resampling (see PR #104), which now uses numba generalized ufunc vectorization so that the interpolation loop operates independently on each channel. In moving to @guvectorize, I had to explicitly spell out the dtype targets, which I could envision adding a little overhead, but only when the functions are actually called (and then, only the first time).
The only time this should actually have an observable effect is when resampling multichannel signals, and then, it should be strictly better in 0.3.1 (see #102 for discussion). If you're not doing multi-channel, performance should be equivalent (and match or beat that of the 0.2 series).
The other thing that could be happening here is that resampy is going parallel and interfering with other processes in your CI. You could check this quickly by limiting the number of threads that numba can take (see docs) in your environment. At present, resampy exposes an interface to disable parallelism, but librosa does not (see this issue for what needs to happen there), so a global override is probably your best strategy for determining if this is the culprit if you're only accessing resampy through librosa.
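For what it's worth, a minimal sketch of that kind of global override might look like the following (purely diagnostic; NUMBA_NUM_THREADS and numba.get_num_threads are standard numba knobs, and the librosa.resample call is just a stand-in for whatever resampling your tests actually do):

# Diagnostic sketch: cap numba's thread pool before anything compiles,
# to test whether numba parallelism is competing with other CI processes.
import os
os.environ["NUMBA_NUM_THREADS"] = "1"  # must be set before numba initializes its threading layer

import numpy as np
import numba
import librosa

y = np.random.randn(22050).astype(np.float32)
y_16k = librosa.resample(y, orig_sr=22050, target_sr=16000)  # reaches resampy under the hood
print("numba threads:", numba.get_num_threads(), "| resampled shape:", y_16k.shape)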
@sgugger curious if you've made any headway on this?
We encountered the same weird problem ...
The inference speed of the transformer slows down with resampy 0.3.1.
Downgrading resampy (0.3.1 -> 0.2.2) solved the problem.
@vBaiCai thanks - can you provide any kind of hints for reproducing this behavior though?
I tried to reproduce this, even using CircleCI's convenience docker image running on a GCP VM, which shows no issue. So far I can only see this issue when the code is running on CircleCI (with the docker executor).
@vBaiCai: do you also see this issue on CircleCI, or some other environments?
The issue occurs after the commit a36275670f284e8f09d71ef3515dc7513aeba6e6
(fixed #102, fast resampling in high dimensions).
With some more investigation, here are my findings:
- parallel=True in _resample_loop_p = jit(nopython=True, nogil=True, parallel=True)(_resample_loop) causes the problem.
- Without @guvectorize (for resample_f_p): no slowdown.
- With parallel=False: no slowdown.
- Timing with _resample_loop_p = jit(nopython=True, nogil=True, parallel=True)(_resample_loop):
- Timing with _resample_loop_p = jit(nopython=True, nogil=True, parallel=False)(_resample_loop):
import torch
from torch import nn
from tqdm import tqdm
"""Numba implementation of resampler"""
from numba import (
guvectorize,
float32,
float64,
jit,
prange,
int16,
int32,
int64,
complex64,
complex128,
)
def _resample_loop(x, t_out, interp_win, interp_delta, num_table, scale, y):
    index_step = int(scale * num_table)
    time_register = 0.0

    n = 0
    frac = 0.0
    index_frac = 0.0
    offset = 0
    eta = 0.0
    weight = 0.0

    nwin = interp_win.shape[0]
    n_orig = x.shape[0]
    n_out = t_out.shape[0]

    for t in prange(n_out):
        time_register = t_out[t]

        # Grab the top bits as an index to the input buffer
        n = int(time_register)

        # Grab the fractional component of the time index
        frac = scale * (time_register - n)

        # Offset into the filter
        index_frac = frac * num_table
        offset = int(index_frac)

        # Interpolation factor
        eta = index_frac - offset

        # Compute the left wing of the filter response
        i_max = min(n + 1, (nwin - offset) // index_step)
        for i in range(i_max):
            weight = (
                interp_win[offset + i * index_step]
                + eta * interp_delta[offset + i * index_step]
            )
            y[t] += weight * x[n - i]

        # Invert P
        frac = scale - frac

        # Offset into the filter
        index_frac = frac * num_table
        offset = int(index_frac)

        # Interpolation factor
        eta = index_frac - offset

        # Compute the right wing of the filter response
        k_max = min(n_orig - n - 1, (nwin - offset) // index_step)
        for k in range(k_max):
            weight = (
                interp_win[offset + k * index_step]
                + eta * interp_delta[offset + k * index_step]
            )
            y[t] += weight * x[n + k + 1]


_resample_loop_p = jit(nopython=True, nogil=True, parallel=True)(_resample_loop)


@guvectorize(
    [
        (int16[:], float64[:], float64[:], float64[:], int32, float32, int16[:]),
        (int32[:], float64[:], float64[:], float64[:], int32, float32, int32[:]),
        (int64[:], float64[:], float64[:], float64[:], int32, float32, int64[:]),
        (float32[:], float64[:], float64[:], float64[:], int32, float32, float32[:]),
        (float64[:], float64[:], float64[:], float64[:], int32, float32, float64[:]),
        (complex64[:], float64[:], float64[:], float64[:], int32, float32, complex64[:]),
        (complex128[:], float64[:], float64[:], float64[:], int32, float32, complex128[:]),
    ],
    "(n),(m),(p),(p),(),()->(m)",
    nopython=True,
)
def resample_f_p(x, t_out, interp_win, interp_delta, num_table, scale, y):
    _resample_loop_p(x, t_out, interp_win, interp_delta, num_table, scale, y)


class DummyModel(nn.Module):
    def __init__(self, n_layers, dim):
        super().__init__()
        self.layers = [nn.Linear(dim, dim) for _ in range(n_layers)]
        self.activation = nn.Tanh()

    def forward(self, hidden_states):
        for layer in self.layers:
            hidden_states = layer(hidden_states)
            hidden_states = self.activation(hidden_states)
        return hidden_states


if __name__ == "__main__":
    batch_size = 2
    seq_len = 16384

    # dim = 128
    # model = DummyModel(n_layers=16, dim=dim)
    # input = torch.ones(batch_size, seq_len, dim)
    #
    # for i in tqdm(range(16)):
    #     output = model(input)
    # print(output.shape)

    from transformers import AutoModelForSpeechSeq2Seq

    model = AutoModelForSpeechSeq2Seq.from_pretrained("hf-internal-testing/tiny-random-speech-encoder-decoder")

    encoder_input = torch.ones(batch_size, seq_len, dtype=torch.float32)
    decoder_input = torch.ones(batch_size, 128, dtype=torch.int32)

    for i in tqdm(range(16)):
        output = model(inputs=encoder_input, decoder_input_ids=decoder_input)
    print(output.logits.shape)
Actually, I also observe (some) slowdown in other environments. For example, on a GCP VM:
- parallel=True: 30.0 it/s
- parallel=False: 50.0 it/s
(but this doesn't cause a real problem, as both are fast enough)
Thanks @ydshieh for the detailed report!
It doesn't matter if the methods are called or not. As long as they are imported, the issue occurs.
This to me says that the issue is the jit-compile time, rather than the run time. I wonder if adding cache=True here could help resolve things?
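For concreteness, the change being floated is roughly the following (an illustrative sketch only, not shipped code; _resample_loop is the pure-Python loop from the script above, and _resample_loop_s names its serial counterpart here):

from numba import jit

# Hypothetical variant: ask numba to cache the compiled loops on disk so the
# compilation cost is only paid the first time in a given environment.
_resample_loop_s = jit(nopython=True, nogil=True, parallel=False, cache=True)(_resample_loop)
_resample_loop_p = jit(nopython=True, nogil=True, parallel=True, cache=True)(_resample_loop)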
We have some confirmation over in librosa that this is indeed a jit compilation issue and not a runtime issue per se.
Unfortunately, adding cache=True to the build to amortize this cost triggers a segfault in numba across sessions. This is apparently a known and long-standing issue: https://github.com/numba/numba/issues/4807
This is where it gets a little dicey from our side:
1. We can cache resample_f_s without issue. Caching on resample_f_p breaks.
2. The serial and parallel builds share the same inner loop function, and cache=True on these would result in either the parallel implementation being used for the serial or vice versa. Definitely not what we want. This too is a known issue: https://github.com/numba/numba/issues/6845
(1) means that we always have to pay a compilation cost for the parallel import. Now, we could get around this by breaking into submodules and using some lazy-import tricks to defer the parallel build until it's absolutely necessary, but that makes me a bit nervous.
(2) means that at present, we have to pay the compilation cost twice on import until the upstream issue is resolved. I don't see a clean way around this, but I have a branch working right now where the inner loop function is compiled and cached twice using deep copies to avoid collisions, and that looks like it might be working.
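To sketch that idea (illustrative only, not the actual branch; _copy_func is a hypothetical helper and _resample_loop is the pure-Python loop from the script above): compiling two independent copies of the inner loop means the serial and parallel builds don't share a cache entry.

import types
from numba import jit

def _copy_func(f):
    # Hypothetical helper: build an independent function object with the same code,
    # so each compiled variant gets its own cache entry.
    g = types.FunctionType(f.__code__, f.__globals__, name=f.__name__,
                           argdefs=f.__defaults__, closure=f.__closure__)
    g.__dict__.update(f.__dict__)
    return g

_resample_loop_s = jit(nopython=True, nogil=True, parallel=False, cache=True)(_copy_func(_resample_loop))
_resample_loop_p = jit(nopython=True, nogil=True, parallel=True, cache=True)(_copy_func(_resample_loop))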
@bmcfee Thank you for looking into this. I am not very familiar with these, but I have one question:
If you look at the "To reproduce (running on CircleCI docker executor)" section above, you will see that no resample_XXX function is used at all while measuring the timing. There is _resample_loop_p = jit(...), but it is done before the time measurement. However, if the jit is done with parallel=True (together with guvectorize), it takes much longer to finish the loop.
So I don't think the issue (just) comes from compilation cost.
There is _resample_loop_p = jit(...), but it is done before the time measurement. However, if the jit is done with parallel=True (together with guvectorize), it takes much longer to finish the loop. So I don't think the issue (just) comes from compilation cost.
Well that is downright baffling, and makes me suspect that something is going very wrong either in torch or in numba/llvmlite.
If the function isn't being called at all, compilation is the only thing we have left. Are you absolutely certain that something isn't calling the compiled interpolator function? Sorry, I have no idea what's going on in the transformers object hierarchy, and really don't have the bandwidth to dig into it further.
Hi
We are observing the same issue in torchaudio CI. resampy 0.3.0 or 0.2.2 works fine, but resampy 0.3.1 causes a timeout in the tutorial generation job.
Failure with resampy 0.3.1 https://app.circleci.com/pipelines/github/pytorch/audio/11789/workflows/4f6d4b3d-5cab-4d9c-b9d9-2b2a7a3ebeb0/jobs/802671
Successful run after pinning resampy 0.2.2. https://app.circleci.com/pipelines/github/pytorch/audio/11801/workflows/69164737-8627-464f-bb62-3ee1beb89163/jobs/804295?invite=true#step-106-241
The tutorial experiencing the timeout does not use resampy. It does not even import it. However, the tutorial executed before the failing one uses librosa.power_to_db. So there seems to be some hysteresis.
The librosa version is 0.9.2 in all cases.
All build_docs jobs in the build workflow after July 7th are failing. https://app.circleci.com/pipelines/github/pytorch/audio?branch=nightly
The execution stack looks like
CircleCI
The tutorial experiencing the timeout is https://github.com/pytorch/audio/blob/main/examples/tutorials/online_asr_tutorial.py
The tutorial executed before it is https://github.com/pytorch/audio/blob/main/examples/tutorials/audio_feature_augmentation_tutorial.py, which uses librosa.power_to_db.
The particular code exhibiting the timeout is our custom C++ operation, which calls FFmpeg to fetch an MP3 from AWS. The C++ code is bound to Python using PyTorch, but I'm not sure if that's related.
Are you absolutely certain that something isn't calling the compiled interpolator function?
I will try to find an example which only uses simple torch operations, or see if the compiled function is somehow being called.
Just as a heads up, I won't have time to look at this more until next week due to deadlines and travel. But I'm sure we'll get to the bottom of it.
I can only observe this issue on CircleCI docker executor.
I'm starting to wonder if docker is a necessary ingredient here - a similar issue was reported in librosa recently https://github.com/librosa/librosa/issues/1541 , note the following comment in the issue:
I'm running things inside of a Docker container running heroku/heroku:20 (based off Ubuntu 20). If I run import librosa from the host computer running macOS the speed is normal. The Mac is a 2021 MacBook Pro M1.
Now, I'm pretty certain that the librosa issue is actually a duplicate of this issue, but I mention it because it specifically mentions that taking docker out of the equation does resolve the problem.
The fact that resampy 0.3.0 does not seem to trigger this points to a specific interaction between docker and @guvectorize (added in 0.3.1), but not parallel compilation (added in 0.3.0).
Hi, yes, it seems to be related to docker. More precisely, I only see this issue occur on the CircleCI docker executor.
(I also mentioned this in some previous comments.)
Interesting - worth noting that it can't only be circleCI, since the librosa issue is reported on heroku, but maybe there's some common factor to how both of those platforms use docker that isn't shared by GCP.
I think this is beginning to smell like an upstream issue in numba (or maybe llvmlite), and not something due to how resampy is using numba.
@bmcfee I am able to reproduce this issue with simple torch ops.

With parallel=False (and with guvectorize):
- nn.Embedding: 1e-5 ~ 5e-5 seconds
- +: 1e-5 ~ 1e-4 seconds

With parallel=True (and with guvectorize):
- nn.Embedding: 5e-2 seconds
- +: 5e-2 seconds

This is a 1000x slowdown!
Would you like to provide a simple version of _resample_loop (it could be any dummy function) together with the guvectorize signature, and open an issue in either numba, llvm, or torch? (You definitely know these things much better than me.) Thank you. (Or we can just use the whole example I provided here.)
Change the __main__ in this comment to the following:
if __name__ == "__main__":
    import datetime  # needed for the timing measurements below

    batch_size = 64
    seq_len = 128
    vocab_size = 1024
    hidden_size = 32

    layer = nn.Embedding(vocab_size, hidden_size)
    inputs = torch.ones(batch_size, seq_len, dtype=torch.int32)

    # warm-up
    for i in range(10):
        output = layer(inputs)

    for i in range(10):
        s = datetime.datetime.now()
        for j in range(100):
            output = layer(inputs)
        e = datetime.datetime.now()
        print(f"nn.Embedding ({i}): {(e - s).total_seconds() / 100} seconds")

    print("=" * 40)

    x = torch.ones(64, 128, 32, dtype=torch.float32)
    y = torch.ones(64, 128, 32, dtype=torch.float32)

    # warm-up
    for i in range(10):
        z = x + y

    for i in range(10):
        s = datetime.datetime.now()
        for j in range(100):
            z = x + y
        e = datetime.datetime.now()
        print(f"z = x + y ({i}): {(e - s).total_seconds() / 100} seconds")

    print("=" * 40)

    q = torch.ones(64, 4, 128, 8, dtype=torch.float32)
    k = torch.ones(64, 4, 8, 128, dtype=torch.float32)

    # warm-up
    for i in range(10):
        attention_scores = torch.matmul(q, k)

    for i in range(10):
        s = datetime.datetime.now()
        for j in range(100):
            attention_scores = torch.matmul(q, k)
        e = datetime.datetime.now()
        print(f"torch.matmul ({i}): {(e - s).total_seconds() / 100} seconds")

    print("=" * 40)
Would you like to provide a simple version of _resample_loop (it could be any dummy function) together with the guvectorize signature, and open an issue in either numba, llvm, or torch? (You definitely know these things much better than me.) Thank you. (Or we can just use the whole example I provided here.)
I was able to run this script, and verify that I don't see any slowdown on my dev machine (not using docker). I think that before raising an issue upstream, we should try to reproduce this outside of circleci, just to eliminate other sources of variation on shared machines.
Do you know if it's possible to pull down the docker image you're using so we can try it locally?
It is cimg/python:3.7.12.
However, running that image on a GCP VM didn't produce the issue (I tried with some previous test code). I can try the above latest code snippet with docker on GCP. --> Checked: the same image running on a GCP VM shows no issue.
In fact, I think it is not an image issue; it is probably the way the docker server and docker image are launched on CircleCI.
In fact, I think it is not an image issue; it is probably the way the docker server and docker image are launched on CircleCI.
Do we know anything more about how circle and gcp differ in this?
Do we know anything more about how circle and gcp differ in this?
Not from me, unfortunately. However, it would be great if the author of the librosa issue could try my code snippet on heroku too and see if the slowdown also occurs.
Looks like with torch 1.12, this issue disappears.
See
:face_with_spiral_eyes:
Now i'm really confused, because the issue over in librosa did not have torch involved at all.
Sorry to add more confusing information, but we observed this with PyTorch 1.12.0 as well in TorchAudio release branch.
The failing job https://app.circleci.com/pipelines/github/pytorch/audio/11756/workflows/84b6bb5e-47f5-4494-9663-19f045180d61/jobs/797806
Pinning resampy 0.2.2 fixed the issue.
😵💫
Now i'm really confused, because the issue over in librosa did not have torch involved at all.
Let me run a few more times. I am worried that my previous statement might be False
ok, well nice to know that the torch update is (might be?) orthogonal to this issue.
Let me run a few more times. I am worried that my previous statement might be False
@ydshieh any confirmation on this? I have some cycles to work on this issue now.
I have enough evidence to say yes. More precisely, the script I provided is slow on torch-1.12.0.dev20220216%2Bcpu-cp37-cp37m-linux_x86_64.whl but not on torch-1.12.0.dev20220217%2Bcpu-cp37-cp37m-linux_x86_64.whl.
(Of course, all statements are for running on CircleCI with the docker executor.)
I did have 2 or 3 runs where the issue did not appear under the same conditions.
If you are interested, here is the run page. You can just check the 4 latest runs.
I am trying to build torch from source for the commits between these 2 dates. Hopefully we can see which change/commit resolves the issue.
I feel that the issue opened on librosa is different from the one I face. Here we don't have a very slow import; instead, some ops slow down once resampy 0.3.1 is imported.
I have enough evidence to say yes. More precisely, the script I provided is slow on torch-1.12.0.dev20220216%2Bcpu-cp37-cp37m-linux_x86_64.whl but not on torch-1.12.0.dev20220217%2Bcpu-cp37-cp37m-linux_x86_64.whl. (Of course, all statements are for running on CircleCI with the docker executor.) I did have 2 or 3 runs where the issue did not appear under the same conditions.
Oof. How hard would it be to get a diff of those two versions? My hunch here is that none of this has anything to do with changes in pytorch, and might just come down to inherent randomness of running on a shared host. I'd love to be wrong about that though!
I feel that the issue opened on librosa is different from the one I face. Here we don't have a very slow import; instead, some ops slow down once resampy 0.3.1 is imported.
Good observation. The slow import might be due to all the build targets in guvectorize that hit at import-time (rather than JIT). I have a local branch that can bring that down substantially by dropping the explicit build targets. The implications of that are rather involved though, so I'll migrate it to another issue.
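For illustration only (a sketch of that direction, not the actual branch): with a numba version that supports dynamic gufuncs, the explicit dtype targets can be dropped so that nothing is compiled at import time and each dtype combination is built lazily on first call.

from numba import guvectorize

# Sketch: no explicit dtype targets, so no compilation happens at import time.
# Each dtype combination is compiled lazily the first time it is encountered
# (assumes a numba version with dynamic gufunc support).
@guvectorize("(n),(m),(p),(p),(),()->(m)")
def resample_f_lazy(x, t_out, interp_win, interp_delta, num_table, scale, y):
    # _resample_loop_p is the jit-compiled inner loop from the script above
    _resample_loop_p(x, t_out, interp_win, interp_delta, num_table, scale, y)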
There are about 120 commits that day in torch, and it looks like a lot of files changed.
I also think it's possible that this comes from the randomness of different hosts.
There are about 120 commits that day in torch, and it looks like a lot of files changed.
Looking through the log now, this looks like the only reasonable culprit IMO:
https://github.com/pytorch/pytorch/commit/17b3ba148
I could definitely see potential issues stemming from MKL interacting badly with (other) parallel code.
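If that MKL/thread-pool interaction is the culprit, one quick (hypothetical) check would be to pin every thread pool to a single worker before running the timing script above and see whether the slowdown disappears:

# Diagnostic sketch: serialize all the thread pools that could be fighting
# over the container's CPU quota, then re-run the timing script.
import os
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["NUMBA_NUM_THREADS"] = "1"

import torch
torch.set_num_threads(1)  # PyTorch intra-op parallelism
print("torch threads:", torch.get_num_threads())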
You are really good at spotting this - I can double check with the builds from source, but probably not today.
You are really good at spotting this
:grin: git whatchanged --since="2022-02-16" --until="2022-02-17" is a life-saver. It also helps to have been bitten by silent MKL quirks in the past.
Just as a heads up, PR #109 implements a handful of changes as mentioned in #108, and also reverts the default back to parallel=False. I'm not sure this directly resolves the potential MKL conflict / docker issue, but if you could give that a shot and let me know, it would help prioritize putting out a bugfix release next week.
OK!
The PR makes my script run fast independent of the torch versions.
The PR makes my script run fast independent of the torch versions.
Great, thanks for confirming. I'll merge that PR and cut the 0.4 release on monday.
This may be tangentially related, but I've also isolated a huge slowdown to our scripts that import librosa with resampy 0.3.1 on our Coral Dev Board.
The result of sudo python3 -X importtime -c "import librosa" shows 92975737 microseconds (~93 seconds) of the import taken up by resampy.interpn. After downgrading resampy to 0.2.2, resampy.interpn only takes 3066 microseconds (~0.003 seconds) of the import.
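(For a quick sanity check of just that import cost, a minimal timing snippet along these lines should be enough; this is only a sketch, and the importtime output above is the authoritative measurement.)

# Minimal sketch: time only the resampy import in a fresh interpreter.
import time

t0 = time.perf_counter()
import resampy
print(f"import resampy took {time.perf_counter() - t0:.2f} s (resampy {resampy.__version__})")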
To be more transparent about the system:
Coral Dev Board (Mendel Linux 5.3) w/ LLVM 7
audioread 2.1.9
joblib 1.1.0
librosa 0.8.1
llvmlite 0.32.0
numba 0.49.0
numpy 1.16.2
resampy 0.3.1
scikit-learn 0.20.2
scipy 1.1.0
SoundFile 0.10.3.post1
@Path-A that's indeed a separate problem, and resolved by PR #109
Ok, the 0.4.0 release is up on pypi and making its way through conda-forge now. I'll close this issue out, but please reopen it if problems recur with the new release.
Hi there!
We just debugged a problem on the Transformers CI that was suddenly taking way longer than before (by a 20x factor), and it all came down to the new release of resampy (pinning to 0.3.0 fixed the issue). We have an indirect dependency via librosa. The strange thing is that it makes tests that do not use resampy at all way, way slower. Are you changing things in other libraries maybe?
I don't have a nice reproducer for you, sorry, but just wanted to let you know!