That's deeply weird. If you can't provide a reproducing example, can you at least describe more about what the environment that triggers slow-down looks like?
Nothing in librosa has changed relative to the resampy updates.
The only thing that comes to mind is that resampy 0.3.1 has a bit more going on at compilation time than 0.3.0 (different dtype targets are explicitly listed), but a 20x slowdown in unrelated code makes zero sense to me.
We observed the slowdown in CircleCI while running our tests; the full instructions are in this CircleCI config file. I realize it may be difficult to reproduce, but basically all jobs that install the torch-speech dependency were getting really slow, so anything like:
pip install transformers[sklearn,torch,testing,sentencepiece,torch-speech]
Our best guess is that it somehow changed things in Numba/NumPy that then reverted to a slow version, but nothing more concrete than that.
Our best guess is that it somehow changed things in Numba/NumPy that then reverted to a slow version, but nothing more concrete than that.
Strange. The dependencies have not changed from 0.3.0 to 0.3.1, so I don't see why that would happen.
The only thing that changed was how we handle multi-channel resampling (see PR #104), which now uses numba generalized ufunc vectorization so that the interpolation loop operates independently on each channel. In moving to @guvectorize, I had to explicitly spell out the dtype targets, which I could envision adding a little overhead, but only when the functions are actually called (and then, only the first time).
The only time this should actually have an observable effect is when resampling multichannel signals, and then, it should be strictly better in 0.3.1 (see #102 for discussion). If you're not doing multi-channel, performance should be equivalent (and match or beat that of the 0.2 series).
The other thing that could be happening here is that resampy is going parallel and interfering with other processes in your CI. You could check this quickly by limiting the number of threads that numba can take (see docs) in your environment. At present, resampy exposes an interface to disable parallelism, but librosa does not (see this issue for what needs to happen there), so a global override is probably your best strategy for determining if this is the culprit if you're only accessing resampy through librosa.
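For what it's worth, a minimal sketch of that kind of global override might look like the following (purely diagnostic; NUMBA_NUM_THREADS and numba.get_num_threads are standard numba knobs, and the librosa.resample call is just a stand-in for whatever resampling your tests actually do):

# Diagnostic sketch: cap numba's thread pool before anything compiles,
# to test whether numba parallelism is competing with other CI processes.
import os
os.environ["NUMBA_NUM_THREADS"] = "1"  # must be set before numba initializes its threading layer

import numpy as np
import numba
import librosa

y = np.random.randn(22050).astype(np.float32)
y_16k = librosa.resample(y, orig_sr=22050, target_sr=16000)  # reaches resampy under the hood
print("numba threads:", numba.get_num_threads(), "| resampled shape:", y_16k.shape)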
@sgugger curious if you've made any headway on this?
We encountered the same weird problem ...
The inference speed of the transformer slows down with resampy 0.3.1.
Downgrading resampy (0.3.1 -> 0.2.2) solved the problem.
@vBaiCai thanks - can you provide any kind of hints for reproducing this behavior though?
I tried to reproduce this, even using CircleCI's convenience docker image running on a GCP VM, which shows no issue. So far I can only see this issue when the code is running on CircleCI (with the docker executor).
@vBaiCai: do you also see this issue on CircleCI, or some other environments?
The issue occurs after the commit a36275670f284e8f09d71ef3515dc7513aeba6e6
(fixed #102, fast resampling in high dimensions).
With some more investigation, here are my findings:
- parallel=True in _resample_loop_p = jit(nopython=True, nogil=True, parallel=True)(_resample_loop) causes the problem.
- Without @guvectorize (for resample_f_p): no slowdown.
- With parallel=False: no slowdown.
- Timing with _resample_loop_p = jit(nopython=True, nogil=True, parallel=True)(_resample_loop):
- Timing with _resample_loop_p = jit(nopython=True, nogil=True, parallel=False)(_resample_loop):
import torch
from torch import nn
from tqdm import tqdm
"""Numba implementation of resampler"""
from numba import (
guvectorize,
float32,
float64,
jit,
prange,
int16,
int32,
int64,
complex64,
complex128,
)
def _resample_loop(x, t_out, interp_win, interp_delta, num_table, scale, y):
    index_step = int(scale * num_table)
    time_register = 0.0

    n = 0
    frac = 0.0
    index_frac = 0.0
    offset = 0
    eta = 0.0
    weight = 0.0

    nwin = interp_win.shape[0]
    n_orig = x.shape[0]
    n_out = t_out.shape[0]

    for t in prange(n_out):
        time_register = t_out[t]

        # Grab the top bits as an index to the input buffer
        n = int(time_register)

        # Grab the fractional component of the time index
        frac = scale * (time_register - n)

        # Offset into the filter
        index_frac = frac * num_table
        offset = int(index_frac)

        # Interpolation factor
        eta = index_frac - offset

        # Compute the left wing of the filter response
        i_max = min(n + 1, (nwin - offset) // index_step)
        for i in range(i_max):
            weight = (
                interp_win[offset + i * index_step]
                + eta * interp_delta[offset + i * index_step]
            )
            y[t] += weight * x[n - i]

        # Invert P
        frac = scale - frac

        # Offset into the filter
        index_frac = frac * num_table
        offset = int(index_frac)

        # Interpolation factor
        eta = index_frac - offset

        # Compute the right wing of the filter response
        k_max = min(n_orig - n - 1, (nwin - offset) // index_step)
        for k in range(k_max):
            weight = (
                interp_win[offset + k * index_step]
                + eta * interp_delta[offset + k * index_step]
            )
            y[t] += weight * x[n + k + 1]


_resample_loop_p = jit(nopython=True, nogil=True, parallel=True)(_resample_loop)


@guvectorize(
    [
        (int16[:], float64[:], float64[:], float64[:], int32, float32, int16[:]),
        (int32[:], float64[:], float64[:], float64[:], int32, float32, int32[:]),
        (int64[:], float64[:], float64[:], float64[:], int32, float32, int64[:]),
        (float32[:], float64[:], float64[:], float64[:], int32, float32, float32[:]),
        (float64[:], float64[:], float64[:], float64[:], int32, float32, float64[:]),
        (complex64[:], float64[:], float64[:], float64[:], int32, float32, complex64[:]),
        (complex128[:], float64[:], float64[:], float64[:], int32, float32, complex128[:]),
    ],
    "(n),(m),(p),(p),(),()->(m)",
    nopython=True,
)
def resample_f_p(x, t_out, interp_win, interp_delta, num_table, scale, y):
    _resample_loop_p(x, t_out, interp_win, interp_delta, num_table, scale, y)


class DummyModel(nn.Module):
    def __init__(self, n_layers, dim):
        super().__init__()
        self.layers = [nn.Linear(dim, dim) for _ in range(n_layers)]
        self.activation = nn.Tanh()

    def forward(self, hidden_states):
        for layer in self.layers:
            hidden_states = layer(hidden_states)
            hidden_states = self.activation(hidden_states)
        return hidden_states


if __name__ == "__main__":
    batch_size = 2
    seq_len = 16384

    # dim = 128
    # model = DummyModel(n_layers=16, dim=dim)
    # input = torch.ones(batch_size, seq_len, dim)
    #
    # for i in tqdm(range(16)):
    #     output = model(input)
    # print(output.shape)

    from transformers import AutoModelForSpeechSeq2Seq

    model = AutoModelForSpeechSeq2Seq.from_pretrained("hf-internal-testing/tiny-random-speech-encoder-decoder")

    encoder_input = torch.ones(batch_size, seq_len, dtype=torch.float32)
    decoder_input = torch.ones(batch_size, 128, dtype=torch.int32)

    for i in tqdm(range(16)):
        output = model(inputs=encoder_input, decoder_input_ids=decoder_input)
    print(output.logits.shape)
Actually, I also observe (some) slowdown in other environments. For example, on a GCP VM:
- parallel=True: 30.0 it/s
- parallel=False: 50.0 it/s
(but this doesn't cause a real problem, as both are fast enough)
Thanks @ydshieh for the detailed report!
It doesn't matter if the methods are called or not. As long as they are imported, the issue occurs.
This to me says that the issue is the jit-compile time, rather than the run time. I wonder if adding cache=True here could help resolve things?
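For concreteness, the change being floated is roughly the following (an illustrative sketch only, not shipped code; _resample_loop is the pure-Python loop from the script above, and _resample_loop_s names its serial counterpart here):

from numba import jit

# Hypothetical variant: ask numba to cache the compiled loops on disk so the
# compilation cost is only paid the first time in a given environment.
_resample_loop_s = jit(nopython=True, nogil=True, parallel=False, cache=True)(_resample_loop)
_resample_loop_p = jit(nopython=True, nogil=True, parallel=True, cache=True)(_resample_loop)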
We have some confirmation over in librosa that this is indeed a jit compilation issue and not a runtime issue per se.
Unfortunately, adding cache=True to the build to amortize this cost triggers a segfault in numba across sessions. This is apparently a known and long-standing issue: https://github.com/numba/numba/issues/4807
This is where it gets a little dicey from our side:
1. We can cache resample_f_s without issue. Caching on resample_f_p breaks.
2. The serial and parallel builds share the same inner loop function, and cache=True on these would result in either the parallel implementation being used for the serial or vice versa. Definitely not what we want. This too is a known issue: https://github.com/numba/numba/issues/6845
(1) means that we always have to pay a compilation cost for the parallel import. Now, we could get around this by breaking into submodules and using some lazy-import tricks to defer the parallel build until it's absolutely necessary, but that makes me a bit nervous.
(2) means that at present, we have to pay the compilation cost twice on import until the upstream issue is resolved. I don't see a clean way around this, but I have a branch working right now where the inner loop function is compiled and cached twice using deep copies to avoid collisions, and that looks like it might be working.
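To sketch that idea (illustrative only, not the actual branch; _copy_func is a hypothetical helper and _resample_loop is the pure-Python loop from the script above): compiling two independent copies of the inner loop means the serial and parallel builds don't share a cache entry.

import types
from numba import jit

def _copy_func(f):
    # Hypothetical helper: build an independent function object with the same code,
    # so each compiled variant gets its own cache entry.
    g = types.FunctionType(f.__code__, f.__globals__, name=f.__name__,
                           argdefs=f.__defaults__, closure=f.__closure__)
    g.__dict__.update(f.__dict__)
    return g

_resample_loop_s = jit(nopython=True, nogil=True, parallel=False, cache=True)(_copy_func(_resample_loop))
_resample_loop_p = jit(nopython=True, nogil=True, parallel=True, cache=True)(_copy_func(_resample_loop))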
@bmcfee Thank you for looking into this. I am not very familiar with these, but I have one question:
If you look at the "To reproduce (running on CircleCI docker executor)" section above, you will see that no resample_XXX function is used at all while measuring the timing. There is _resample_loop_p = jit(...), but it is done before the time measurement. However, if the jit is done with parallel=True (together with guvectorize), it takes much longer to finish the loop.
So I don't think the issue (just) comes from compilation cost.
There is _resample_loop_p = jit(...), but it is done before the time measurement. However, if the jit is done with parallel=True (together with guvectorize), it takes much longer to finish the loop. So I don't think the issue (just) comes from compilation cost.
Well that is downright baffling, and makes me suspect that something is going very wrong either in torch or in numba/llvmlite.
If the function isn't being called at all, compilation is the only thing we have left. Are you absolutely certain that something isn't calling the compiled interpolator function? Sorry, I have no idea what's going on in the transformers object hierarchy, and really don't have the bandwidth to dig into it further.
Hi
We are observing the same issue in torchaudio CI. resampy 0.3.0 or 0.2.2 works fine, but resampy 0.3.1 causes a timeout in the tutorial generation job.
Failure with resampy 0.3.1 https://app.circleci.com/pipelines/github/pytorch/audio/11789/workflows/4f6d4b3d-5cab-4d9c-b9d9-2b2a7a3ebeb0/jobs/802671
Successful run after pinning resampy 0.2.2. https://app.circleci.com/pipelines/github/pytorch/audio/11801/workflows/69164737-8627-464f-bb62-3ee1beb89163/jobs/804295?invite=true#step-106-241
The tutorial experiencing the timeout does not use resampy. It does not even import it. However, the tutorial executed before the failing one uses librosa.power_to_db. So there seems to be some hysteresis.
The librosa version is 0.9.2 in all cases.
All build_docs jobs in the build workflow after July 7th are failing. https://app.circleci.com/pipelines/github/pytorch/audio?branch=nightly
The execution stack looks like
CircleCI
The tutorial experiencing the timeout is https://github.com/pytorch/audio/blob/main/examples/tutorials/online_asr_tutorial.py
The tutorial executed before it is https://github.com/pytorch/audio/blob/main/examples/tutorials/audio_feature_augmentation_tutorial.py, which uses librosa.power_to_db.
The particular code exhibiting the timeout is our custom C++ operation, which calls FFmpeg to fetch an MP3 from AWS. The C++ code is bound to Python using PyTorch, but I'm not sure if that's related.
Are you absolutely certain that something isn't calling the compiled interpolator function?
I will try to find an example which only uses simple torch operations, or see if the compiled function is somehow being called.
Just as a heads up, I won't have time to look at this more until next week due to deadlines and travel. But I'm sure we'll get to the bottom of it.
I can only observe this issue on CircleCI docker executor.
I'm starting to wonder if docker is a necessary ingredient here - a similar issue was reported in librosa recently https://github.com/librosa/librosa/issues/1541 , note the following comment in the issue:
I'm running things inside of a Docker container running heroku/heroku:20 (based off Ubuntu 20). If I run import librosa from the host computer running macOS the speed is normal. The Mac is a 2021 MacBook Pro M1.
Now, I'm pretty certain that the librosa issue is actually a duplicate of this issue, but I mention it because it specifically mentions that taking docker out of the equation does resolve the problem.
The fact that resampy 0.3.0 does not seem to trigger this points to a specific interaction between docker and @guvectorize (added in 0.3.1), but not parallel compilation (added in 0.3.0).
Hi, yes, it seems to be related to docker. More precisely, I only see this issue occur on the CircleCI docker executor.
(I also mentioned this in some previous comments.)
Interesting - worth noting that it can't only be circleCI, since the librosa issue is reported on heroku, but maybe there's some common factor to how both of those platforms use docker that isn't shared by GCP.
I think this is beginning to smell like an upstream issue in numba (or maybe llvmlite), and not something due to how resampy is using numba.
@bmcfee I am able to reproduce this issue with simple torch ops.

With parallel=False (and with guvectorize):
- nn.Embedding: 1e-5 ~ 5e-5 seconds
- +: 1e-5 ~ 1e-4 seconds

With parallel=True (and with guvectorize):
- nn.Embedding: 5e-2 seconds
- +: 5e-2 seconds

This is a 1000x slowdown!
Would you like to provide a simple version of _resample_loop (it could be any dummy function) together with the guvectorize signature, and open an issue in either numba, llvm, or torch? (You definitely know these things much better than me.) Thank you. (Or we can just use the whole example I provided here.)
Change the __main__ in this comment to the following:
if __name__ == "__main__":
    import datetime  # needed for the timing measurements below

    batch_size = 64
    seq_len = 128
    vocab_size = 1024
    hidden_size = 32

    layer = nn.Embedding(vocab_size, hidden_size)
    inputs = torch.ones(batch_size, seq_len, dtype=torch.int32)

    # warm-up
    for i in range(10):
        output = layer(inputs)

    for i in range(10):
        s = datetime.datetime.now()
        for j in range(100):
            output = layer(inputs)
        e = datetime.datetime.now()
        print(f"nn.Embedding ({i}): {(e - s).total_seconds() / 100} seconds")

    print("=" * 40)

    x = torch.ones(64, 128, 32, dtype=torch.float32)
    y = torch.ones(64, 128, 32, dtype=torch.float32)

    # warm-up
    for i in range(10):
        z = x + y

    for i in range(10):
        s = datetime.datetime.now()
        for j in range(100):
            z = x + y
        e = datetime.datetime.now()
        print(f"z = x + y ({i}): {(e - s).total_seconds() / 100} seconds")

    print("=" * 40)

    q = torch.ones(64, 4, 128, 8, dtype=torch.float32)
    k = torch.ones(64, 4, 8, 128, dtype=torch.float32)

    # warm-up
    for i in range(10):
        attention_scores = torch.matmul(q, k)

    for i in range(10):
        s = datetime.datetime.now()
        for j in range(100):
            attention_scores = torch.matmul(q, k)
        e = datetime.datetime.now()
        print(f"torch.matmul ({i}): {(e - s).total_seconds() / 100} seconds")

    print("=" * 40)
Would you like to provide a simple version of _resample_loop (it could be any dummy function) together with the guvectorize signature, and open an issue in either numba, llvm, or torch? (You definitely know these things much better than me.) Thank you. (Or we can just use the whole example I provided here.)
I was able to run this script, and verify that I don't see any slowdown on my dev machine (not using docker). I think that before raising an issue upstream, we should try to reproduce this outside of circleci, just to eliminate other sources of variation on shared machines.
Do you know if it's possible to pull down the docker image you're using so we can try it locally?
It is cimg/python:3.7.12.
However, running that image on a GCP VM didn't produce the issue (I tried with some previous test code). I can try the above latest code snippet with docker on GCP. --> Checked: the same image running on a GCP VM shows no issue.
In fact, I think it is not an image issue; it is probably the way the docker server and docker image are launched on CircleCI.
In fact, I think it is not an image issue; it is probably the way the docker server and docker image are launched on CircleCI.
Do we know anything more about how circle and gcp differ in this?
Do we know anything more about how circle and gcp differ in this?
Not from me, unfortunately. However, it would be great if the author of the librosa issue could try my code snippet on heroku too and see if the slowdown also occurs.
Looks like with torch 1.12, this issue disappears.
See
:face_with_spiral_eyes:
Now i'm really confused, because the issue over in librosa did not have torch involved at all.
Sorry to add more confusing information, but we observed this with PyTorch 1.12.0 as well in TorchAudio release branch.
The failing job https://app.circleci.com/pipelines/github/pytorch/audio/11756/workflows/84b6bb5e-47f5-4494-9663-19f045180d61/jobs/797806
Pinning resampy 0.2.2 fixed the issue.
😵💫
Now i'm really confused, because the issue over in librosa did not have torch involved at all.
Let me run a few more times. I am worried that my previous statement might be False
ok, well nice to know that the torch update is (might be?) orthogonal to this issue.
Let me run a few more times. I am worried that my previous statement might be False
@ydshieh any confirmation on this? I have some cycles to work on this issue now.
I have enough evidence to say yes. More precisely, the script I provided is slow on torch-1.12.0.dev20220216%2Bcpu-cp37-cp37m-linux_x86_64.whl but not on torch-1.12.0.dev20220217%2Bcpu-cp37-cp37m-linux_x86_64.whl.
(Of course, all statements are for running on CircleCI with the docker executor.)
I did have 2 or 3 runs where the issue did not appear under the same conditions.
If you are interested, here is the run page. You can just check the 4 latest runs.
I am trying to build torch from source for the commits between these 2 dates. Hopefully we can see which change/commit resolves the issue.
I feel that the issue opened on librosa is different from the one I face. Here we don't have a very slow import; instead, some ops slow down once resampy 0.3.1 is imported.
I have enough evidence to say yes. More precisely, the script I provided is slow on torch-1.12.0.dev20220216%2Bcpu-cp37-cp37m-linux_x86_64.whl but not on torch-1.12.0.dev20220217%2Bcpu-cp37-cp37m-linux_x86_64.whl. (Of course, all statements are for running on CircleCI with the docker executor.) I did have 2 or 3 runs where the issue did not appear under the same conditions.
Oof. How hard would it be to get a diff of those two versions? My hunch here is that none of this has anything to do with changes in pytorch, and might just come down to inherent randomness of running on a shared host. I'd love to be wrong about that though!
I feel that the issue opened on librosa is different from the one I face. Here we don't have a very slow import; instead, some ops slow down once resampy 0.3.1 is imported.
Good observation. The slow import might be due to all the build targets in guvectorize that hit at import-time (rather than JIT). I have a local branch that can bring that down substantially by dropping the explicit build targets. The implications of that are rather involved though, so I'll migrate it to another issue.
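For illustration only (a sketch of that direction, not the actual branch): with a numba version that supports dynamic gufuncs, the explicit dtype targets can be dropped so that nothing is compiled at import time and each dtype combination is built lazily on first call.

from numba import guvectorize

# Sketch: no explicit dtype targets, so no compilation happens at import time.
# Each dtype combination is compiled lazily the first time it is encountered
# (assumes a numba version with dynamic gufunc support).
@guvectorize("(n),(m),(p),(p),(),()->(m)")
def resample_f_lazy(x, t_out, interp_win, interp_delta, num_table, scale, y):
    # _resample_loop_p is the jit-compiled inner loop from the script above
    _resample_loop_p(x, t_out, interp_win, interp_delta, num_table, scale, y)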
There are about 120 commits that day in torch, and it looks like a lot of files changed.
I also think it's possible that this comes from the randomness of different hosts.
There are about 120 commits that day in torch, and it looks like a lot of files changed.
Looking through the log now, this looks like the only reasonable culprit IMO:
https://github.com/pytorch/pytorch/commit/17b3ba148
I could definitely see potential issues stemming from MKL interacting badly with (other) parallel code.
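If that MKL/thread-pool interaction is the culprit, one quick (hypothetical) check would be to pin every thread pool to a single worker before running the timing script above and see whether the slowdown disappears:

# Diagnostic sketch: serialize all the thread pools that could be fighting
# over the container's CPU quota, then re-run the timing script.
import os
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["NUMBA_NUM_THREADS"] = "1"

import torch
torch.set_num_threads(1)  # PyTorch intra-op parallelism
print("torch threads:", torch.get_num_threads())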
You are really good at spotting this - I can double check with the builds from source, but probably not today.
You are really good at spotting this
:grin: git whatchanged --since="2022-02-16" --until="2022-02-17" is a life-saver. It also helps to have been bitten by silent MKL quirks in the past.
Just as a heads up, PR #109 implements a handful of changes as mentioned in #108, and also reverts the default back to parallel=False. I'm not sure this directly resolves the potential MKL conflict / docker issue, but if you could give that a shot and let me know, it would help prioritize putting out a bugfix release next week.
OK!
The PR makes my script run fast independent of the torch versions.
The PR makes my script run fast independent of the torch versions.
Great, thanks for confirming. I'll merge that PR and cut the 0.4 release on monday.
This may be tangentially related, but I've also isolated a huge slowdown to our scripts that import librosa with resampy 0.3.1 on our Coral Dev Board.
The result of sudo python3 -X importtime -c "import librosa" shows 92975737 microseconds (~93 seconds) of the import taken up by resampy.interpn. After downgrading resampy to 0.2.2, resampy.interpn only takes 3066 microseconds (~0.003 seconds) of the import.
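(For a quick sanity check of just that import cost, a minimal timing snippet along these lines should be enough; this is only a sketch, and the importtime output above is the authoritative measurement.)

# Minimal sketch: time only the resampy import in a fresh interpreter.
import time

t0 = time.perf_counter()
import resampy
print(f"import resampy took {time.perf_counter() - t0:.2f} s (resampy {resampy.__version__})")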
To be more transparent about the system:
Coral Dev Board (Mendel Linux 5.3) w/ LLVM 7
audioread 2.1.9
joblib 1.1.0
librosa 0.8.1
llvmlite 0.32.0
numba 0.49.0
numpy 1.16.2
resampy 0.3.1
scikit-learn 0.20.2
scipy 1.1.0
SoundFile 0.10.3.post1
@Path-A that's indeed a separate problem, and resolved by PR #109
Ok, the 0.4.0 release is up on pypi and making its way through conda-forge now. I'll close this issue out, but please reopen it if problems recur with the new release.
Hi there!
We just debugged a problem on the Transformers CI that was suddenly taking way longer than before (by a 20x factor), and it all came down to the new release of resampy (pinning to 0.3.0 fixed the issue). We have an indirect dependency via librosa. The strange thing is that it makes tests that do not use resampy at all way, way slower. Are you changing things in other libraries maybe?
I don't have a nice reproducer for you, sorry, but just wanted to let you know!