WordPress / openverse

Openverse is a search engine for openly-licensed media. This monorepo includes all application code.
https://openverse.org
MIT License

Audio waveform generation can exceed gunicorn worker timeout #2468

Closed: zackkrida closed this issue 4 days ago

zackkrida commented 1 year ago

Description

Sentry link: https://openverse.sentry.io/share/issue/d29e6afe3ab44db1938b47ddc587e1ac/

The timeout for our API workers is 30 seconds. Occasionally, the waveform generation:

https://github.com/WordPress/openverse/blob/316b59caebe44aa06a3eeea41c4fd4e62bec768d/api/api/utils/waveform.py#L151-L159

exceeds this timeout and causes the worker to abort in the waveform views:

https://github.com/WordPress/openverse/blob/35100d3c719e142da2012303a5f2790c637f737a/api/api/views/audio_views.py#L77-L100

Potential solution

We wrap part of this code in a timeout, using something like `timeout-decorator`:

```python
from timeout_decorator import timeout

@timeout(15)
def generate_peaks(audio) -> list[float]:
    # ... existing waveform generation ...
```

Then, if the waveform generation fails, we store the audio UUID in Redis. Finally, we could write a DAG that processes all the failed waveform generations at some scheduled interval, in a manner that doesn't have a 30-second timeout.
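For illustration, a minimal sketch of that fallback, assuming the `timeout-decorator` package and a Redis client from `django-redis`; the helper name, Redis key, and `identifier` attribute are placeholders rather than actual Openverse code:

```python
# Hypothetical fallback (not actual Openverse code): if generation times out,
# remember the audio's UUID in Redis for a later batch job. Assumes the
# timeout-decorator package raises timeout_decorator.TimeoutError and that
# the audio object exposes its UUID as `identifier`.
import timeout_decorator
from django_redis import get_redis_connection

FAILED_WAVEFORM_SET = "failed_waveform_audio_uuids"  # illustrative Redis key


def get_or_defer_peaks(audio) -> list[float] | None:
    """Return peaks, or record the audio UUID for later processing on timeout."""
    try:
        return generate_peaks(audio)  # the @timeout-decorated function above
    except timeout_decorator.TimeoutError:
        redis = get_redis_connection("default")
        redis.sadd(FAILED_WAVEFORM_SET, str(audio.identifier))
        return None
```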

@WordPress/openverse-maintainers what do you think about the priority here? I've marked it as high initially because these SystemExits are quite bad. However, this one has only happened twice in the last 30 days.

sarayourfriend commented 1 year ago

The temporary solution you've proposed makes sense. I think that, long term, having a separate service that can be configured with a more lenient Gunicorn (or whatever worker runner) timeout would be better.

> Then, if the waveform generation fails, we store the audio UUID in Redis. Finally, we could write a DAG that processes all the failed waveform generations at some scheduled interval, in a manner that doesn't have a 30-second timeout.

This feels like a separate issue, maybe one that would warrant a small implementation plan. To get around the gunicorn 30-second timeout without changing it, we'd need to run waveform generation outside the API (in Airflow, I guess), which would be sufficiently complex to warrant its own implementation plan.

I'd also encourage us to use a higher timeout on the route than 15 seconds. 15 seconds cuts off anything that would take longer but still be within the Gunicorn timeout. We're using a 25-second total timeout on the thumbnail view requests to leave 5 seconds for Python and Django to do things, so we could apply that same idea here and use 25 on the waveform generation function.
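As a sketch of that budget (constant names here are illustrative, not real settings), the waveform timeout would simply be the worker timeout minus the overhead reserved for Python and Django:

```python
# Illustrative constants only; mirrors the thumbnail view's 25-second budget.
GUNICORN_WORKER_TIMEOUT = 30  # seconds before gunicorn kills the worker
REQUEST_OVERHEAD_BUDGET = 5   # reserved for Python/Django work around generation
WAVEFORM_GENERATION_TIMEOUT = GUNICORN_WORKER_TIMEOUT - REQUEST_OVERHEAD_BUDGET  # 25
```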

> I've marked it as high initially because these SystemExits are quite bad. However, this one has only happened twice in the last 30 days.

Originally my concern was that even if the SystemExit numbers are low, the entire worker is being shut down, which could cut off all the other connections assigned to that worker. However, I just realised that we're not using Gunicorn's gevent worker class, so there's actually a 1:1 relationship between a worker and a connection (we're using the default sync workers). In that case, it probably isn't causing that big of an issue for clients. The noisiness of these alerts is annoying and takes up a lot of time in triage, though, so I'd still advocate for a high priority for the sake of us maintainers not being constantly distracted by SystemExit alerts.
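For context, the worker model in question looks roughly like this in a gunicorn config file (illustrative values, not the actual Openverse configuration):

```python
# gunicorn.conf.py sketch: with the default sync worker class, each worker
# handles one request at a time, so a SystemExit aborts only that request,
# but the worker process still has to be replaced.
workers = 4             # illustrative count
worker_class = "sync"   # the default; gevent would multiplex many connections
timeout = 30            # the 30-second limit this issue is about
```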

sarayourfriend commented 1 year ago

> This feels like a separate issue, maybe one that would warrant a small implementation plan

After re-reading it, caching timed-out waveform generations and skipping them on subsequent requests should be part of this issue. The part that should be a separate implementation plan is the DAG that goes through and generates the waveforms afterwards.

This might actually be a better use case for Celery tasks than a DAG, because then we can immediately schedule the waveform generation if the original request times out and reuse the existing code, rather than needing to add waveform generation to the DAG.
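A rough sketch of that Celery approach; the task name, time limit, model access, and persistence are all illustrative assumptions, not existing Openverse code:

```python
from celery import shared_task


@shared_task(soft_time_limit=300)  # far more lenient than gunicorn's 30 seconds
def generate_waveform_async(audio_identifier: str) -> None:
    """Regenerate peaks outside the request/response cycle."""
    from api.models import Audio  # assumed import path

    audio = Audio.objects.get(identifier=audio_identifier)
    peaks = generate_peaks(audio)  # reuse the existing generation code
    # Persist the peaks wherever the API stores waveforms (illustrative).
    audio.waveform = peaks
    audio.save(update_fields=["waveform"])


# In the view, when the in-request generation times out:
#     generate_waveform_async.delay(str(audio.identifier))
```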

In any case, definitely a separate concern from the timeouts.

I'm moving this down to medium priority for the reasons Zack and I have shared above.

sarayourfriend commented 4 days ago

We haven't seen any errors like these in a long time. In the last month, we have had zero waveform requests take longer than 5 seconds (I was surprised by this, but it's somehow true!).

There are improvements we can make to the waveform endpoint, but this issue and the details it goes over are no longer relevant.

Part of that is because we don't use gunicorn for the API any more, but mostly we just don't have this timeout issue :tada: