Since 1.12.1: Shutdown (or rather startup?) fails due to unreferenced variable

Bogdanp / dramatiq

A fast and reliable background task processing library for Python 3.

https://dramatiq.io

GNU Lesser General Public License v3.0

4.3k stars 309 forks source link

Since 1.12.1: Shutdown (or rather startup?) fails due to unreferenced variable #463

Closed aberres closed 2 years ago

aberres commented 2 years ago

What version of Dramatiq are you using?

1.21.1

What did you do?

Since the update from 1.12.0 to 1.12.1 we are seeing the following errors:

 NameError: free variable 'running' referenced before assignment in enclosing scope
at .termhandler ( /venv/lib/python3.9/site-packages/dramatiq/cli.py:393 )
at .pthread_sigmask ( /usr/local/lib/python3.9/signal.py:60 )
at .try_unblock_signals ( /venv/lib/python3.9/site-packages/dramatiq/cli.py:238 )
at .worker_process ( /venv/lib/python3.9/site-packages/dramatiq/cli.py:409 )
at .run ( /usr/local/lib/python3.9/multiprocessing/process.py:108 )
at ._bootstrap ( /usr/local/lib/python3.9/multiprocessing/process.py:315 )

I could not yet investigate further, but maybe it is related to 8da51575ef402c91069f48cf2a524075cd3ffb8c?

FinnLidbetter commented 2 years ago

Oh maybe just moving running = True above the try_unblock_signals will fix this https://github.com/Bogdanp/dramatiq/blob/75579439172ea126efb4f83c753b0b513afb03ad/dramatiq/cli.py#L408-L412

FinnLidbetter commented 2 years ago

Or probably moving running = True to above the definition of term_handler makes more sense (e.g., to line 390)

Bogdanp commented 2 years ago

While @FinnLidbetter 's change seems appropriate, I wonder if you could give us more details about your setup @aberres. What OS are you on, and how are you stopping the process? Does it always happen? The window in which this can happen at the moment (given the current implementation) seems extremely narrow, and I'm unable to reproduce the issue.

aberres commented 2 years ago

Yes, sure.

The environment this happens on are Ubuntu Docker containers running on GKE (Google Kubernetes Engine). As we shut down/restart worker processes depending on available workload and additionally spin up complete environments for every branch in CI we got quite some shutdowns and restarts of workers. My gut feeling is that we are not seeing this issue most of the time.

I had another look at the logs and I'm a bit puzzled. It seems as if this message is not thrown during shutdown, but during startup.

If I read the logs correctly, sometimes SIGTERM is triggered pretty much instantly.

Sorry for the screenshots, but getting human readable text logs out of the Google Logs Explorer does not seem to be straight forward.

Logs

Note the second line, right after starting:

Some time later (we are waiting for a database instance to come up) the startup process continues as expected:

And then boom

Bogdanp commented 2 years ago

If I read the logs correctly, sometimes SIGTERM is triggered pretty much instantly.

OK, that's what I suspected. In that case, Finn's change should fix this problem. Thanks for the details!