NNG Panic on Python shutdown

broglep-work commented 2 years ago

We experience (under certain conditions) reproducible crash of python process.

panic: pthread_mutex_lock: Invalid argument
This message is indicative of a BUG.
Report this at https://github.com/nanomsg/nng/issues
/home/vsts/work/1/lib/python3.8/site-packages/pynng/_nng.abi3.so(nni_panic+0x112) [0x7fc452152762]
/home/vsts/work/1/lib/python3.8/site-packages/pynng/_nng.abi3.so(nni_aio_fini+0x20) [0x7fc45214bfe0]
/home/vsts/work/1/lib/python3.8/site-packages/pynng/_nng.abi3.so(nni_aio_free+0x1b) [0x7fc45214c11b]
/home/vsts/work/1/lib/python3.8/site-packages/pynng/_nng.abi3.so(+0x4632f) [0x7fc4520fd32f]
/opt/hostedtoolcache/Python/3.8.13/x64/lib/libpython3.8.so.1.0(+0x158783) [0x7fc4569ea783]
/opt/hostedtoolcache/Python/3.8.13/x64/lib/libpython3.8.so.1.0(_PyEval_EvalFrameDefault+0x4adc) [0x7fc456a31fcc]
/opt/hostedtoolcache/Python/3.8.13/x64/lib/libpython3.8.so.1.0(_PyFunction_Vectorcall+0xfa) [0x7fc4569c2dfa]
/opt/hostedtoolcache/Python/3.8.13/x64/lib/libpython3.8.so.1.0(_PyEval_EvalFrameDefault+0x7b8) [0x7fc456a2dca8]
/opt/hostedtoolcache/Python/3.8.13/x64/lib/libpython3.8.so.1.0(_PyEval_EvalCodeWithName+0x301) [0x7fc456a2c8d1]
/opt/hostedtoolcache/Python/3.8.13/x64/lib/libpython3.8.so.1.0(_PyFunction_Vectorcall+0x18e) [0x7fc4569c2e8e]
/opt/hostedtoolcache/Python/3.8.13/x64/lib/libpython3.8.so.1.0(+0x1337b3) [0x7fc4569c57b3]
/opt/hostedtoolcache/Python/3.8.13/x64/lib/libpython3.8.so.1.0(_PyEval_EvalFrameDefault+0x27c4) [0x7fc456a2fcb4]
/opt/hostedtoolcache/Python/3.8.13/x64/lib/libpython3.8.so.1.0(+0x13770b) [0x7fc4569c970b]
/opt/hostedtoolcache/Python/3.8.13/x64/lib/libpython3.8.so.1.0(+0x1e8624) [0x7fc456a7a624]
/opt/hostedtoolcache/Python/3.8.13/x64/lib/libpython3.8.so.1.0(+0x1e8d67) [0x7fc456a7ad67]
/opt/hostedtoolcache/Python/3.8.13/x64/lib/libpython3.8.so.1.0(+0x1e867e) [0x7fc456a7a67e]

My investigation showed that it is caused by access of nni_aio_lk which was deinit'd by nng_fini / nni_fini / nni_aio_sys_fini. nni_aio_sys_fini is triggered by _pynng_atexit and nni_aio_free by AIOHelper.__del__

This looks like a timing issue on python shutdown: depending on atexit functions are called and objects garbage collected, it might be possible that nng was already deinitialized and pynng, but pynng is still calling the nng lib

It is currently unclear how best to address this issue. Any ideas?

(for background, this occurs reproducibly in our CI when using pynng 0.7.1, asyncio & pytest, but we did also see it from time to time when running our application on developer machines and stopping the application)

ntakouris commented 1 year ago

I can confirm this bug, with the same setup.

codypiersall commented 1 year ago

This is a good find. Thanks for posting the issue, and I think you nailed the problem.

I'm not sure the best solution either. A crash is never good. I can't completely remember why we even call nni_fini(), maybe just for good hygiene? I can only think of a couple use cases for calling it:

If trying to detect a memory leak with Valgrind, it will report that a lot of stuff leaked at the process exit
If a process is embedding a Python interpreter, this would allow it to shut it down and restart it (possibly)

Maybe we could add a global variable to pynng, a bool called nng_fini_at_exit, that gets checked when deciding whether to call nng_fini() or not. It should probably default to False to avoid any process-crashing race conditions.

Thoughts?

broglep-work commented 1 year ago

I think that solution is ok, and having it default to False is also good as for most people it probably would not make much of a difference as the process is anyway about to end. For added flexibility we could expose the nng_fini in the pynng api, so that the user can call it themselves at the appropriate time. In that case maybe we could even get rid of nng_fini_at_exit and _pynng_atexit and let the user himself register atexit.register() if it is necessary for the usecase (and have an corresponding section in the documentation about that)

jason-cso commented 1 year ago

I ran into the same bug and wrote a simple repro case. About half the time I get nni logs like OP.

#!/usr/bin/env python3

from pynng import Surveyor0, Respondent0, Timeout
import os
import signal
import threading
import traceback
import time

address = "tcp://127.0.0.1:13812"

def daemon(function):
    t = threading.Thread(target=function, daemon=True)
    print(f"Starting thread: {function.__name__}() as {t.name}")
    t.start()
    return t

def s():
    with Surveyor0(listen=address) as surveyor:
        surveyor.survey_time = 500  # milliseconds
        while True:
            surveyor.send(b"foo")
            try:
                while True:
                    m = surveyor.recv()
                    print(f"Got {m} from client")
            except Timeout:
                pass

def c():
    with Respondent0(dial=address) as responder:
        while True:
            m = responder.recv()
            print(f"Got {m} from server")
            responder.send(b"bar")

daemon(s)
time.sleep(0.1)
daemon(c)
time.sleep(2)

olivierpelet commented 1 year ago

I ran into the same bug and wrote a simple repro case. About half the time I get nni logs like OP.

#!/usr/bin/env python3

from pynng import Surveyor0, Respondent0, Timeout
import os
import signal
import threading
import traceback
import time

address = "tcp://127.0.0.1:13812"

def daemon(function):
    t = threading.Thread(target=function, daemon=True)
    print(f"Starting thread: {function.__name__}() as {t.name}")
    t.start()
    return t

def s():
    with Surveyor0(listen=address) as surveyor:
        surveyor.survey_time = 500  # milliseconds
        while True:
            surveyor.send(b"foo")
            try:
                while True:
                    m = surveyor.recv()
                    print(f"Got {m} from client")
            except Timeout:
                pass

def c():
    with Respondent0(dial=address) as responder:
        while True:
            m = responder.recv()
            print(f"Got {m} from server")
            responder.send(b"bar")

daemon(s)
time.sleep(0.1)
daemon(c)
time.sleep(2)

I've forked and updated nng (and mbedtls) to the latest version, and the code does not crash anymore. Also, I've seen that @broglep-work and another user made some patches in a fork here. I will probably include these patchs in my branch as well

codypiersall / pynng

NNG Panic on Python shutdown #108