Open broglep-work opened 2 years ago
I can confirm this bug, with the same setup.
This is a good find. Thanks for posting the issue, and I think you nailed the problem.
I'm not sure of the best solution either. A crash is never good. I can't completely remember why we even call nni_fini(), maybe just for good hygiene? I can only think of a couple use cases for calling it:
Maybe we could add a global variable to pynng, a bool called nng_fini_at_exit, that gets checked when deciding whether to call nng_fini() or not. It should probably default to False to avoid any process-crashing race conditions.
Thoughts?
I think that solution is OK, and having it default to False is also good: for most people it probably would not make much of a difference, as the process is about to end anyway. For added flexibility we could expose nng_fini in the pynng API, so that users can call it themselves at the appropriate time. In that case maybe we could even get rid of nng_fini_at_exit and _pynng_atexit and let users register it with atexit.register() themselves if their use case requires it (with a corresponding section in the documentation about that).
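A rough sketch of what user-side registration could look like. The two placeholder functions are hypothetical and only record that they ran; pynng does not necessarily expose nng_fini under any public name. The sketch relies on the documented behaviour that atexit runs handlers in reverse registration order.

```python
import atexit

# Records the shutdown order for illustration.
events = []


def nng_fini_placeholder():
    """Stand-in for an nng_fini() that pynng would expose (hypothetical)."""
    events.append("fini")


def close_sockets_placeholder():
    """Stand-in for application code that closes its own sockets."""
    events.append("close sockets")


def install_exit_handlers(register=atexit.register):
    # atexit calls handlers last in, first out, so the library teardown is
    # registered first and therefore runs last. The register parameter only
    # exists so the ordering can be checked without exiting the process.
    register(nng_fini_placeholder)
    register(close_sockets_placeholder)


install_exit_handlers()
```

Ordering between atexit handlers does not by itself cover objects whose __del__ runs later during interpreter shutdown, so the application would still need to close or release its pynng objects before the teardown handler runs.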
I ran into the same bug and wrote a simple repro case. About half the time I get nni logs like OP.
#!/usr/bin/env python3
from pynng import Surveyor0, Respondent0, Timeout
import os
import signal
import threading
import traceback
import time

address = "tcp://127.0.0.1:13812"

def daemon(function):
    t = threading.Thread(target=function, daemon=True)
    print(f"Starting thread: {function.__name__}() as {t.name}")
    t.start()
    return t

def s():
    with Surveyor0(listen=address) as surveyor:
        surveyor.survey_time = 500  # milliseconds
        while True:
            surveyor.send(b"foo")
            try:
                while True:
                    m = surveyor.recv()
                    print(f"Got {m} from client")
            except Timeout:
                pass

def c():
    with Respondent0(dial=address) as responder:
        while True:
            m = responder.recv()
            print(f"Got {m} from server")
            responder.send(b"bar")

daemon(s)
time.sleep(0.1)
daemon(c)
time.sleep(2)
I've forked and updated nng (and mbedtls) to the latest version, and the code does not crash anymore. Also, I've seen that @broglep-work and another user made some patches in a fork here. I will probably include these patches in my branch as well.
We experience (under certain conditions) a reproducible crash of the Python process.
My investigation showed that it is caused by access of nni_aio_lk, which was deinit'd by nng_fini / nni_fini / nni_aio_sys_fini. nni_aio_sys_fini is triggered by _pynng_atexit, and nni_aio_free by AIOHelper.__del__.
This looks like a timing issue on Python shutdown: depending on the order in which atexit functions are called and objects are garbage collected, nng may already have been deinitialized by pynng while pynng is still calling into the nng lib.
It is currently unclear how best to address this issue. Any ideas?
(for background, this occurs reproducibly in our CI when using pynng 0.7.1, asyncio & pytest, but we did also see it from time to time when running our application on developer machines and stopping the application)
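To make the suspected ordering problem concrete, here is a toy model of it. FakeNng and run are hypothetical names invented for this sketch and do not call into nng; the method names only mirror the functions mentioned in the report. In the real library the invalid order shows up as a crash rather than a log entry.

```python
class FakeNng:
    """Toy stand-in for the native library state; not the real nng API."""

    def __init__(self):
        self.initialized = True
        self.log = []

    def fini(self):
        # Models nng_fini(): global state such as the aio lock is torn down.
        self.initialized = False
        self.log.append("fini")

    def aio_free(self):
        # Models nni_aio_free(): it needs the global state to still exist.
        if not self.initialized:
            self.log.append("aio_free after fini (invalid)")
            return False
        self.log.append("aio_free")
        return True


def run(order):
    """Apply the shutdown steps in the given order and return the log."""
    lib = FakeNng()
    for step in order:
        getattr(lib, step)()
    return lib.log


print(run(["aio_free", "fini"]))  # objects released before teardown
print(run(["fini", "aio_free"]))  # teardown first, as suspected at shutdown
```

The second call models the case described above, where the atexit handler has already run by the time AIOHelper.__del__ is invoked.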