Open eric-haibin-lin opened 4 years ago
There are a couple of thread-local variables at the Python level, too:
They suggest that if the frontend Python thread is switched, some of these contexts are lost.
Specifically, we'd need to use something like https://docs.python.org/3/library/contextvars.html. But ContextVar in its current form is not sufficient, as it doesn't allow us to hook in C API calls. This feature should be added to the Python standard library, and on the MXNet side we should use a patched version of ContextVar.
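To make the concern concrete, here is a minimal sketch of how frontend state kept in a thread-local is invisible from another thread, while a contextvars snapshot can be carried over explicitly. The "device" setting and the helper names are hypothetical stand-ins, not MXNet's actual variables.

```python
import contextvars
import threading

# Hypothetical stand-ins for a frontend-level setting (not MXNet's real names).
_tls = threading.local()
_ctx_device = contextvars.ContextVar("device", default="cpu")


def read_thread_local():
    # Falls back to "cpu" when the current thread never set the attribute.
    return getattr(_tls, "device", "cpu")


def run_in_thread(fn):
    # Run fn in a fresh thread and return its result.
    result = []
    t = threading.Thread(target=lambda: result.append(fn()))
    t.start()
    t.join()
    return result[0]


# Set the state in the current thread.
_tls.device = "gpu(0)"
_ctx_device.set("gpu(0)")

# The thread-local value does not follow us into another thread.
seen_tls = run_in_thread(read_thread_local)

# A context snapshot can be handed to another thread and run there.
snapshot = contextvars.copy_context()
seen_ctx_copied = run_in_thread(lambda: snapshot.run(_ctx_device.get))

print(seen_tls, seen_ctx_copied)
```

Note that this only covers state held in Python. Thread-local state on the C++ side is not affected by a copied context, which is the gap described above.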
@eric-haibin-lin I raised the question on the Python Bug Tracker. Python maintainers recommend refactoring MXNet to make state management pluggable / customizable.
Adding callbacks to contextvars is infeasible:
For extra context: context switches occur on every callback invocation in asyncio and there can be thousands of them per seconds (or even more). Adding any extra code to context switching code will noticeably degrade the performance.
Reference: https://bugs.python.org/issue39660#msg362370
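One possible reading of the "pluggable state management" recommendation is sketched below, purely as an illustration. All class and function names here (ThreadLocalBackend, ContextVarBackend, set_state_backend, get_state, set_state) are hypothetical and do not exist in MXNet. The idea is that the library reads and writes its state through a small interface, and the application picks whether that state is keyed by thread or by context.

```python
import contextvars
import threading


class ThreadLocalBackend:
    """Keeps state per OS thread."""

    def __init__(self):
        self._local = threading.local()

    def get(self, key, default=None):
        return getattr(self._local, key, default)

    def set(self, key, value):
        setattr(self._local, key, value)


class ContextVarBackend:
    """Keeps state per contextvars.Context, so it follows asyncio tasks."""

    def __init__(self):
        self._vars = {}
        self._lock = threading.Lock()

    def _var(self, key):
        # Create each ContextVar once, guarded against concurrent creation.
        with self._lock:
            if key not in self._vars:
                self._vars[key] = contextvars.ContextVar(key)
            return self._vars[key]

    def get(self, key, default=None):
        return self._var(key).get(default)

    def set(self, key, value):
        self._var(key).set(value)


# The backend in use; thread-local by default, replaceable by the application.
_backend = ThreadLocalBackend()


def set_state_backend(backend):
    global _backend
    _backend = backend


def get_state(key, default=None):
    return _backend.get(key, default)


def set_state(key, value):
    _backend.set(key, value)


# Usage: switch to context-based state and scope a change to a copied context.
set_state_backend(ContextVarBackend())
set_state("device", "gpu(0)")


def scoped():
    set_state("device", "gpu(1)")
    return get_state("device")


inner = contextvars.copy_context().run(scoped)  # change is visible inside the copy
outer = get_state("device")                     # outer context is unchanged
print(inner, outer)
```

With this shape, no callback into contextvars is needed: the library asks the backend for the current state at the point where it would otherwise read a thread-local.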
@leezu thanks for the followup.
I roughly skimmed through the code base and searched for classes with a ::Get method. Some of these classes are global singletons and have potential thread safety issues. Here is my initial assessment:
thread safe classes
These classes are thread safe
is a thread_local object or contains thread local variables
The thread safety depends on the lifecycle of the thread. Are there alternative ways to avoid them?
classes that are not thread safe
These classes are not thread safe and may cause bugs:
I didn't look into the C APIs. There are also lots of other thread_local objects spread around the code base:
Related issue: https://github.com/apache/incubator-mxnet/issues/17612