StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu
Apache License 2.0

Handle Python package that internally dlopen shared libraries #748

Open magnatelee opened 4 years ago

magnatelee commented 4 years ago

Right now, a Python module that loads a shared library can cause problems when the Python code is running on multiple nodes without DCR. In particular, when that shared library contains task implementations, the nodes other than the one that imported the module can't see those tasks, as the shared library hasn't been loaded on those nodes. I briefly discussed this issue with @lightsighter and @elliottslaughter and there seem to be a few options:

  1. Never running Python programs without DCR. This may preclude packages that can't use DCR for some reason (e.g., non-determinism in internal data structures).
  2. A command line flag (e.g., -lg:pyimport) that takes the list of Python modules to preload. This is possible only when the user can provide the list upfront.
  3. Launching remote tasks that call import .... This would be the ugliest solution, but can be done without any changes in the runtime.
  4. A runtime mechanism to globally load a shared library (global dlopen).

My preference is either 1 or 2 as a fallback, but @lightsighter may argue that we can't force DCR in all cases. @lightsighter @elliottslaughter @streichler any comments/thoughts?
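
To make the failure mode and option 3 concrete, here is a minimal toy model in plain Python. It does not use Legion or Realm: the `Node` class, its `tasks` registry, and the `my_library` dictionary are hypothetical stand-ins for a process, its task table, and a module whose shared library registers tasks when loaded.

```python
# Toy model only: Node and its registry are hypothetical stand-ins,
# not Legion or Realm APIs.

class Node:
    """One process in a multi-node run, with its own task registry."""

    def __init__(self, rank):
        self.rank = rank
        self.tasks = {}  # task name -> callable, local to this node

    def import_module(self, module):
        # Stands in for `import module`, which would dlopen a shared
        # library and register the task implementations it contains.
        self.tasks.update(module)

    def can_run(self, task_name):
        return task_name in self.tasks


# A fake "module": the tasks its shared library would register.
my_library = {"saxpy": lambda a, x, y: a * x + y}

nodes = [Node(rank) for rank in range(4)]

# Without DCR, only the node running the top-level Python code imports it.
nodes[0].import_module(my_library)
visible_before = [node.can_run("saxpy") for node in nodes]

# Option 3: run an "import" step on every other node as well.
for node in nodes[1:]:
    node.import_module(my_library)
visible_after = [node.can_run("saxpy") for node in nodes]

print(visible_before)  # [True, False, False, False]
print(visible_after)   # [True, True, True, True]
```

Option 2 would amount to running the same loop at startup from a user-supplied module list, and option 4 would move the loop into the runtime.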

elliottslaughter commented 4 years ago

It would be nice if I could use the C interface for this. I'd rather not be forced to figure out how to do the codegen to call C++ methods. (Even if the first thing I do is convert the C++ objects into C API handles, that still involves going through a C++ shim to get there.)

I'm not quite sure I see how to factor this so that the main Legion doesn't need to be aware of the C API while still going with option (2). My intention with (1) was to provide a way around this so that the core Legion implementation doesn't need special support for the C API.

lightsighter commented 4 years ago

I'm fine with the CodeDescriptor path needing a different type that is based on C types. It won't be hard to make such a function type. What is still missing is a way for me to invoke a code descriptor myself.

lightsighter commented 4 years ago

Global registration callback functions are now supported in Legion and I've updated all the Legate libraries to use them. The global registration callbacks work correctly both with and without control replication. As in the Legate case, any shared object should be able to request a global registration callback in its constructor in order to ensure that the shared object is loaded on all nodes in the system. The runtime automatically deduplicates them so that this loading does not require O(N^2) messages.
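
The message-count argument can be sketched with a small model. This is an illustration under stated assumptions, not a description of Legion's internals: `ToyRuntime`, `request_global_callback`, and the single shared `seen` set are hypothetical, and the set stands in for whatever distributed bookkeeping the real runtime uses.

```python
# Hypothetical model of deduplicating global registration callbacks.
# It only counts messages; it is not Legion's implementation.

def naive_messages(num_nodes):
    # Without deduplication, every node that loads the shared object
    # asks every other node to load it too.
    return num_nodes * (num_nodes - 1)


class ToyRuntime:
    def __init__(self, num_nodes):
        self.num_nodes = num_nodes
        self.seen = set()             # callbacks already handled
        self.messages = 0             # messages sent between nodes
        self.runs = [0] * num_nodes   # times the callback ran per node

    def request_global_callback(self, shared_object, symbol):
        key = (shared_object, symbol)
        if key in self.seen:
            return  # duplicate request: nothing to send, nothing to run
        self.seen.add(key)
        # The first request is broadcast to the other nodes.
        self.messages += self.num_nodes - 1
        for rank in range(self.num_nodes):
            self.runs[rank] += 1


num_nodes = 8
runtime = ToyRuntime(num_nodes)

# One node loads the library first; loading it on the remaining nodes
# runs the library constructor there, which issues the same request.
for _ in range(num_nodes):
    runtime.request_global_callback("libexample.so", "register_tasks")

print(naive_messages(num_nodes))  # 56
print(runtime.messages)           # 7
```

In this model each node runs the callback exactly once, and the total traffic grows linearly with the node count.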

I'm still open to the CodeDescriptor path, but I need to have a way to invoke the CodeDescriptor at the Legion level, which I don't believe that I have any way to do at the moment. Here would be the C type though for the function that the CodeDescriptor should contain:

void (*)(legion_machine_t, legion_runtime_t, legion_processor_t*, size_t)

elliottslaughter commented 4 years ago

I assume the CodeDescriptor issue is really a Realm thing? It seems like we have a path for generating tasks, just not a path for generating function pointers.

lightsighter commented 4 years ago

> I assume the CodeDescriptor issue is really a Realm thing?

Yes, there's no way to invoke a generic code descriptor in Realm today. I would effectively have to write a shim that wraps the code descriptor into another code descriptor that has the right type for a Realm task so I can run it. I don't know how to do that today for a code descriptor, which could be anything internally.

> It seems like we have a path for generating tasks, just not a path for generating function pointers.

To be clear, I don't need a function pointer as I will identify the function by its shared object name and its symbol name in that shared object. What I do need though is a way to run the code inside of the code descriptor object with arguments that I control.
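
Identifying a function by shared object name plus symbol name is the usual dlopen/dlsym pattern. The sketch below shows it from Python with `ctypes`. The `resolve` helper is hypothetical, and the example assumes a POSIX system where passing `None` opens the running process and the C library's `strlen` is visible in it. It says nothing about how a Realm code descriptor would be invoked.

```python
import ctypes


def resolve(shared_object, symbol):
    """Look up `symbol` in `shared_object` (None means the current process)."""
    library = ctypes.CDLL(shared_object)  # dlopen
    return getattr(library, symbol)       # dlsym; AttributeError if missing


# Assumption: POSIX, with the C library's strlen visible in this process.
strlen = resolve(None, "strlen")
strlen.argtypes = [ctypes.c_char_p]
strlen.restype = ctypes.c_size_t

length = strlen(b"legion")
print(length)  # 6
```

Because the name pair is just two strings, it can be sent to another node and resolved there, which a raw function pointer cannot.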

lightsighter commented 4 years ago

An update to this issue: despite my best efforts, we've concluded that it is unsafe to call perform_registration_callback inside of dlopen, not because of anything that Legion does, but because Realm can't make new threads when inside of dlopen due to issues with TLS.

Is the only thing missing before we close this issue support for a "payload" of opaque data requested by @elliottslaughter? Anything else? I know that all the Legate libraries are working with this code so I'm confident it is working well.