Handle Python package that internally dlopen shared libraries

magnatelee commented 4 years ago

Right now, a Python module that loads a shared library can cause problems when the Python code is running on multiple nodes without DCR. In particular, when that shared library contains task implementations, the nodes other than the one that imported the module can't see those tasks, as the shared library hasn't been loaded on those nodes. I briefly discussed this issue with @lightsighter and @elliottslaughter and there seem to be a few options:

Never running Python programs without DCR. This may preclude packages that can't use DCR for some reason (e.g., non-determinism in internal data structures).
A command line flag (e.g., -lg:pyimport) that takes the list of Python modules to preload. This is certainly possible only when the user can provide this list upfront.
Launching remote tasks that call import .... This would be the ugliest solution, but can be done without any changes in the runtime.
A runtime mechanism to globally load a shared library (global dlopen).

My preference is either 1 or 2 as a fallback, but @lightsighter may argue about what we can't force DCR in all cases. @lightsighter @elliottslaughter @streichler any comments/thoughts?

lightsighter commented 4 years ago

I have a 5th option that I thought about: each python module can have a dummy python task variant that is never used for anything, but when a module is loaded on one node, it will request a global registration of the variant on all the nodes, which will then load the module on all nodes, allowing each module to do local registration of all its variants. The downside to this is you can potentially get O(N^2) global registration messages for the dummy variant, but feels like it's half a solution.

lightsighter commented 4 years ago

Also, this issue applies as well in cases where we are running with DCR but not all nodes have a shard.

elliottslaughter commented 4 years ago

Just to be clear, this issue only applies to native (C++) tasks. In Python, a task is identified by the tuple module_name, function_name, and Realm is smart and does the import on the fly if it doesn't already have it.

So we're really only talking about how to do global native task registration.

It's a hack, but one answer would be to say that whenever you call native tasks defined inside a Python library, you must call at least one Python task first (and have it replicated to all the nodes). This doesn't seem any worse than (3).

Another option, since we're talking specifically about shared native libraries is to ask users to actually use Realm's libdl support and to register native tasks as so_path and symbol. Then you get all the goodness that comes along with that portability too.

lightsighter commented 4 years ago

So we're really only talking about how to do global native task registration.

This isn't specific to python. It happens anytime a task calls dlopen on a library with task variants of any kind in it. It just happens that the Python interpreter is the first real case that we've encountered. I doubt it will be the last.

This doesn't seem any worse than (3)

That is option (3). You run a python task on every python processor in the system and import the module in the task body.

Another option, since we're talking specifically about shared native libraries is to ask users to actually use Realm's libdl support

Imagine there are 300K of them and they all have nasty template mangled names. How confident are we that this will both work and be efficient?

elliottslaughter commented 4 years ago

Ok, what if we instead say that a "shared library task" is registered only with the name of the library. The idea being, Realm will dlopen the library, and expect the library's initialization routines to do any task registration internally.

Basically, I'm trying to find ways to get this to piggyback off the existing Realm task loading infrastructure, rather than bolting on some new way to add libraries.

lightsighter commented 4 years ago

Ok, what if we instead say that a "shared library task" is registered only with the name of the library.

Ok, so instead of providing an explicit CodeDescriptor object for a task variant, we would instead tell Realm that you can find this task variant in this library at this symbol, which is pretty much what the DSO CodeDescriptor already does, it's just that Realm expects the library to already be loaded when it goes looking right now. So the change would be that Realm now may need to detect the case where the library isn't loaded yet and then dlopen it if it isn't.

elliottslaughter commented 4 years ago

I think we'd make the symbol field of the DSO CodeDescriptor optional (since that's what you're objecting to). But yes, then we'd teach Realm to track what libraries are open and dlopen when it sees a new one for the first time.

Basically, a CodeDescriptor is just a thing that tells you enough information to load a task, and in this case all you really need is "load this library and the task will register itself", making the actual symbol unnecessary.

lightsighter commented 4 years ago

300K lazily loaded code descriptors are more tolerable to me than 300K eagerly loaded code descriptors. I still feel like we need some kind of generic support for dynamic library loading in distributed environments to address cases like this.

lightsighter commented 4 years ago

I've added support for option 3 to the legion_python module. Everybody can call import_global on a module name or object and if its in the top-level task it will make sure that it gets registered on all the python interpreters that are running. The function will not return until all those imports are done. If any import fails then the function will raise an ImportError.

It's worth noting that option 1 is not viable for me as I already have Python libraries that have to be able to run in distributed settings without control replication.

lightsighter commented 4 years ago

I also think option 2 is too burdensome on the user, especially if they don't know which modules are imported inside of other modules they are using. It's fine as a stop-gap if someone wants to prototype it, but seems unrealistic to expect users to know if they are using some "special" module or not that has to be loaded in a different way.

lightsighter commented 4 years ago

I've tested my implementation of option 3 in a few Legate libraries now and it seems to work well both with and without control replication. I've come to prefer this option over any of the others. Users don't have to be aware that the library is going through a distributed import. It's only library developers that need to be aware of how to use option 3. Furthermore it doesn't require any runtime machinery at the Legion or Realm levels to implement and can be done with the existing primitives that we have now. Unless someone wants to object, I'm planning to close this issue in the next day or so.

streichler commented 4 years ago

I was re-reading the whole comment thread and this statement isn't accurate:

So the change would be that Realm now may need to detect the case where the library isn't loaded yet and then dlopen it if it isn't.

Realm will always try to dlopen the dso_name of a DSOReferenceImplementation exactly once, relying on libdl to hand back the existing module if it's already loaded: https://gitlab.com/StanfordLegion/legion/-/blob/master/runtime/realm/codedesc.cc#L365-378

If that shared object has an _init function that does its own task registrations (which can be function-pointer-based), you get what @elliottslaughter was describing:

The idea being, Realm will dlopen the library, and expect the library's initialization routines to do any task registration internally.

elliottslaughter commented 4 years ago

I don't mind that option (3) exists, but I still feel it's a non-ideal situation that will recur in the future. Per @streichler's comment I think it's worth at least trying the DSO solution to see if that approach can work.

lightsighter commented 4 years ago

Would Realm be able to guarantee that those registrations are done globally across all Python interpreters on all nodes? If not, then the DSO solution is useless in the Legate case.

elliottslaughter commented 4 years ago

Can you explain more? Why does it need to be global?

lightsighter commented 4 years ago

If I load the legate.dask module on one node and launch a task that goes to another node, then the legate.dask task implementation better be registered on that remote node before Realm sees it. The only way to guarantee that is that the task implementation for that python function is registered with Realm and the only way to guarantee that is that the legate.dask module has been loaded in the Python interpreter on the remote node.

lightsighter commented 4 years ago

The problem is guaranteeing that a single Python module has been loaded in all Realm Python interpreters so any Python functions that represent task variants have been registered as task variants on that node before Realm sees a request to run one of them.

streichler commented 4 years ago

Unless I'm missing something, waiting on the completion event of a call to Processor::register_task_by_kind with global == true would do exactly that?

lightsighter commented 4 years ago

Global registrations work with Python code descriptors?

Also, how do you solve the O(N^2) problem that occurs with control replication where the module is loaded on every node and then they all try to do global registrations to all other nodes?

Note that my solution (3) above solves both those problems. For another solution to be comparable it has to solve them both also.

streichler commented 4 years ago

Global registrations work with Python code descriptors?

They should. I don't know that we actually test it anywhere though.

Also, how do you solve the O(N^2) problem that occurs with control replication where the module is loaded on every node and then they all try to do global registrations to all other nodes?

Well, the right answer I think would be that something that is aware of the existence of control replication would de-duplicate it to avoid unnecessary work. But if that didn't happen and O(N) requests got sent to each node, libdl would effectively do the de-duplication on each node because the DSO will only be loaded once.

lightsighter commented 4 years ago

Well, the right answer I think would be that something that is aware of the existence of control replication would de-duplicate it to avoid unnecessary work.

Nobody should ever need to know whether they are in a control replication context or not when they are running, the semantics should be the same either way. You said that and I still agree with you. ;) Note that my solution (3) above doesn't require either the caller or the implementation to check whether we are control replicated or not. I would expect the same to be true of any of other solution as well.

But if that didn't happen and O(N) requests got sent to each node, libdl would effectively do the de-duplication on each node because the DSO will only be loaded once.

That still causes O(N^2) traffic in the network and will not be scalable

streichler commented 4 years ago

Nobody should ever need to know whether they are in a control replication context or not when they are running, the semantics should be the same either way.

The "something" I was referring to in this case was the Legion runtime.

lightsighter commented 4 years ago

Legion doesn't know anything about Python modules and I don't want it to. If we're going to push this up to the Legion runtime then I'm going to push it up to my solution (3) and say we're done. Legion isn't going to do anything more intelligent than what I've already implemented.

elliottslaughter commented 4 years ago

@lightsighter isn't this a problem with dynamic task registration and DCR, period? If your set N of shards is not equal to your set M of runtimes, you cannot do local registration because N != M but you also don't want to do global registration because that costs you O(NM) messages.

If I'm not mistaken this a problem regardless of anything else we've discussed above.

lightsighter commented 4 years ago

Sure, I can solve that problem, still doesn't handle registration of projection functors, sharding functors, mappers, etc that also need to registered by Python modules, and there is no notion of "global" registration for those.

elliottslaughter commented 4 years ago

Maybe we should add global registration for those (along with the ability to register them as Realm code descriptors so that they can be actually portable). Otherwise it seems to me that dynamic task registration is fundamentally broken for a variety of use cases, again even aside from anything we've been specifically talking about in this issue.

lightsighter commented 4 years ago

How do you globally (and portably) register a C++ object, especially one that needs to take parameters (e.g. like a mapper) on different nodes?

elliottslaughter commented 4 years ago

You register a blob of code (possibly with a blob of arguments, if necessary) that constructs and returns the object in question, or else constructs it and passes it along to some other API (possibly a local registration, but I don't really care).

lightsighter commented 4 years ago

Realm doesn't provide any way to make C++ object constructor's function pointers with arbitrary types portable across processes.

elliottslaughter commented 4 years ago

Why can't you just make a function that does the construction?

MyObject my_ctor(...) {
  return MyObject(...);
}

And pass this through a Realm code descriptor?

Maybe not as ergonomic as you'd like, but seems to do the job.

lightsighter commented 4 years ago

Because the interface for registering mappers doesn't take a type, it takes an instantiated object: https://gitlab.com/StanfordLegion/legion/-/blob/master/runtime/legion.h#L6929-6941

streichler commented 4 years ago

Because the interface for registering mappers doesn't take a type, it takes an instantiated object:

Actually, that interface is problematic in a subprocess world (the object allocated by an application thread isn't necessarily visible to all utility procs) as well, so maybe we should consider a change here that fixes both issues?

elliottslaughter commented 4 years ago

Right, I'm proposing to create a new interface for registration that takes a Realm code descriptor (or however you want to represent the registration thunk), and either have that thunk return the object in question or introduce a further API for where the object should go when the thunk runs.

lightsighter commented 4 years ago

@elliottslaughter's solution won't fix the problem that @streichler is trying to fix. Making a function isn't going to handle the memory allocation in subprocess problem. There are tens of calls where the runtime takes in things allocated by the application for registration right now that need to be fixed.

lightsighter commented 4 years ago

I think I have a different plan that will solve these two issues in independent ways.

elliottslaughter commented 4 years ago

I think my solution fixes the problem as long as the task shim that wraps the registration thunk enables the Realm malloc.

elliottslaughter commented 4 years ago

Er, not the task shim (this isn't necessarily asynchronous), but the runtime API that calls the registration thunk.

lightsighter commented 4 years ago

That's not going to work when Legion has it's own heap which will never be allowed to be visible to the user.

elliottslaughter commented 4 years ago

I'm not sure why you say it wouldn't work. Legion is going to use its own memory allocator for its own internal data structures. Nothing I'm suggesting here changes that. The issue (I assume) is that the user's object itself also needs to be put in some heap which is visible to various processes, and this part at least makes sure any user allocations will be in a subprocess-visible heap. Otherwise you'd force users to rewrite all their mappers to be valid in a subprocess world, which I don't think it particularly viable either.

lightsighter commented 4 years ago

A mapper is not a "user object" anymore after it gets registered with the runtime, the runtime owns it and it needs to be visible to all sub-processes. Same thing is true of sharding functors and projection functors and attach functors and reduction operators and ...

Furthermore, unlike Realm, I don't plan to make my allocator visible to the whole world.

elliottslaughter commented 4 years ago

Ok, but you realize that a mapper is not a single memory allocation either. It may contain std::vector or who knows what custom data structure. I don't think relying on Legion's approach of annotating every single data structure with a custom allocator is realistic here, even for STL that's a pain, and the user may be using some data structure that's not defined in the STL.

I think if you want mappers in subprocess not to be a massive pain, you need an approach where you set the default allocator and let it catch whatever the user creates.

lightsighter commented 4 years ago

I agree with that, but I'm not going to do it by exposing a "Legion allocator". Legion will do the allocation of the Mapper object, but it's not going to be visible to the user and it's not going to be a distributed registration across all the nodes in the system either.

elliottslaughter commented 4 years ago

Perhaps we should move the discussion of allocators and runtime objects into a new issue, we're pretty far off on a tangent at this point.

At any rate, I hope I've demonstrated that it's possible to have an interface in which runtime objects are registered via callbacks such that we can use Realm code descriptors. Such an approach is, I believe, compatible with a subprocess world, though it's not the only way to address this. More importantly, the callbacks via code descriptors enables such registrations to be portable, so that you can do true global registration of Legion objects. I think this is important if you want to have truly first class dynamic registration; if you have dynamic registration only for tasks there are a wide variety of things you simply won't be able to do (at least without substantial bending over backwards). In my opinion, the difficulties we've seen with Python are just a special case of that (Python being a more dynamic language exacerbates issues we'd expect to see in other dynamic systems). While I don't object to the current workaround, I do think what I've suggested above is a more principled approach and also helps us deal with other cases.

There may still be low-level details to address; I'm not claiming what I suggest above is exhaustive. I do think based on what I've seen of the discussion so far that these are mere "details" and could be resolved in a sensible manner with reasonable effort. However further discussion should probably also be moved elsewhere because at a certain level this is not really about Python but about enabling dynamic registration of Legion objects other than code.

lightsighter commented 4 years ago

There is absolutely no way I'm going to support registering every kind of runtime object with a code descriptor API. That is just an insane amount of work. I will outline my solution here later today once it is working.

lightsighter commented 4 years ago

I have added my one and only hook that I am going to provide in Legion for addressing this problem. Users can now ask that the runtime perform a registration callback function in a global way after the runtime has started. https://gitlab.com/StanfordLegion/legion/-/blob/master/runtime/legion.h#L7552-7564 If global invocation is requested, then this call will not return until the registration has been performed on every node. The runtime only uses O(N) messages to make this happen and will avoid a O(N^2) message storm. It guarantees that this callback function is invoked at most once in any address space anywhere in the system.

You can use this method in a shared object to ensure it is globally loaded everywhere:

namespace {
    static void __attribute__ ((constructor)) perform_registration(void) {
        Runtime::perform_registration_callback(callback_function, true/*global*/);
    }
}

When dlopen is called inside of a Legion task that will dlopen this object on one node which will invoke the constructor and ask the runtime to globally perform the callbacks. In the process of migrating this function to remote nodes we will automatically dlopen the same shared object on remote nodes. The runtime ensures the registration callback is only performed once on each node. (We also avoid potential deadlock cases here as well.) In the case of control replication, we will know to only send messages to nodes where there aren't shards. In the case of multiple shards in the same process dlopen naturally deduplicates the calls to this for us. Note this method will also work if you dlopen the shared object and then explicitly invoke the perform_registration_callback from the task as well, both with and without control replication.

elliottslaughter commented 4 years ago

I'm confused about how global works in non-DCR (or in DCR where shards != runtimes). As best I can tell, RegistrationCallbackFnptr is a literal function pointer. I don't see how this can be migrated automatically to other nodes. Is there some logic here that I'm missing, or do we need a way to do this with a Realm code descriptor?

lightsighter commented 4 years ago

Realm provides mechanisms for converting function pointers into DSO code implementations: https://gitlab.com/StanfordLegion/legion/-/blob/master/runtime/legion/runtime.cc#L11410-11424 and vice-versa: https://gitlab.com/StanfordLegion/legion/-/blob/master/runtime/legion/runtime.cc#L16381-16396

elliottslaughter commented 4 years ago

I think in order for this to be useful for dynamic code generation, I would need one of the following two options:

Pass a blob of void * and size_t to the registration callback (so I can pack the LLVM bitcode, or whatever I've got).
Add a code path where the registration callback itself can be a Realm code descriptor, so I can generate the callback itself dynamically.

I'm fine either way and either should be sufficient for my purposes.

lightsighter commented 4 years ago

I have a slight preference for (2), but Legion does need to be able to invoke this code descriptor because it has to pass in argument that are Legion types. So how would Legion go about calling the code descriptor to do callback once we get it on a remote node?

lightsighter commented 4 years ago

It's worth noting though that I do know the type of the function that the CodeDescriptor should represent, if that helps in anyway with JIT-ing it or converting it into something that is callable. Downside is that the function pointer as it stands today is a C++ function pointer with C++ types in its type signature.

StanfordLegion / legion

Handle Python package that internally dlopen shared libraries #748