Open nilsdeppe opened 1 year ago
@nilsdeppe Sorry that this has sat so long without a reply.
Thanks for the great library! I started playing around with some of the examples and have really been struggling with the threading capabilities. Whenever I do `vt::theContext()->getWorker()` I get `-275`. I'm also not sure how to actually set the number of workers correctly. I used `vt::initialize(argc, argv, vt::WorkerCountType{4});`, but then the code hangs inside `vt::finalize();`. This was just changing some of the `hello_world` examples to try and run with OpenMP threads.
When I started writing VT, I thought that having worker threads was potentially useful/a good idea. However, as we gained users, I realized that all the apps use their own threading packages (mostly Kokkos). The cost of supporting worker threads (and making the runtime thread-safe) didn't seem worth it. Thus, the worker support is old and probably does not work correctly anymore.
Our new philosophy is that applications should do whatever they want with regard to threading, whether that be Kokkos/OpenMP/RAJA/etc., and VT should not interfere with this.
I am opening an issue to remove workers from VT as we don't plan to support them. Thanks for your interest in our library.
By the way, does your use case necessitate that the runtime support threading directly?
Regarding this point:

> Document how to have inline calls for collection objects on the same thread (basically elide the RTS, but have the call be recorded for LB timing and communication purposes)

Use `proxy.invoke` instead of `proxy.send`.
Thanks for the detailed answer! That definitely makes sense.
We don't need worker threads, though one MPI rank per core is not great because of intra-node MPI calls (I guess these can be zero-copy, but that also seems like a lot of work). Using Kokkos/etc. is totally fine and probably what we will end up doing anyway, given that the DOE machines are now all GPU-based. How does that interface with load balancing?

Specifically, we currently have our computational domain (solving hyperbolic PDEs) chopped up into little cubes; each core gets several cubes assigned, and the cubes can get moved around for load balancing (in the Charm++ implementation). With VT it seems we would want to have threads bound to cores to work on the cubes, and then use VT messaging for inter-node communication between cubes.

It's totally fine if the conclusion is that we need to somewhat manually do the LB of the cubes. We are actually working on that with Charm++ too, because communication awareness is critical, we know the exact communication pattern (no need to infer it from messages, just use a space-filling curve), and we know exactly how expensive each cube is based on the number of grid points.
Does that give you an idea of what we are doing?
`proxy.invoke` looks perfect, thanks! :D