Open nilsdeppe opened 1 year ago
@nilsdeppe Sorry that this has sat so long without a reply.
Thanks for the great library! I started playing around with some of the examples and have really been struggling with the threading capabilities. Whenever I do `vt::theContext()->getWorker()` I get `-275`. I'm also not sure how to actually set the number of workers correctly. I used `vt::initialize(argc, argv, vt::WorkerCountType{4});`, but then the code hangs inside `vt::finalize();`. This was just changing some of the `hello_world` examples to try and run with OpenMP threads.
When I started writing VT, I thought that having worker threads was potentially useful/a good idea. However, as we gained users, I realized that all the apps use their own threading packages (mostly Kokkos). The cost of supporting worker threads (and making the runtime thread-safe) didn't seem worth it. Thus, the worker support is old and probably does not work correctly anymore.
Our new philosophy is that applications should do whatever they want with regard to threading, whether that be Kokkos/OpenMP/RAJA/etc., and VT should not interfere with this.
I am opening an issue to remove workers from VT as we don't plan to support them. Thanks for your interest in our library.
By the way, does your use case necessitate that the runtime support threading directly?
Regarding this point:

> Document how to have inline calls for collection objects on the same thread (basically elide the RTS, but have the call be recorded for LB timing and communication purposes)

Use `proxy.invoke` instead of `proxy.send`.
Thanks for the detailed answer! That definitely makes sense.
We don't need worker threads, though one MPI rank per core is not great because of intra-node MPI calls (I guess these can be zero-copy, but that also seems like a lot of work). Using Kokkos/etc. is totally fine and probably what we will end up doing anyway, given that the DOE machines are now all GPU-based. How does that interface with load balancing?

Specifically, we currently have our computational domain (solving hyperbolic PDEs) chopped up into little cubes; each core gets several cubes assigned, and the cubes can get moved around for load balancing (in the Charm++ implementation). With VT it seems we would want to have threads bound to cores to work on the cubes, and then use VT messaging for inter-node communication between cubes.

It's totally fine if the conclusion is that we need to somewhat manually do the LB of the cubes. We are actually working on that with Charm++ too, because communication awareness is critical, we know the exact communication pattern (no need to infer it from messages, just use a space-filling curve), and we know exactly how expensive each cube is based on the number of grid points.
Does that give you an idea of what we are doing?
`proxy.invoke` looks perfect, thanks! :D