StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu
Apache License 2.0

Legion: profile mpi handshake #1736

Open syamajala opened 1 month ago

syamajala commented 1 month ago

Currently, when you profile a Legion application that uses an MPI handshake, you end up with gaps in the profile where Legion hands off to MPI. It would be nice if the profile had a box showing when time is being spent outside of Legion. It should probably begin at legion_handoff_to_mpi and end at mpi_handoff_to_legion.

lightsighter commented 1 month ago

Which processor or channel would you render this on? Can you do a hand drawn sketch or something that shows what you think it should look like?

elliottslaughter commented 1 month ago

We could add an "MPI" processor in the profile. After all, MPI uses its own threads for doing compute, and in principle Legion can go off and do other things at the same time (if the network behaves itself...). It doesn't make sense to show on any of the existing Legion slots but I think it would be fair to render as its own kind in the profile. And time spent blocked in MPI really is time spent in a real sense, so it doesn't seem inaccurate.

lightsighter commented 1 month ago

I agree that adding an "external" channel is probably the right way to go. (I wouldn't say it is MPI specifically.) I guess we can just time when the handshakes are triggered and render that as "external" time.

elliottslaughter commented 1 month ago

I'm fine with calling it "external".

In the case of MPI, I think the abstraction works well: assuming that MPI is running one thread per rank, the handoff to MPI unblocks the MPI thread and allows it to start running. MPI could potentially communicate, but due to the blocking nature of MPI, the CPU thread will still be occupied during this time. When MPI finally hands back to Legion, the MPI thread blocks and stops running on the CPU. So the handoff-to-handoff time does represent actual utilization of the CPU (possibly blocked waiting on MPI communication, but still using that resource).

This means it makes sense to draw a utilization plot for the external processor since there is some resource being used.

I'm not sure the abstraction holds for every possible choice of external programming model. If we were to hand off to, say, Charm++, then there would not necessarily be a thread running for the entire duration between handoffs. We could still render it as a box on the profile, but we'd have to be more careful to note that the box doesn't necessarily indicate utilization of a specific CPU resource. I guess it would still accurately model Legion being blocked, just not usage of the thread or core.

Anyway, I think this is good enough for now and we can go ahead with this model. The MPI case works and we can discuss other cases if they come up in practice.

lightsighter commented 1 month ago

Even MPI might not be running the whole time if it blocks in something like an all-reduce. I'm fine, though, with representing the external execution as a solid box on an "external" processor. Let me know what data you want me to dump out for the profiler.