StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu
Apache License 2.0
683 stars 145 forks source link

Realm: Avoiding "reservation cannot be satisfied" warning when pinning is impossible #1718

Open manopapad opened 3 months ago

manopapad commented 3 months ago

Currently, if I request any processors that are unconditionally pinned (e.g. -ll:cpu or -ll:py), but the system does not support core pinning (e.g. MacOS, or a containerized run) then Realm will print out this warning message:

[0 - 7afa98f59c40]    0.089034 {4}{threads}: reservation ('Python-1 proc 1d00000000000003') cannot be satisfied

This message tends to be cryptic to Legate end-users, e.g. see https://github.com/nv-legate/cunumeric/issues/1104. We would like to avoid this error message if pinning is impossible on the current machine.

We could always disable warnings from the "threads" logger, but I'd prefer not to blanket-disable all warnings.

One suggested workaround is as follows:

elliottslaughter commented 3 months ago

Can we make this simpler and silence the warning in debug mode? (And no other checks about whether it's "possible" to pin.)

Otherwise I worry that we'll be shutting off the warning in cases that really matter. For example, who's to say that users don't care about performance when they run in Docker? At some point someone will do that, fail to see the warning, and not realize they're not getting any sort of pinning. This situation is far harder to identify than e.g. profiling in debug mode; we're talking about trading something that is trivial to identify for something that has a lot of complexity involved.

I guess it would be ok to shut off the warning in macOS unconditionally since no one does performance runs there, but I would want the warning to be enabled all the time on Linux in release mode.

manopapad commented 3 months ago

I'm looking for a solution that covers release mode as well (I don't have an opinion about disabling the warning in debug mode).

Note that I'm not suggesting we turn off the warning. The Realm team has been explicit that requesting pinning (even implicitly) will always produce the warning if the pinning fails.

My problem is that the warning is cryptic to anyone who doesn't speak Realm. I would therefore want a way to detect problematic cases (either [1] pinning is impossible, or [2] the user asked for more processors than what the system can accommodate), report them at a level of abstraction that Legate users can understand (print nothing for case [1], as it's not actionable, and print a meaningful message using Legate-level options in case [2]), and in both these cases configure Realm so it doesn't feel the need to print its own warning.

Another alternative that has been suggested in the past would be for Realm to provide callbacks for higher-level code to have the ability to interpose these warnings, and provide an alternative message.

muraj commented 3 months ago

I would bring this up in the next realm meeting and ask for a target release. I think the callback method is a better way to deal with this in general and migrates people away from relying on Realm's logger.

elliottslaughter commented 3 months ago

report them at a level of abstraction that Legate users can understand

This part I agree with wholeheartedly.

print nothing for case [1], as it's not actionable

This part I strongly object to. There is no way for us to know if this is actionable or not. For example, if the user is running a containerized build, maybe this is outside of their direct control (e.g., require sysadmin intervention). But then contacting a sysadmin might be exactly what the user needs to do to get the performance they want. We shouldn't give up just because it's hard. We should report problems when we know they exist.

Even in the case of macOS, the action is "don't do performance runs on macOS". That's still important information for the user to know, especially if they got the (mistaken) impression that they can do meaningful performance comparisons on that platform.

I strongly agree that we should do the best we can to explain what happened in an actionable way, but these are exactly the sorts of things we need to be exposing because otherwise users silently get into bad cases and often have no idea they're even missing an optimization they need.

manopapad commented 3 months ago

Good point, I will revise my goal to distinguish between cases [1] and [2] from above, print out a warning on both cases (different for each case) at the right abstraction level, and skip the Realm warning. I will follow up with the Realm team at an upcoming meeting to discuss the options for enabling this route on the Realm side.

apryakhin commented 3 months ago

Just to update - the decision was to go with option (1), we allow to register a callback in Realm and let legate to handle the warning. If no callback is registered, Realm will report the warning.