StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu
Apache License 2.0
657 stars 146 forks source link

Realm: Report warning (or error) when GPU memory cannot be pinned with NIC #1681

Open elliottslaughter opened 2 months ago

elliottslaughter commented 2 months ago

Some systems, notably Slingshot 11, cannot pin the full GPU's framebuffer with the NIC, especially in a 1 rank/node configuration. I have seen this on both Perlmutter and Frontier. The key signature of this failure mode is a backtrace that goes through _gex_RMA_PutNB. For example:

[6] #6  <signal handler called>
[6] #7  0x00007fb26be6d7d6 in gasnetc_rdma_put () from .../legion/language/build/lib/librealm.so.1
[6] #8  0x00007fb26be63ee9 in gasnete_put_nb () from .../legion/language/build/lib/librealm.so.1
[6] #9  0x00007fb26afc0555 in _gex_RMA_PutNB (_constantp_never_nbrhd=1, _flags=0, _lc_opt=<optimized out>, _nbytes=262144, _src=<optimized out>, _dest=<optimized out>, _rank=5, _tm=<optimized out>) at .../gasnet/release/include/ofi-conduit/gasnet_extended.h:112
[6] #10 Realm::XmitSrcDestPair::push_packets (this=0x7f88587a7020, immediate_mode=<optimized out>, work_until=...) at .../legion/runtime/realm/gasnetex/gasnetex_internal.cc:2416
[6] #11 0x00007fb26afc2ece in Realm::GASNetEXPoller::do_work (this=0x5d77720, work_until=...) at .../legion/runtime/realm/gasnetex/gasnetex_internal.cc:2944
[6] #12 0x00007fb26ae30fd8 in Realm::BackgroundWorkManager::Worker::do_work (this=0x7fb21c1eae20, max_time_in_ns=<optimized out>, interrupt_flag=0x0) at .../legion/runtime/realm/bgwork.cc:599
[6] #13 0x00007fb26ae318b1 in Realm::BackgroundWorkThread::main_loop (this=0xd5e8ed0) at .../legion/runtime/realm/bgwork.cc:103

The issue is that Realm does not check the return code of gex_EP_BindSegment (which, admittedly, was only added in GASNet-EX 2022.3.3).

The short term patch that I am working with to turn this into a proper error is below. Note that is NOT intended for production, as Sean wants this to be a warning by default with an optional opt-in error flag:

diff --git a/runtime/realm/gasnetex/gasnetex_internal.cc b/runtime/realm/gasnetex/gasnetex_internal.cc
index 4a7833fb5..4e3f5bb2b 100644
--- a/runtime/realm/gasnetex/gasnetex_internal.cc
+++ b/runtime/realm/gasnetex/gasnetex_internal.cc
@@ -3406,7 +3406,12 @@ namespace Realm {
     assert(ep_index == xmitsrcs.size());
     xmitsrcs.push_back(new XmitSrc(this, ep_index));

-    gex_EP_BindSegment(ep, segment, 0 /*flags*/);
+    gex_System_SetVerboseErrors(1);
+    if (gex_EP_BindSegment(ep, segment, 0 /*flags*/) != GASNET_OK) {
+      log_gex_bind.fatal() << "failed to bind segment";
+      abort();
+    }
+    gex_System_SetVerboseErrors(0);

     uintptr_t base_as_uint = reinterpret_cast<uintptr_t>(base);
     segments_by_addr.push_back({ base_as_uint, base_as_uint+size,

This also relies on the newer GASNet return value from gex_EP_BindSegment and thus is not compatible with GASNet prior to 2022.3.0. I don't know how much or if we care about that.

If we do want this to be portable to earlier versions of GASNet, a more portable solution might look something like:

--- a/runtime/realm/gasnetex/gasnetex_internal.cc
+++ b/runtime/realm/gasnetex/gasnetex_internal.cc
@@ -3407,6 +3407,8 @@ namespace Realm {
     xmitsrcs.push_back(new XmitSrc(this, ep_index));

     gex_EP_BindSegment(ep, segment, 0 /*flags*/);
+    // check for possible failure of gex_EP_BindSegment():
+    assert(gex_EP_QuerySegment(ep) == segment);

     uintptr_t base_as_uint = reinterpret_cast<uintptr_t>(base);
     segments_by_addr.push_back({ base_as_uint, base_as_uint+size,

Again, adjusting for Realm code style and desired error behavior.

Note that, in either case, the patches above just turn this into a proper error. Therefore, if you want to actually work around the issue, you must do one of:

  1. Reduce the amount of memory you're attempting to pin (e.g., via -ll:fsize)
  2. Increase the number ranks per node (e.g., from 1 to 2 is sufficient to make my use case work on Frontier)

This appears to be responsible for one major class of issues in https://github.com/StanfordLegion/gasnet/issues/27

lightsighter commented 2 months ago

I think the polarity here is backwards: it should be an error by default with an option for users that know what they are doing to relax it to a warning if they are ok with the (major) performance repercussions that occur when you can't RDMA all your data directly into/out-of the GPU from the NIC.

@PHHargrove @bonachea Are there other GASNetEX API calls that have added return codes since earlier versions that we should also be checking? Does GASNet provide an option to add an attribute that will warn about ignored return values so we can confirm we're checking all of them?

bonachea commented 2 months ago

Are there other GASNetEX API calls that have added return codes since earlier versions that we should also be checking?

@lightsighter Changes to the GASNet-EX API in each release are meticulously documented in the GASNet ChangeLog, in sections titled "GASNet-EX Spec v0.X", which also indicates the GEX_SPEC_VERSION_{MAJOR,MINOR} version number macros that can be checked programmatically for the given change. The details are documented in the GASNet-EX Spec document.

The gex_EP_BindSegment() change in release 2022.9.0 appears to be the only "recent" change to an existing API return value, but releases routinely add new functionality in various forms; again the ChangeLog is intended to keep clients apprised of such changes.

Does GASNet provide an option to add an attribute that will warn about ignored return values so we can confirm we're checking all of them?

Not currently but thanks for the good idea!

elliottslaughter commented 2 months ago

One more dimension of this is gex_System_SetVerboseErrors(1). You can see an example of calling the API in my first diff at the top of this issue. This API call causes GASNet to print the original error message we get from libfabric, which can be helpful (maybe) in diagnosing failures.

Here's an example of what an error looks like when we are unable to pin the GPU framebuffer:

cxil_map: write error
*** WARNING (proc 1): Unexpected error -28 (No space left on device) from fi_mr_regattr() when binding segment [0x7fd2c0000000, 0x7fd6e6800000) to EP 8
*** WARNING (proc 1): GASNet gex_EP_BindSegment returning an error code: GASNET_ERR_RESOURCE (Problem with requested resource)
  from function gasnetc_ep_bindsegment_hook(i_ep, i_segment, flags)
  at /autofs/nccs-svm1_home1/eslaught/frontier/gb2024-subrank6-final-gpu-kinds/gasnet/GASNet-2023.9.0/gasnet_mmap.c:2065
  reason: GASNet encountered an error: GASNET_ERR_RESOURCE(10002)
[1 - 7fffe8647740]    0.000000 {6}{gexbind}: failed to bind segment
*** Caught a fatal signal (proc 1): SIGABRT(6)

It's not super informative, but it at least gives you an error code, "No space left on device", and the specific libfabric API call that failed.

On the one hand, it doesn't integrate with Realm's logger infrastructure. On the other hand when things are failing and we don't know why, that's not helpful either.

I just mention it as something else we should consider when fixing this.

lightsighter commented 2 months ago

That looks fine to me. This is before Realm has fully started anyway.

apryakhin commented 2 months ago

If we want it to be back-portable and be able to relax the crash-on-bind-failure, we should probably go with something like this?

    xmitsrcs.push_back(new XmitSrc(this, ep_index));

    gex_EP_BindSegment(ep, segment, 0 /*flags*/);
    // check for possible failure of gex_EP_BindSegment():
    if(gex_EP_QuerySegment(ep) != segment) {
       log_gex_bind.warning() << "failed to bind segment";
       if (abort_on_bind_failure) {
          ...
          abort();
       }
       return false;
    }

    uintptr_t base_as_uint = reinterpret_cast<uintptr_t>(base);
    segments_by_addr.push_back({ base_as_uint, base_as_uint+size,
elliottslaughter commented 2 months ago

We decided in the Legion meeting today that we want this to be error-by-default, maybe with an optional flag to downgrade to a warning.

I can put in a MR but I actually don't know how to thread through a flag. If we want the ability to downgrade the error to a warning, I'll need some help. @apryakhin

elliottslaughter commented 2 months ago

MR here: https://gitlab.com/StanfordLegion/legion/-/merge_requests/1244

elliottslaughter commented 2 months ago

The MR has been merged. I consider this a good enough fix for my purposes, but if someone would like to add an optional downgrade to a warning, they can feel free to do so.