Closed gbtitus closed 2 years ago
@gbtitus I think we should unify the aprun and slurm HEAP_SIZE soon. I think we should just default to 16GB everywhere for now -- @benharsh was bit by this when trying to run on an aprun system.
I'll do that today.
Seeking feedback:
Gemini NICs have 16k internal page table entries and you can't register more memory than that many pages (causes an error). Aries NICs have a 16k-entry page table cache, and you can register more memory than that, but if your reference pattern causes cache refills then performance will be reduced by up to ~15% in the worst case. Currently for both NICs, if you set `CHPL_RT_MAX_HEAP_SIZE` to > 16k pages, we reduce your request so it fits in 16k pages and emit a warning. As part of this change I'd like to do that reduction for Gemini only. So on Aries, you could oversubscribe the TLB cache if you insisted. We'd still emit a warning for both NICs, though, so you'd know what happened.
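As a concrete sketch of that policy (the identifiers below are invented for illustration, not the actual comm layer code): on Gemini the request must fit in the NIC's 16k page-table entries, so we'd shrink it; on Aries we'd warn but let the request stand.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical clamping policy sketch.  NIC_PT_ENTRIES is the 16k
 * page-table (or page-table-cache) entry count described above. */
#define NIC_PT_ENTRIES 16384

static uint64_t clamp_heap_size(uint64_t requested, uint64_t page_size,
                                bool nic_is_gemini) {
  uint64_t max_registrable = (uint64_t)NIC_PT_ENTRIES * page_size;
  if (requested <= max_registrable)
    return requested;
  fprintf(stderr, "warning: CHPL_RT_MAX_HEAP_SIZE exceeds %d NIC pages\n",
          NIC_PT_ENTRIES);
  /* Gemini: hard limit, so shrink the request.
   * Aries: oversubscribing the TLB cache only costs performance, allow it. */
  return nic_is_gemini ? max_registrable : requested;
}
```

For example, with 2 MiB hugepages the registrable limit works out to 32 GiB, so a 48 GiB request would be clamped on Gemini but allowed (with a warning) on Aries.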
Unfortunately this effort seems to have hurt ISx performance at scale. I see a 20% regression at 64 nodes.
Edit: 20% regression at 128 nodes as well.
Where's my crying emoji, GitHub? :'(
No crying in HPC! We already knew the linear executeOn registration update broadcasting was performance-suspect. There's an item in the list for that.
I can run ISx all the way up to 256 nodes on XC-16 with a 48g heap, 128k stack size, and 5 trials. I got the following reported totals:
Nodes | Seconds |
---|---|
2 | 8.3 |
4 | 8.4 |
8 | 9.8 |
16 | 10.2 |
32 | 12.0 |
64 | 12.1 |
128 | 14.5 |
256 | 18.7 |
There are obvious bumps at 32, 128, and 256 nodes. This is probably due to the dynamic registration broadcast overhead, that is, the "broadcast scaling" item in the list for this issue. Note also the need for a 48g heap. It turns out this is because we run out of dynamic registration table space and have to revert to getting array memory from the heap (so we need a big heap). The registered memory region table has a fixed size of 100 entries. Based on conversation with @benharsh, the number of arrays is related to the number of cores on the compute nodes, and it also seems to be related to the number of trials. Fixing this will require making the table size dynamic and improving both the table access algorithm and the registration broadcasting.
I hacked comm=ugni to push the size of the memory registration table to 1000 entries (from 100), and then ISx would run even with a little 1g heap (12.0 secs for 32 nodes). Alas, on 256 nodes it wouldn't run even with a 16g heap; it segfaulted. I'll dig into this deeper after I can put some other stuff I'm in the middle of aside (max 1 day).
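A sketch of what "making the table size dynamic" could look like (the type and function names below are invented for illustration; the real comm=ugni structures differ): start at today's fixed size and grow geometrically instead of failing over to the heap.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical growable replacement for the fixed 100-entry
 * registered-memory-region table. */
typedef struct {
  uint64_t addr;
  uint64_t len;
} mem_region_t;

typedef struct {
  mem_region_t* regions;
  size_t count;
  size_t capacity;
} mem_region_table_t;

static void mrt_init(mem_region_table_t* t) {
  t->capacity = 100;            /* start at today's fixed size */
  t->count = 0;
  t->regions = malloc(t->capacity * sizeof(*t->regions));
}

/* Add an entry, doubling the table when it fills.  Returns the entry's
 * index, or -1 if we truly can't grow (caller falls back to the heap). */
static int mrt_add(mem_region_table_t* t, uint64_t addr, uint64_t len) {
  if (t->count == t->capacity) {
    size_t new_cap = t->capacity * 2;
    mem_region_t* p = realloc(t->regions, new_cap * sizeof(*p));
    if (p == NULL)
      return -1;
    t->regions = p;
    t->capacity = new_cap;
  }
  t->regions[t->count].addr = addr;
  t->regions[t->count].len = len;
  return (int)t->count++;
}
```

The real fix also has to broadcast each new entry and keep the remote copies of the table in sync, which is where the broadcast-scaling work comes in.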
I've been doing some runs with both a uGNI performance analyzer I wrote in C and a couple of our existing RA tests with a modified statistics-collecting ugni comm layer. It looks like we should be able to do smallish PUTs (I tested 32-byte ones) 10x as fast as our current best rate for "fast" executeOns. Based on this, I'm going to rework the dynamic registration broadcast logic to PUT the registration updates directly to the remote nodes' registration maps rather than using executeOn. Note that getting this 10x improvement will require using uGNI's new PostFMA chained transaction feature, so this will require adding that support as well. (This will help with the "registration update broadcast logic needs to be scalable" item in the list of things to do for this issue.)
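To illustrate the direction (not the actual implementation; `simulated_put` below is a stand-in for a real uGNI FMA PUT, and all names are invented): each node keeps a mirror of every other node's registration map, and the owner publishes a new registration by PUTting the entry directly into each remote mirror instead of running an executeOn there.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NUM_NODES 4
#define MAX_ENTRIES 8

typedef struct { uint64_t addr, len; } reg_entry_t;

/* mirrors[n][owner][slot]: node n's copy of owner's registration map.
 * On real hardware each node would hold only its own slice, reachable
 * by one-sided PUTs. */
static reg_entry_t mirrors[NUM_NODES][NUM_NODES][MAX_ENTRIES];

/* Stand-in for a one-sided PUT; the real version would be a NIC FMA
 * operation, ideally chained so all targets are posted at once. */
static void simulated_put(void* remote_dst, const void* src, size_t n) {
  memcpy(remote_dst, src, n);
}

static void broadcast_registration(int owner, int slot, reg_entry_t e) {
  for (int node = 0; node < NUM_NODES; node++) {
    if (node == owner)
      mirrors[node][owner][slot] = e;   /* local update */
    else
      simulated_put(&mirrors[node][owner][slot], &e, sizeof(e));
  }
}
```

The point of the design is that the targets never run any code: the update lands in their map asynchronously, so the broadcast cost is bounded by PUT injection rate rather than remote task spawn time.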
Scalable broadcast is addressed by #7169. It's not the final word, but it should be better.
Here's some more info about allocating hugepages with `HUGETLB_NO_RESERVE=yes` and the `SIGBUS` we get when touching that memory, if we've run out.
The CLE change we expect will allow us to get rid of the `HUGETLB_NO_RESERVE` workaround is in CLE 6.0.UP05. We haven't had a chance to try this out yet. A useful doc about how to find out the CLE version number is here (this info is public). The executive summary is: look in `/etc/motd` or `/etc/opt/cray/release/clerelease`.
FYI on CLE6 UP05 (and UP06) I have verified that HUGETLB_NO_RESERVE is no longer needed.
Next step is probably to check whether setting HUGETLB_NO_RESERVE on UP05 can still result in a SIGBUS on touch, i.e. can we just always set HUGETLB_NO_RESERVE, or should we only set it for CLE5 and CLE6 < UP05 to avoid SIGBUS on newer OSes?
The SIGBUS-on-touch with `HUGETLB_NO_RESERVE` is a general kernel feature, not peculiar to CLE. With no-reserve the kernel can't know at allocation time whether there will be real memory at touch time (arbitrarily far in the future) to back the virtual memory you're acquiring. So when the touch finally happens and it discovers there isn't real memory to back the virtual page, a signal is the only way it can communicate with the process. `SIGBUS` was chosen because its general meaning is "no physical entity in the system responds to that address". So the `SIGBUS` will still happen with `HUGETLB_NO_RESERVE` in CLE 6.0.UP05 and beyond when we're out of memory.
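For reference, the "catch the SIGBUS and report a clean error" idea from the checklist could be sketched like this (a minimal illustration, not the runtime's actual handler; the message text is made up):

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* SIGBUS handler: the touch has already failed, so all we can do is
 * report and exit.  Only async-signal-safe calls are allowed here. */
static void sigbus_handler(int sig, siginfo_t* info, void* ctx) {
  (void)sig; (void)info; (void)ctx;
  static const char msg[] =
      "error: out of hugepage memory (SIGBUS on touch); "
      "try a smaller CHPL_RT_MAX_HEAP_SIZE\n";
  write(2, msg, sizeof(msg) - 1);
  _exit(1);
}

static void install_sigbus_handler(void) {
  struct sigaction sa;
  memset(&sa, 0, sizeof(sa));
  sa.sa_sigaction = sigbus_handler;
  sa.sa_flags = SA_SIGINFO;   /* three-argument handler form */
  sigaction(SIGBUS, &sa, NULL);
}
```

This trades a raw crash for a diagnosable out-of-memory message, which is about the best that can be done given the kernel's signal-only notification.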
Ok, makes sense. In that case we should figure out the best way to check for up05 (or whatever is the best way to check that the bug is fixed) and only set HUGETLB_NO_RESERVE when we need to.
My initial thought was to parse `/etc/opt/cray/release/CLEinfo` or `/etc/opt/cray/release/cle-release` like we do in `chpl_platform.py`, but I'm not sure if that's the best solution (or if the `RELEASE` format is reliable enough to parse).
Yeah, that was where I was leaning also. That seems to be the solution people use when they have OS-version-specific code. And we don't have to write something that will work forever; I think if we can't find those files or their format doesn't match what we expect we can just say "oh, the bug's not present here and we don't have to use NO_RESERVE".
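A sketch of such a check (the `RELEASE=6.0.UP05` line format below is an assumption to be verified against a real cle-release file, and `need_no_reserve` is an invented name), including the "format doesn't match means assume the bug is fixed" default:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Returns 1 if HUGETLB_NO_RESERVE is still needed (CLE older than
 * 6.0.UP05), 0 otherwise.  Unknown formats default to 0, per the
 * discussion: assume the bug isn't present rather than parse forever. */
static int need_no_reserve(const char* release_text) {
  int major = 0, minor = 0, update = 0;
  const char* p = strstr(release_text, "RELEASE=");
  if (p == NULL ||
      sscanf(p, "RELEASE=%d.%d.UP%d", &major, &minor, &update) != 3)
    return 0;                  /* unknown format: assume fixed */
  if (major != 6) return major < 6;   /* CLE5 needs it, CLE7+ doesn't */
  if (minor != 0) return 0;           /* 6.x, x > 0: newer than 6.0 */
  return update < 5;                  /* 6.0.UP05+ is fixed */
}
```

Taking the file contents as a string keeps the version comparison separable from the file I/O, so the parse logic itself is easy to test.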
Yeah, that's a good point.
Closing: ugni is effectively in maintenance mode at this point now that we're shifting our focus to EX, and I don't think we'll have much time or motivation to do additional feature development. For this specific issue, all the large items were taken care of and only smaller cleanups were left anyway.
As of PR #6947, we dynamically register large array memory with the Gemini/Aries NICs when using the ugni comm layer. This issue enumerates and tracks improvements and other follow-up work for that feature.
Some things we could do are:
- [x] Clean up the `array_alloc` interface. The "call me back" aspect of this functionality is an obvious hack.
- [x] Clarify the status and workability of any upcoming kernel fixes to the problem we're "fixing" by using `HUGETLB_NO_RESERVE=yes`.
- [x] When we're out of memory, the `HUGETLB_NO_RESERVE=yes` workaround causes us to succeed on the allocation but get a `SIGBUS` on the touch, when the kernel tries to actually cobble pages together. If that workaround needs to stay in place for any length of time, look into catching the SIGBUS and reporting a clean error message when we're out of memory.
- [ ] It may be desirable to allow users to enable/disable this feature by means of a config const, presumably in ChapelBase.
- [ ] The array size threshold is probably too big. It's currently `numNumaDomains * hugepageSize`. We could drop all the way down to 1 hugepage and still only be wasting 50% of the memory (on a 2-domain compute node). If we did this we would need to be able to handle allocation addresses and lengths that weren't hugepage-aligned, because we'd be returning array addresses that were in the interior of the hugepages we'd be getting from libhugetlbfs.
- [x] Make the max number of dynamic registrations tunable (for context, see https://github.com/chapel-lang/chapel/pull/6947#discussion_r131735802)
- [x] The comm=ugni registration update broadcast logic needs to be scalable.
- [x] Unify the default heap size for slurm and aprun (related to #5678)
- [x] We could expand this support to dynamically register everything, by reworking the jemalloc chunk allocation (and the cstdlib memory layer `morecore` hook?) to get memory from the comm layer by the new interface instead of the old pre-allocated fixed heap one. Then we could have NUMA-sensitivity for things like stacks and smaller heap-allocated data. And nobody would have to set `CHPL_RT_MAX_HEAP_SIZE` any more, at least not for this configuration.
- [ ] It might be nice to add a way to set "interleaved" memory binding for the hugepages, to improve performance for applications that are not sensitive to sublocale (NUMA) locality but can benefit from the added bandwidth when all the NUMA memory pathways are used.