StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu
Apache License 2.0

weirdness around NUMA memory allocation on different machines #1338

Open rohany opened 1 year ago

rohany commented 1 year ago

I've been seeing some small annoyances around using NUMA allocations on machines where I want to allocate relatively large amounts of NUMA memory. In particular, the NUMA API sometimes reports that there isn't enough NUMA memory available on a particular domain. For example, sapling (I believe) has 256 GB of host memory. I ran Legion with -ll:nsize 100000 and saw the following error:

[0 - 7fd2dbb67000]    0.000000 {4}{numa}: insufficient memory in NUMA node 0 (104857600000 > 93673009152 bytes) - skipping allocation

However, just doing -ll:csize 200000 worked fine. On Summit, the NUMA API seems to report varying amounts of memory and often fails with the same message when I request NUMA allocations that should add up to several gigabytes less than the total amount of memory on the machine (even accounting for space for the runtime). Overall, this is an annoying user experience, and I'm not sure what to do about it. Perhaps we should still allocate at least the amount of memory that the system thinks is available in a NUMA domain instead of failing outright?
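For reference, the numbers in the error message are consistent with -ll:nsize being interpreted as megabytes per NUMA node (an assumption based on the reported figures, not on Realm's documentation): 100000 MB converts exactly to the 104857600000 bytes in the message, which exceeds the ~87 GiB the runtime saw as available on node 0. A quick sketch of the arithmetic:

```python
# Sketch: check the arithmetic in the error message above.
# Assumption: -ll:nsize is given in MB and applied per NUMA node.
requested_mb = 100000
requested_bytes = requested_mb * 1024 * 1024   # 104857600000, matches the message
reported_avail = 93673009152                   # bytes Realm reported for node 0

shortfall_gib = (requested_bytes - reported_avail) / 1024**3
print(f"requested: {requested_bytes} bytes")
print(f"available: {reported_avail} bytes")
print(f"shortfall: {shortfall_gib:.1f} GiB")   # roughly 10.4 GiB short
```

So on a 256 GB machine split across two sockets, a 100 GB per-node request can plausibly fall just short of what one node has free.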

cc @streichler for thoughts.

manopapad commented 1 year ago

Does setting -ll:ncsize 0 help?

rohany commented 1 year ago

No, this memory is one of the "real" NUMA memories; its size just seems to be slightly smaller than expected. -ll:ncsize helps for some auxiliary NUMA memories that have something like 8 bytes in them.

rohany commented 1 year ago

Also, this error doesn't kill the program; it results in out-of-memory errors later when mappers try to allocate into SYS_MEM instead of SOCKET_MEM, which are somewhat confusing to debug until you realize that a NUMA allocation failed.

streichler commented 1 year ago

If you have numastat on that machine, can you run numastat -m and paste the output?

rohany commented 1 year ago

[\u@batch4.summit \W]\$ jsrun -n 1 -b none numastat -m

Per-node system memory usage (in MBs):
Token Node not in hash table.
Token Node not in hash table.
Token Node not in hash table.
Token Node not in hash table.
Token Node not in hash table.
Token Node not in hash table.
Token Node not in hash table.
Token Node not in hash table.
                          Node 0          Node 8        Node 250        Node 251        Node 252        Node 253        Node 254        Node 255           Total
                 --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- ---------------
MemTotal               257698.06       261723.56        16128.00        16128.00        16128.00        16128.00        16128.00        16128.00       616189.62
MemFree                219698.12       250992.88        16128.00        16128.00        16128.00        16128.00        16128.00        16128.00       567459.00
MemUsed                 37999.94        10730.69            0.00            0.00            0.00            0.00            0.00            0.00        48730.62
Active                   7797.56          330.75            0.00            0.00            0.00            0.00            0.00            0.00         8128.31
Inactive                 4650.94           17.62            0.00            0.00            0.00            0.00            0.00            0.00         4668.56
Active(anon)             7714.12          292.44            0.00            0.00            0.00            0.00            0.00            0.00         8006.56
Inactive(anon)            710.31           10.19            0.00            0.00            0.00            0.00            0.00            0.00          720.50
Active(file)               83.44           38.31            0.00            0.00            0.00            0.00            0.00            0.00          121.75
Inactive(file)           3940.62            7.44            0.00            0.00            0.00            0.00            0.00            0.00         3948.06
Unevictable              8192.00         8192.00            0.00            0.00            0.00            0.00            0.00            0.00        16384.00
Mlocked                  8192.00         8192.00            0.00            0.00            0.00            0.00            0.00            0.00        16384.00
Dirty                       0.00            0.00            0.00            0.00            0.00            0.00            0.00            0.00            0.00
Writeback                   0.00            0.00            0.00            0.00            0.00            0.00            0.00            0.00            0.00
FilePages               11686.19           56.62            0.00            0.00            0.00            0.00            0.00            0.00        11742.81
Mapped                    325.88          187.50            0.00            0.00            0.00            0.00            0.00            0.00          513.38
AnonPages                8955.50         8483.75            0.00            0.00            0.00            0.00            0.00            0.00        17439.25
Shmem                    7661.56           10.88            0.00            0.00            0.00            0.00            0.00            0.00         7672.44
KernelStack                25.75           18.22            0.00            0.00            0.00            0.00            0.00            0.00           43.97
PageTables                  6.06            5.75            0.00            0.00            0.00            0.00            0.00            0.00           11.81
NFS_Unstable                0.00            0.00            0.00            0.00            0.00            0.00            0.00            0.00            0.00
Bounce                      0.00            0.00            0.00            0.00            0.00            0.00            0.00            0.00            0.00
WritebackTmp                0.00            0.00            0.00            0.00            0.00            0.00            0.00            0.00            0.00
Slab                     2634.94         1010.44            0.00            0.00            0.00            0.00            0.00            0.00         3645.38
SReclaimable              503.00           75.50            0.00            0.00            0.00            0.00            0.00            0.00          578.50
SUnreclaim               2131.94          934.94            0.00            0.00            0.00            0.00            0.00            0.00         3066.88
AnonHugePages               0.00            0.00            0.00            0.00            0.00            0.00            0.00            0.00            0.00
ShmemHugePages              0.00            0.00            0.00            0.00            0.00            0.00            0.00            0.00            0.00
ShmemPmdMapped              0.00            0.00            0.00            0.00            0.00            0.00            0.00            0.00            0.00
HugePages_Total             0.00            0.00            0.00            0.00            0.00            0.00            0.00            0.00            0.00
HugePages_Free              0.00            0.00            0.00            0.00            0.00            0.00            0.00            0.00            0.00
HugePages_Surp              0.00            0.00            0.00            0.00            0.00            0.00            0.00            0.00            0.00

Seems like one of the NUMA memories generally has less free memory than the other.

manopapad commented 1 year ago

Can you also report the output from numactl --hardware ?

rohany commented 1 year ago

I don't know why node 0 and node 8 have so much memory in use -- I requested an allocation and only ran numactl --hardware on it.

available: 6 nodes (0,8,252-255)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
node 0 size: 260788 MB
node 0 free: 193331 MB
node 8 cpus: 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127
node 8 size: 261747 MB
node 8 free: 213069 MB
node 252 cpus:
node 252 size: 16128 MB
node 252 free: 16108 MB
node 253 cpus:
node 253 size: 16128 MB
node 253 free: 16123 MB
node 254 cpus:
node 254 size: 16128 MB
node 254 free: 16108 MB
node 255 cpus:
node 255 size: 16128 MB
node 255 free: 16123 MB
node distances:
node   0   8  252  253  254  255
  0:  10  40  80  80  80  80
  8:  40  10  80  80  80  80
 252:  80  80  10  80  80  80
 253:  80  80  80  10  80  80
 254:  80  80  80  80  10  80
 255:  80  80  80  80  80  10
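To pick a safe -ll:nsize, one rough approach (a hypothetical helper, not part of Legion) is to parse the "free" lines of the numactl --hardware output above and bound the per-node request by the smallest free value among the CPU-bearing nodes:

```python
import re

# Hypothetical helper: extract free MB per node from `numactl --hardware`
# output, so -ll:nsize can be chosen below the minimum free value.
SAMPLE = """\
node 0 size: 260788 MB
node 0 free: 193331 MB
node 8 size: 261747 MB
node 8 free: 213069 MB
"""

def free_mb_per_node(text):
    """Return {node_id: free_mb} parsed from numactl --hardware output."""
    return {int(node): int(mb)
            for node, mb in re.findall(r"node (\d+) free: (\d+) MB", text)}

free = free_mb_per_node(SAMPLE)
print(free)                # {0: 193331, 8: 213069}
print(min(free.values()))  # 193331 -> rough upper bound for -ll:nsize
```

This only accounts for memory free at query time; whatever headroom the runtime and OS need would have to be subtracted as well.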

manopapad commented 1 year ago

Might be something to ask the cluster admins about.