chapel-lang / chapel

a Productive Parallel Programming Language
https://chapel-lang.org

[Bug]: jemalloc issues with lots of cores (HPE Superdome) #24736

Open hpcpony opened 3 months ago

hpcpony commented 3 months ago

I'm seeing problems with Chapel/jemalloc on a machine with lots of cores (an HPE Superdome with 1568 cores, including hyperthreading). I know this is kind of an unusual case, but I thought I'd mention it.

[host7:Chapel] uname -a
Linux host7 5.14.0-362.24.1.el9_3.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Feb 15 07:18:13 EST 2024 x86_64 x86_64 x86_64 GNU/Linux

[host7:Chapel] /opt/CHAPEL/chapel-2.0.0_host7/util/printchplenv
machine info: Linux host7 5.14.0-362.24.1.el9_3.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Feb 15 07:18:13 EST 2024 x86_64
CHPL_HOME: /opt/CHAPEL/chapel-2.0.0_host7 *
script location: /opt/CHAPEL/chapel-2.0.0_host7/util/chplenv
CHPL_TARGET_PLATFORM: linux64
CHPL_TARGET_COMPILER: llvm
CHPL_TARGET_ARCH: x86_64
CHPL_TARGET_CPU: native +
CHPL_LOCALE_MODEL: flat
CHPL_COMM: gasnet +
  CHPL_COMM_SUBSTRATE: smp +
  CHPL_GASNET_SEGMENT: fast +
CHPL_TASKS: qthreads
CHPL_LAUNCHER: smp
CHPL_TIMERS: generic
CHPL_UNWIND: none
CHPL_MEM: jemalloc
CHPL_ATOMICS: cstdlib
  CHPL_NETWORK_ATOMICS: none
CHPL_GMP: bundled +
CHPL_HWLOC: bundled
CHPL_RE2: bundled +
CHPL_LLVM: bundled +
CHPL_AUX_FILESYS: none

[host7:Chapel] chpl --version
<jemalloc>: Reducing narenas to limit (4094)
chpl version 2.0.0
  built with LLVM version 17.0.6
  available LLVM targets: x86-64, x86
Copyright 2020-2024 Hewlett Packard Enterprise Development LP
Copyright 2004-2019 Cray Inc.
(See LICENSE file for more details)

Test case:

[host7:Chapel] cat x.chpl
writeln("Hello");

[host7:Chapel] chpl x.chpl
<jemalloc>: Reducing narenas to limit (4094)
<jemalloc>: Reducing narenas to limit (4094)
<jemalloc>: Reducing narenas to limit (4094)

[host7:Chapel] ./x -nl 2
<jemalloc>: Reducing narenas to limit (4094)
internal error: could not change current thread's arena
internal error: could not change current thread's arena

Poking around, it looks like there are a number of ways to limit the number of arenas. I'm not sure it's any better than just letting jemalloc reduce the count to 4094, but at least it's explicit.

[host7:Chapel] export MALLOC_CONF='narenas:2048'
[host7:Chapel] chpl x.chpl

[host7:Chapel] ./x -nl 2
<jemalloc>: Reducing narenas to limit (4094)
internal error: could not change current thread's arena
internal error: could not change current thread's arena

Problem 1 is that jemalloc doesn't seem to know how to deal with this many cores. I think the default is to create narenas = 4 x cores (6272 in my case), but it looks like there are only 12 bits available internally for arena indices.

.../third-party/jemalloc/jemalloc-src/include/jemalloc/internal/jemalloc_internal.h.in:#define MALLOCX_ARENA_MAX 0xffe

Poking around in jemalloc 5.3, I think it still has this limitation.

You can apparently reduce the number of arenas (as I did above), but I'm unclear on whether that's a reasonable thing to do.
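
To spell out the arithmetic behind Problem 1 (assuming jemalloc's default of 4 arenas per CPU):

4 arenas/CPU x 1568 CPUs = 6272 requested arenas
MALLOCX_ARENA_MAX = 0xffe = 4094

so the request gets clamped, which is exactly the "Reducing narenas to limit (4094)" message.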

Problem 2 is that if you do reduce the number of arenas, it appears to fix jemalloc's complaint, but there's still something in the Chapel runtime that doesn't work quite right. Interestingly, if I build a Chapel application on a different machine (using a Chapel compiler built on that other machine), I can run it on the machine with lots of cores with no "Reducing" warnings and no "internal error:".

We're still mostly experimenting with Chapel, so prioritize as appropriate.

lydia-duncan commented 3 months ago

I don't have the expertise to resolve this issue myself, but I suspect a question we'll want answered is: what is the output of printchplenv on the system where you built the application that worked when migrated to the big machine? I do suspect the problem is specific to jemalloc, but it's odd that a migrated program works just fine (so maybe there are built-in limitations that would be apparent from printchplenv, or maybe not).

bradcray commented 3 months ago

@hpcpony: Something you might experiment with is using the C-based allocation layer for Chapel's memory needs:

export CHPL_MEM=cstdlib
export CHPL_HOST_MEM=cstdlib

This would obviously avoid any jemalloc-specific limits or behaviors, though we don't have enough experience with Chapel on Superdome to know whether this will be a net win or loss.
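
For a from-source build, that switch would look something like the following (a rough sketch; the runtime needs to be rebuilt for the new CHPL_MEM setting to take effect):

cd $CHPL_HOME
export CHPL_MEM=cstdlib
export CHPL_HOST_MEM=cstdlib
make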

Since you mention Superdome, I'll mention that @jhh67 has recently been developing co-locales, in which multiple locales are mapped to a given compute node. E.g., on a compute node with two sockets or two NICs, you could run with -nl 2x2 in order to run 2 locales (processes) per node and have them divide up the available hardware reasonably. I've been imagining that co-locales would be a good match for Superdome, but to my knowledge none of us has had the opportunity to give them a try yet (and if your use case is a good motivator for us to do so, please let us know).

If it did work, it seems as though running enough co-locales that each one's default arena count (4 x its cores) stayed below the 4094 limit should make the warning, and any potential downsides it might involve, go away. For more information on co-locales, please see: https://chapel-lang.org/docs/usingchapel/multilocale.html#co-locales.

bradcray commented 3 months ago

@hpcpony : This is off-topic for your issue, but it's the only way I know of to reach you. I wanted to check in and make sure that you've heard about ChapelCon'24 (https://chapel-lang.org/ChapelCon24.html), and to encourage you to submit a talk or demo if you have work that you'd like to share with the community. For example, just thinking about this issue, "early experiences with Chapel on Superdome" is something that definitely hasn't been covered there before. The deadline is coming up fast (April 12), but the submission is designed to be lightweight (a 1-page max abstract). Even if you don't submit something, we hope to "see" you there, virtually. Best wishes, -Brad

hpcpony commented 3 months ago

I think we're a long way from having something to present, but it sounds interesting. ;^)

I did some more experiments. I rebuilt Chapel on the Superdome with cstdlib instead of jemalloc. The build went fine, and I can now compile and run things on the Superdome, but it doesn't seem to handle SMP locales. Here's the little test program I'm using.

writeln("numLocales ", numLocales);
for loc in Locales do
  on loc do {
    writeln("[",loc,"] here.id: ", here.id);
    writeln("[",loc,"] here.name: ", here.name);
  }

Compile and run on the superdome (cstdlib):

[superdome:Chapel] which chpl
/opt/CHAPEL/chapel-2.0.0_superdome/bin/linux64-x86_64/chpl
[superdome:Chapel] chpl --version
chpl version 2.0.0
  built with LLVM version 17.0.6
  available LLVM targets: x86-64, x86
Copyright 2020-2024 Hewlett Packard Enterprise Development LP
Copyright 2004-2019 Cray Inc.
(See LICENSE file for more details)
[superdome:Chapel] chpl here.chpl
[superdome:Chapel] ./here -nl 3
numLocales 3
[LOCALE0] here.id: 0
[LOCALE0] here.name: superdome-0
[LOCALE0] here.id: 0
[LOCALE0] here.name: superdome-0
[LOCALE0] here.id: 0
[LOCALE0] here.name: superdome-0

The printchplenv output for the Superdome version is:

CHPL_TARGET_PLATFORM: linux64
CHPL_TARGET_COMPILER: llvm
CHPL_TARGET_ARCH: x86_64
CHPL_TARGET_CPU: native +              <- superdome is a cascade lake chip
CHPL_LOCALE_MODEL: flat
CHPL_COMM: gasnet +
  CHPL_COMM_SUBSTRATE: smp +
  CHPL_GASNET_SEGMENT: everything +    <- there was documentation that said this was necessary
CHPL_TASKS: qthreads
CHPL_LAUNCHER: smp
CHPL_TIMERS: generic
CHPL_UNWIND: none
CHPL_MEM: cstdlib +
CHPL_ATOMICS: cstdlib
  CHPL_NETWORK_ATOMICS: none
CHPL_GMP: bundled +
CHPL_HWLOC: bundled
CHPL_RE2: bundled +
CHPL_LLVM: bundled +
CHPL_AUX_FILESYS: none

printchplenv doesn't show it, but I also set CHPL_TARGET_MEM=cstdlib and CHPL_HOST_MEM=cstdlib before the build.

The really interesting part is that I built another completely separate version of Chapel on an old CentOS 7 box (using jemalloc).

CHPL_TARGET_PLATFORM: linux64
CHPL_TARGET_COMPILER: llvm
CHPL_TARGET_ARCH: x86_64
CHPL_TARGET_CPU: unknown +             <- centos7 is a Xeon E5-2637
CHPL_LOCALE_MODEL: flat
CHPL_COMM: gasnet +
  CHPL_COMM_SUBSTRATE: smp +
  CHPL_GASNET_SEGMENT: fast +
CHPL_TASKS: qthreads
CHPL_LAUNCHER: smp
CHPL_TIMERS: generic
CHPL_UNWIND: none
CHPL_MEM: jemalloc +
CHPL_ATOMICS: cstdlib
  CHPL_NETWORK_ATOMICS: none
CHPL_GMP: bundled +
CHPL_HWLOC: bundled
CHPL_RE2: bundled +
CHPL_LLVM: bundled +
CHPL_AUX_FILESYS: none

Doing the same thing over there, my little program compiles and runs correctly. And if I take the executable compiled on CentOS 7 and run it on the Superdome, it now works fine:

[superdome:Chapel] ./here -nl 3
numLocales 3
[LOCALE0] here.id: 0
[LOCALE0] here.name: superdome-0
[LOCALE1] here.id: 1
[LOCALE1] here.name: superdome-1
[LOCALE2] here.id: 2
[LOCALE2] here.name: superdome-2

Additionally, if I use the compiler built on CentOS 7 with jemalloc (they're all NFS-mounted across machines) to build my test application on the Superdome, the result also works.

[superdome:Chapel] which chpl
/opt/CHAPEL/chapel-2.0.0_centos7/bin/linux64-x86_64/chpl
[superdome:Chapel] chpl --version
chpl version 2.0.0
  built with LLVM version 17.0.6
  available LLVM targets: x86-64, x86
Copyright 2020-2024 Hewlett Packard Enterprise Development LP
Copyright 2004-2019 Cray Inc.
(See LICENSE file for more details)
[superdome:Chapel] chpl here.chpl
[superdome:Chapel] ./here -nl 3
numLocales 3
[LOCALE0] here.id: 0
[LOCALE0] here.name: superdome-0
[LOCALE1] here.id: 1
[LOCALE1] here.name: superdome-1
[LOCALE2] here.id: 2
[LOCALE2] here.name: superdome-2

I can't compile on the Superdome and run on CentOS 7, because the Superdome runs RHEL 9 and the chpl compiler built there depends on a newer glibc.

We've certainly got enough working options to do the experimentation we want, but this is just one of those things that seems pretty weird and counterintuitive, so I thought I'd flesh it out.

P.S. I will point out that the one local hack I make is to change third-party/llvm/Makefile and remove "AArch64;NVPTX;AMDGPU", but I did that for both builds, so it seems benign.

bradcray commented 3 months ago

@hpcpony : That is puzzling to me, and I don't have an offhand guess as to why you're seeing that difference. I believe that @jhh67 is the expert on how we come up with these locale name strings. John, do you have any insight into why Chapel's locale names are distinct in the one case (superdome-0, superdome-1, etc.) yet not in the other? I'm not coming up with any guesses...

@hpcpony: The thing I would be most interested in is what sort of behavior you get if you run a configuration (using the version compiled natively for the Superdome) more like -nl 1x4, -nl 1x4socket, or -nl 1x4llc. These use the co-locale feature I mentioned earlier and say to run on a single node (treating the entire Superdome as one big fat node, which from Chapel's perspective it essentially is): 4 co-locales in the first instance (the runtime gets to choose where they go and how big they are); 4 co-locales, with one per socket, in the second; and 4 co-locales, with one per last-level cache (LLC), in the third. IIRC, simply running -nl 4 (say) on a shared-memory system like Superdome will not do anything particularly smart about carving up the system resources between the locales in a locality-sensitive way, whereas the co-locale invocations will do a lot better in terms of sharing and dividing up the resources reasonably (though, to my knowledge, we haven't tried this feature on Superdome yet, so I'm essentially asking you to be the guinea pig here).

I think you're also saying that when you do the compile on CentOS 7 and run on the Superdome, you're no longer seeing the jemalloc issues? I'm not quite sure what to make of that one either, but I'm noticing for the first time that you were getting jemalloc warnings not just at execution time, but also at compile time in the transcript above:

[host7:Chapel] chpl x.chpl
<jemalloc>: Reducing narenas to limit (4094)
<jemalloc>: Reducing narenas to limit (4094)
<jemalloc>: Reducing narenas to limit (4094)

That suggests that jemalloc is doing some sort of check of the system's resources at compile time, which I would not have expected. But if that's the case, it would make sense that compiling on a smaller system might silence the issue, potentially by embedding some other (too-small?) value into the binary?

hpcpony commented 3 months ago

Here ya go....

[superdome:Chapel] which chpl
/opt/CHAPEL/chapel-2.0.0_superdome/bin/linux64-x86_64/chpl
[superdome:Chapel] cat here.chpl
writeln("numLocales ", numLocales);
for loc in Locales do
  on loc do {
    writeln("[",loc,"] here.id: ", here.id);
    writeln("[",loc,"] here.name: ", here.name);
  }
[superdome:Chapel] chpl here.chpl
[superdome:Chapel] ./here -nl 3
numLocales 3
[LOCALE0] here.id: 0
[LOCALE0] here.name: superdome-0
[LOCALE0] here.id: 0
[LOCALE0] here.name: superdome-0
[LOCALE0] here.id: 0
[LOCALE0] here.name: superdome-0
[superdome:Chapel] ./here -nl 1x4
numLocales 4
[LOCALE0] here.id: 0
[LOCALE0] here.name: superdome-0
[LOCALE0] here.id: 0
[LOCALE0] here.name: superdome-0
[LOCALE0] here.id: 0
[LOCALE0] here.name: superdome-0
[LOCALE0] here.id: 0
[LOCALE0] here.name: superdome-0
[superdome:Chapel] ./here -nl 1x4socket
warning: 756 cores are unused
numLocales 4
[LOCALE0] here.id: 0
[LOCALE0] here.name: superdome-0
[LOCALE0] here.id: 0
[LOCALE0] here.name: superdome-0
[LOCALE0] here.id: 0
[LOCALE0] here.name: superdome-0
[LOCALE0] here.id: 0
[LOCALE0] here.name: superdome-0
[superdome:Chapel] ./here -nl 1x4llc
warning: 756 cores are unused
numLocales 4
[LOCALE0] here.id: 0
[LOCALE0] here.name: superdome-0
[LOCALE0] here.id: 0
[LOCALE0] here.name: superdome-0
[LOCALE0] here.id: 0
[LOCALE0] here.name: superdome-0
[LOCALE0] here.id: 0
[LOCALE0] here.name: superdome-0
jhh67 commented 3 months ago

I'm not that familiar with the GASNet smp conduit, but looking through the source code, it appears that you need either CHPL_GASNET_SEGMENT=fast or CHPL_GASNET_SEGMENT=large. I think it worked on the CentOS machine because you have CHPL_GASNET_SEGMENT=fast, and it didn't work on the Superdome machine because you have CHPL_GASNET_SEGMENT=everything. I don't know why these settings are necessary, or why an incorrect configuration doesn't trigger an error. Please give this a try and let me know whether or not it works.
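
Concretely, that would be something like this (a sketch, assuming a from-source build; the runtime needs to be rebuilt after changing the segment setting):

export CHPL_GASNET_SEGMENT=fast
cd $CHPL_HOME && make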

hpcpony commented 3 months ago

I had gone with CHPL_GASNET_SEGMENT=everything because I happened to stumble across the "Note" relating to CHPL_MEM (https://chapel-lang.org/docs/usingchapel/chplenv.html#chpl-mem):

Certain CHPL_COMM settings (e.g. ugni, gasnet segment fast/large, ofi with the gni provider) register the heap to improve communication performance. Registering the heap requires special allocator support that not all allocators provide. Currently only jemalloc is capable of supporting configurations that require a registered heap.

Since I wasn't using jemalloc, I figured fast/large were not valid.

jhh67 commented 3 months ago

That is a good point; I had forgotten about that limitation. That combination of settings should probably cause an error message.

At this point, maybe the best option is to go back to using jemalloc with CHPL_GASNET_SEGMENT=fast or CHPL_GASNET_SEGMENT=large, and use co-locales to limit the number of cores per locale. You can use the syntax -nl NxL, where N is the number of nodes and L is the number of locales per node. Choose L so that the number of cores per locale stays below the jemalloc arena limit.
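
As a rough worked example (assuming the 1568 cores from above and jemalloc's default of 4 arenas per core), two co-locales on the single "node" should already be enough, after rebuilding with these settings:

export CHPL_MEM=jemalloc
export CHPL_GASNET_SEGMENT=fast
./here -nl 1x2    # ~784 cores per locale -> 4 x 784 = 3136 arenas < 4094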

jhh67 commented 3 months ago

If you try to run a program compiled with CHPL_MEM=cstdlib and CHPL_GASNET_SEGMENT=fast, you get a runtime error:

error: Your CHPL_MEM setting doesn't support the registered heap required by your CHPL_COMM setting. You'll need to change one of these configurations.

But it would be nice if you got the error when trying to build that configuration, or maybe even a warning from printchplenv.