charmplusplus / charm

The Charm++ parallel programming system. Visit https://charmplusplus.org/ for more information.
Apache License 2.0

multicore-darwin-arm8 exhibits poor scaling #3469

Open geoffrey4444 opened 3 years ago

geoffrey4444 commented 3 years ago

Hi,

I'm one of the developers of the numerical-relativity code spectre (https://spectre-code.org/), which relies on charm++. We support macOS, but I'm the first to try building it on an M1 (Apple Silicon, arm64) Mac. When I tried building charm++ 7.0.0-rc1 with ./build charm++ multicore-darwin -j16 --with-production, as smart-build.pl suggested, I got an error that multicore-darwin is not supported.

I wonder if there might be a way forward to attempt to build charm++ on an M1 Mac? I'd be grateful for any hints or advice you might be able to give me!

evan-charmworks commented 3 years ago

Hi Geoffrey,

Please try the build triplet multicore-darwin-arm8 to build as ARM64.
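
For example, with the same flags as before (adjust -j as desired):

# build Charm++ for Apple Silicon using the multicore-darwin-arm8 triplet
./build charm++ multicore-darwin-arm8 -j16 --with-production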

Also, thanks for mentioning smart-build.pl. If it recommended multicore-darwin then that is a bug we need to fix.

evan-charmworks commented 3 years ago

I've opened #3470 to resolve the smart-build issue.

geoffrey4444 commented 2 years ago

Thanks! I got charm++ to build. However, I have a follow-up question: how do I ensure that macOS runs my job (launched with charmrun) on the faster performance cores instead of on the slower efficiency cores? Is this the default if I just ask for four cores?

geoffrey4444 commented 2 years ago

Please disregard my previous question: I've verified with Activity Monitor that when I run with +p4 it runs on four performance cores.
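
(A quick way to see the performance/efficiency core split from the shell, as a sketch -- the hw.perflevel* sysctl keys are exposed on recent macOS releases on Apple Silicon:)

# perflevel0 = performance cores, perflevel1 = efficiency cores
sysctl -a | grep hw.perflevel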

I have noticed, however, that when benchmarking spectre (I'm part of the team developing spectre, which relies on charm++ for its parallelization), performance actually decreases on the M1 Mac as I increase the number of processor cores. Single-core performance is comparable, but as I increase the number of cores, the time to complete my sample job increases. This is not the case on my Intel Mac, where performance scales roughly as I'd expect with increasing numbers of cores (e.g., the walltime decreases by about a factor of two when going from 1 to 2 cores).
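
The comparison above was of this form (a sketch; the executable and input-file names are placeholders for the SpECTRE test case described later in this thread):

# placeholder executable/input names; time the same job at several core counts
for p in 1 2 4; do
    /usr/bin/time ./MySpectreExecutable +p$p --input-file=Input.yaml
done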

I wonder if you have any advice about performance on M1 Macs or arm8 when running on multiple cores?

evan-charmworks commented 2 years ago

I'm glad the build is working for you. Our experience with ARM64 until now has mostly been with low-power boards such as Raspberry Pi, so it is likely that Charm++ itself needs some tuning for within-node performance on the M1. If possible, could you share the steps needed to run a sample job with SpECTRE so that we can analyze it directly?

geoffrey4444 commented 2 years ago

Hi @evan-charmworks! Thank you very much for being willing to help me debug what's happening!

Below are detailed instructions on how to build the spectre executable that I'm using for my tests: it evolves a single black hole for a few timesteps. When I run on one core on an M1 iMac, the walltime is 17.4 seconds, but running on 4 cores, the walltime is 113.7 seconds.

If you have any insights on why this might be, they would be most welcome! Please also let me know if any of these steps don't work for you. I just ran through them myself to check them, but it's possible I made a mistake or missed something specific to the machine I'm using.

How to install and run Spectre on Apple Silicon

Here are steps to build a spectre test executable on an Apple Silicon Mac. These steps worked for me on a brand new Apple Silicon iMac. Please let @geoffrey4444 know if you have trouble.

0. Install the Xcode command-line tools

Install the Xcode command-line tools, which include the clang compiler, etc.

xcode-select --install

1. Make a directory to install prereqs

First, make a directory to hold some prerequisites that spectre depends on. Name this directory whatever you like; I chose $HOME/apps.

export SPECTRE_DEPS_ROOT=$HOME/apps
mkdir $SPECTRE_DEPS_ROOT
cd $SPECTRE_DEPS_ROOT
mkdir src
cd src

2. Install python dependencies

Spectre depends on python and some python packages. Miniforge is a way to install an arm64-native python stack.

curl -L https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-MacOSX-arm64.sh > Miniforge3-MacOSX-arm64.sh

# Install miniforge. Accept all default choices
bash Miniforge3-MacOSX-arm64.sh
# Activate conda base environment for current session
eval "$(/Users/gwpacworkstation/miniforge3/bin/conda shell.zsh hook)"
# Activate conda at startup (to make sure you're always using native python)
conda init zsh
exit # close shell. conda will automatically load when new zsh shells start.

In a new shell (new terminal window), do the following to get back to where we were:

export SPECTRE_DEPS_ROOT=$HOME/apps
cd $SPECTRE_DEPS_ROOT/src

Now, install the necessary python packages. (Note: jupyter and notebook are not necessary to install, but I prefer installing them any time I'm setting up a new machine.)

conda install numpy scipy matplotlib h5py jupyter notebook

3. Install dependencies with Homebrew

Most of spectre's dependencies beyond python can be installed using the homebrew package manager.

First, install Homebrew:

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

Then, run the following to install a fortran compiler and other dependencies:

brew install gcc
brew install openblas boost gsl cmake
brew install ccache autoconf automake catch2 jemalloc hdf5 pybind11 yaml-cpp

4. Install remaining dependencies

Here, we'll install the remaining dependencies other than charm++.

cd $SPECTRE_DEPS_ROOT
mkdir blaze
pushd blaze
curl -L https://bitbucket.org/blaze-lib/blaze/downloads/blaze-3.8.tar.gz > blaze-3.8.tar.gz
tar -xf blaze-3.8.tar.gz
mv blaze-3.8 include
popd

git clone https://github.com/edouarda/brigand.git

# Need master branch of libxsmm to support Apple Silicon
git clone https://github.com/hfp/libxsmm.git
pushd libxsmm
make
popd

pushd ./src
git clone https://github.com/Libsharp/libsharp.git
cd libsharp

# Do not use compiler flag -march=native (unsupported on Apple Silicon)
sed "s/-march=native//" configure.ac > configure.ac.mod
mv configure.ac.mod configure.ac

autoupdate
autoconf
./configure
make
mv auto $SPECTRE_DEPS_ROOT/libsharp
popd

Next, install charm++:

git clone https://github.com/UIUC-PPL/charm
pushd charm
git checkout v7.0.0-rc1
./build charm++ multicore-darwin-arm8 -j8 --with-production
popd

5. Build a test spectre executable

You can install spectre anywhere; I chose the directory $HOME/spectre.

cd $HOME
git clone https://github.com/sxs-collaboration/spectre.git
cd spectre

Next, patch spectre with some changes that are necessary to get it to compile on Apple Silicon but are not yet part of develop.

git remote add geoffrey4444 https://github.com/geoffrey4444/spectre.git
git fetch geoffrey4444
git checkout -b AppleSilicon
git cherry-pick 9f0de8d03019b16291a2578e39ca1ab3bdafedea
git cherry-pick e6f5643f9b9f1ec9cb717364759a6c5ddb679612

Finally, configure and build spectre.

mkdir build
cd build

cmake \
    -D CMAKE_C_COMPILER=clang \
    -D CMAKE_CXX_COMPILER=clang++ \
    -D CMAKE_Fortran_COMPILER=gfortran \
    -D BUILD_PYTHON_BINDINGS=OFF \
    -D MEMORY_ALLOCATOR=SYSTEM \
    -D CHARM_ROOT=${SPECTRE_DEPS_ROOT}/charm/multicore-darwin-arm8 \
    -DBLAS_ROOT=$(brew --prefix openblas) \
    -DCMAKE_BUILD_TYPE=Release \
    -DDEBUG_SYMBOLS=OFF \
    -DUSE_PCH=ON \
    -DSPECTRE_UNIT_TEST_TIMEOUT_FACTOR=5 \
    -DSPECTRE_INPUT_FILE_TEST_TIMEOUT_FACTOR=5 \
    -DSPECTRE_PYTHON_TEST_TIMEOUT_FACTOR=5 \
    -DLIBXSMM_ROOT=${SPECTRE_DEPS_ROOT}/libxsmm/ \
    -DBLAZE_ROOT=${SPECTRE_DEPS_ROOT}/blaze/ \
    -DBRIGAND_ROOT=${SPECTRE_DEPS_ROOT}/brigand/ \
    -DLIBSHARP_INCLUDE_DIRS=${SPECTRE_DEPS_ROOT}/libsharp/include \
    -DLIBSHARP_ROOT=${SPECTRE_DEPS_ROOT}/libsharp/ \
    -DBoost_INCLUDE_DIRS=$(brew --prefix boost)/include/ \
    -DBoost_LIBRARIES=$(brew --prefix boost)/lib/ \
    -DBoost_ROOT=$(brew --prefix boost)/ \
    -DMACOSX_MIN=11.5 \
    -DGSL_ROOT=$(brew --prefix gsl)/include \
    -DGSL_LIBRARY=$(brew --prefix gsl)/lib/libgsl.a \
    -DBUILD_SHARED_LIBS=OFF \
    -D OVERRIDE_ARCH=apple_silicon \
    -DMACOS_SYS_LIB_ROOT=$(xcrun -sdk macosx --show-sdk-path) \
    ..

# I'm pretty sure the warnings you get when building are safe to ignore
# This builds an executable that evolves a single black hole
make -j8 EvolveGhKerrSchild

6. Run the test executable

Inside the spectre build directory, make a directory to hold some test simulations.

mkdir test_simulations
cd test_simulations
cp ../../tests/InputFiles/GeneralizedHarmonic/KerrSchild.yaml .

Edit KerrSchild.yaml. Change the Evolution: block to

Evolution:
  InitialTime: 0.0
  InitialTimeStep: 0.01
  TimeStepper: DormandPrince5

and change these options under DomainCreator:Shell as follows:

    InitialRefinement: 2
    InitialGridPoints: [9, 9]

Then, run on one core with

../bin/EvolveGhKerrSchild --input-file=KerrSchild.yaml

To run on 4 cores, delete the output via rm ./*h5 and then do

../bin/EvolveGhKerrSchild +p4 --input-file=KerrSchild.yaml

evan-charmworks commented 2 years ago

Thank you @geoffrey4444 for the detailed and comprehensive instructions. We will examine what is going on. Your directions will also be helpful for adding SpECTRE to Charm's continuous integration testing.

geoffrey4444 commented 2 years ago

> Thank you @geoffrey4444 for the detailed and comprehensive instructions. We will examine what is going on. Your directions will also be helpful for adding SpECTRE to Charm's continuous integration testing.

That sounds great, @evan-charmworks ! Thank you so much for taking a look...I really appreciate it!

stwhite91 commented 2 years ago

I think we discussed in Core a couple weeks ago that one culprit in the poor scaling here might be the fact that darwin builds define the macro #define CMK_NOT_USE_TLS_THREAD 1 in the conv-mach.h file. This causes Cpv variables to be implemented using an array rather than TLS variables, which may be susceptible to false sharing issues in a multicore build such as this. We could try defining that to 0 and rebuilding Charm++ and Spectre. It may cause other issues, but it's worth a try.
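
A minimal sketch of that experiment (the macro and file are as named above; the exact line and surrounding whitespace may differ between Charm++ versions, so verify the definition before editing):

# flip CMK_NOT_USE_TLS_THREAD from 1 to 0 in the arch header, then rebuild
cd charm
sed -i '' 's/#define CMK_NOT_USE_TLS_THREAD *1/#define CMK_NOT_USE_TLS_THREAD 0/' \
    src/arch/multicore-darwin-arm8/conv-mach.h
./build charm++ multicore-darwin-arm8 -j8 --with-production
# then reconfigure and rebuild SpECTRE against the rebuilt Charm++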

geoffrey4444 commented 2 years ago

@stwhite91 please accept my apologies for taking so long to reply! I got swamped with a grant proposal and other end-of-semester work, and I've only now been able to try your idea.

I changed src/arch/multicore-darwin-arm8/conv-mach.h:64 in v7.0.0 of charm++ from 1 to 0 as you suggested, rebuilt charm, and rebuilt spectre. However, I still have the same problem: I see basically no change in performance when running on multiple cores. Runs with +p1, +p2, +p4, +p8, and +ppn 7 +pemap 0-6 +commap 7 all fail to show the expected scaling. I would have expected 8 cores to give at least a 7x speedup (assuming one core is stuck doing communication), but I see essentially zero difference in runtime.
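
For concreteness, the runs referenced above were invocations of the form (executable and input file as in the build instructions earlier in this thread; flag combinations exactly as listed above):

# run from the test_simulations directory created earlier
../bin/EvolveGhKerrSchild +p1 --input-file=KerrSchild.yaml
../bin/EvolveGhKerrSchild +p2 --input-file=KerrSchild.yaml
../bin/EvolveGhKerrSchild +p4 --input-file=KerrSchild.yaml
../bin/EvolveGhKerrSchild +p8 --input-file=KerrSchild.yaml
../bin/EvolveGhKerrSchild +ppn 7 +pemap 0-6 +commap 7 --input-file=KerrSchild.yaml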

I should have time to continue investigating this with you on a much more prompt timescale now, if this is something that you still might be able to help me resolve!

geoffrey4444 commented 2 years ago

Hi again,

I tried running the charm++ piArray example on my M1 Mac, to get an idea of what kinds of performance I should expect:

./piArray +p1 31250000 32   ----   At time 61.180054, pi=: 3.141576
./piArray +p8 31250000 32   ----   At time 20.712907, pi=: 3.141562

I verified in Activity Monitor that the +p8 case was running at 800% CPU and that the activity was saturating the 8 performance cores.

Naively, I would have expected the speedup to be 8x, but it's actually only 3x. In contrast, a quick mpi4py version of the problem does give me the expected speedup (26.1 seconds on 1 core, 3.5 seconds on 8 cores).

Maybe it would be better to understand the scaling of the piArray example, instead of worrying about SpECTRE?

jszaday commented 2 years ago

Hello Geoffrey,

MPI-based Charm++ builds do not seem to exhibit these problems on an M1. After a quick

brew install openmpi

followed by

./build charm++ mpi-darwin-arm8 --with-production -g3 -j

I was able to get piArray to scale more sensibly:

erinys :: examples/charm++/piArray ‹main*› % mpirun -n 8 ./piArray 31250000 32                                                                                                                                                                                                  
Charm++> Running on MPI version: 3.1
Charm++> level of thread support used: -1 (desired: 0)
Charm++> Running in non-SMP mode: 8 processes (PEs)
Converse/Charm++ Commit ID: v7.1.0-devel-141-g7e59686a2
Isomalloc> Synchronized global address space.
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 hosts (1 sockets x 10 cores x 1 PUs = 10-way SMP)
Charm++> cpu topology info is gathered in 0.000 seconds.
At time 0.000219, array created.
At time 0.000248, main exits.
At time 7.704571, pi=: 3.141593 
[Partition 0][Node 0] End of program

I am still investigating the issues with "conventional" Charm++ builds (multicore, netlrts), but hopefully this helps!
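
For reference, the netlrts variants mentioned here would be built along the same lines (a sketch; the target names follow the -darwin-arm8 pattern used above, and "smp" is passed as a build option):

# non-SMP and SMP netlrts builds (sketch)
./build charm++ netlrts-darwin-arm8 --with-production -j
./build charm++ netlrts-darwin-arm8 smp --with-production -j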

Interestingly, Instruments.app does not show a clear cause for the slowdown -- most time is spent inside application-level code. Therefore, I'm theorizing this is a thread affinity issue, but more research is required.
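
One way to probe the affinity theory (a sketch for a multicore build; +setcpuaffinity and +pemap are existing Charm++ runtime flags, but whether Darwin honors the requested pinning is exactly what is in question here):

# request explicit PE-to-PU pinning from the Charm++ RTS; PUs 4-7 are the
# performance cores in the lstopo output shown later in this thread
./piArray +p4 +setcpuaffinity +pemap 4-7 31250000 32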

(Screenshot: Instruments trace from 2021-12-10 showing most of the time spent in application-level code.)

geoffrey4444 commented 2 years ago

Thanks for the update! I tried this on my MacBook Pro with version 7.0.0, and I don't notice any improvement in performance with MPI over non-MPI. I tried SMP and non-SMP, and that also made no noticeable difference in performance on multiple cores. Maybe there's something different in the development version you're using? Do you see the speedup in the 7.0.0 release as well?

jszaday commented 2 years ago

Good point. I checked v7.0.0 and it performs comparably on my Mac. I can, at least, demonstrate poor scaling for netlrts and multicore builds, so I'll focus my efforts there. In particular, SMP builds are terribly slow regardless of the build target, so that's something to look into as well.

❯ ./charmrun +p1 ./piArray 31250000 32
Running as 1 OS processes:  ./piArray 31250000 32 
charmrun> mpirun -np 1  ./piArray 31250000 32 
Charm++> Running on MPI version: 3.1
Charm++> level of thread support used: -1 (desired: 0)
Charm++> Running in non-SMP mode: 1 processes (PEs)
Converse/Charm++ Commit ID: v7.0.0
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 hosts (1 sockets x 10 cores x 1 PUs = 10-way SMP)
Charm++> cpu topology info is gathered in 0.000 seconds.
At time 0.000444, array created.
At time 0.000447, main exits.
At time 58.984039, pi=: 3.141622 
[Partition 0][Node 0] End of program
❯ ./charmrun +p8 ./piArray 31250000 32
Running as 8 OS processes:  ./piArray 31250000 32 
charmrun> mpirun -np 8  ./piArray 31250000 32 
Charm++> Running on MPI version: 3.1
Charm++> level of thread support used: -1 (desired: 0)
Charm++> Running in non-SMP mode: 8 processes (PEs)
Converse/Charm++ Commit ID: v7.0.0
Isomalloc> Synchronized global address space.
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 hosts (1 sockets x 10 cores x 1 PUs = 10-way SMP)
Charm++> cpu topology info is gathered in 0.000 seconds.
At time 0.000083, array created.
At time 0.000093, main exits.
At time 7.838483, pi=: 3.141587 
[Partition 0][Node 0] End of program
❯ ompi_info

                 Package: Open MPI brew@HMBRW-A-001-M1-004.local Distribution
                Open MPI: 4.1.2
  Open MPI repo revision: v4.1.2
   Open MPI release date: Nov 24, 2021
                Open RTE: 4.1.2
  Open RTE repo revision: v4.1.2
   Open RTE release date: Nov 24, 2021
                    OPAL: 4.1.2
      OPAL repo revision: v4.1.2
       OPAL release date: Nov 24, 2021
                 MPI API: 3.1.0
            Ident string: 4.1.2
                  Prefix: /opt/homebrew/Cellar/open-mpi/4.1.2
 Configured architecture: aarch64-apple-darwin21.1.0
...

geoffrey4444 commented 2 years ago

Thanks for sending this output! I figured out why my timing was different from yours. For some reason, after building charm++ with the command you gave me (./build charm++ mpi-darwin-arm8 --with-production -g3 -j), the piArray binary wasn't using MPI: when I ran it, it was running in "standalone mode" using threads, which explains why it gave the same performance. Doing make clean; make in the piArray directory fixed this, and now I get the same timing for piArray as you quote above, which makes sense, since the output confirms it's now using MPI.
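
For anyone following along, the fix amounted to the following (a sketch; the exact examples path inside the mpi-darwin-arm8 build tree is an assumption, so adjust it to your layout):

# rebuild the example so it links against the freshly built MPI backend
cd mpi-darwin-arm8/examples/charm++/piArray    # path is an assumption
make clean
make
./charmrun +p8 ./piArray 31250000 32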

evan-charmworks commented 1 year ago

I've tried some different strategies for resolving this within-node performance gap on multicore-darwin-arm8, including things recommended by an official Apple tutorial, but so far have not made any improvement.

I suspect that piArray may not be the best choice of test case due to its use of the CrnDrand function, which can take up to 10% of the execution's CPU time according to Xcode's Instruments profiler. This may have confounded my measurements. @stwhite91 would you recommend a particular Charm++ test/example/benchmark for assessing SMP performance?

One thing also worth noting is that +p8 is not ideal for an M1 Mac, because its CPU consists of 4 performance cores and 4 efficiency cores. For this reason I have been focusing on +p4.

evan@tenacity ~ % lstopo   
Machine (3577MB total)
  Package L#0
    NUMANode L#0 (P#0 3577MB)
    L2 L#0 (4096KB)
      L1d L#0 (64KB) + L1i L#0 (128KB) + Core L#0 + PU L#0 (P#0)
      L1d L#1 (64KB) + L1i L#1 (128KB) + Core L#1 + PU L#1 (P#1)
      L1d L#2 (64KB) + L1i L#2 (128KB) + Core L#2 + PU L#2 (P#2)
      L1d L#3 (64KB) + L1i L#3 (128KB) + Core L#3 + PU L#3 (P#3)
    L2 L#1 (12MB)
      L1d L#4 (128KB) + L1i L#4 (192KB) + Core L#4 + PU L#4 (P#4)
      L1d L#5 (128KB) + L1i L#5 (192KB) + Core L#5 + PU L#5 (P#5)
      L1d L#6 (128KB) + L1i L#6 (192KB) + Core L#6 + PU L#6 (P#6)
      L1d L#7 (128KB) + L1i L#7 (192KB) + Core L#7 + PU L#7 (P#7)
  CoProc(OpenCL) "opencl0d0"

stwhite91 commented 1 year ago

I would say examples/charm++/jacobi3d-2d-decomposition is representative of many HPC applications. It has both medium-sized-message point-to-point and small-message allreduce communication. I'd also look at microbenchmarks like benchmarks/charm++/pingpong and benchmarks/converse/commbench. Commbench in particular has a lot of different measurements that may be relevant for SMP. You could also run examples/ampi/Cjacobi3d and benchmarks/ampi/pingpong.

With all of these you'll want to increase the number of iterations they run for to get consistent timings. We keep the default iteration counts low to speed up functional testing.
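
A rough sketch of running one of these from a build tree (the binary name is assumed from the directory name; check each benchmark's Makefile/README for how to raise the iteration count):

# from inside the Charm++ build directory (e.g. multicore-darwin-arm8)
cd benchmarks/converse/commbench
make
./charmrun +p4 ./commbench    # binary name assumed from the directory name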