NanoComp / meep

free finite-difference time-domain (FDTD) software for electromagnetic simulations
GNU General Public License v2.0

performance degradation due to background processes in shared-memory systems #882

Open oskooi opened 5 years ago

oskooi commented 5 years ago

There seems to be a noticeable performance degradation for an individual simulation due to unrelated background jobs on a single, shared-memory, multi-core machine. This is demonstrated by the following 3d example, which is a slightly modified version of a separate demonstration for computing the light-extraction efficiency of an OLED.

The script is run using 3 processes on an 8-core Kaby Lake 4.2 GHz machine. At various points during the time-stepping, background jobs are launched (using the same script) with 1 and 2 processes. No other tasks/jobs are run, so the full system resources (i.e., CPU and memory) are always underutilized. The main simulation writes its output to disk while the background jobs (launched in separate terminals) do not.

The figure below shows the time-stepping rate (s/step) over the entire run: whenever a background job is launched, there is an immediate and significant slowdown in the time-stepping rate (~100% for the 2-process and ~50% for the 1-process job), which persists for the entire duration that the background job is running/active. As soon as the background job is terminated, the time-stepping rate reverts to its original value. Slowdowns of varying degree also occur during other system-level tasks unrelated to Meep, such as opening file folders, application programs, browser windows, etc.

One would naively expect that on an underutilized shared-memory system, the performance of an individual simulation should remain fairly constant and not be affected by background system activity. However, this does not seem to be the case.

import meep as mp
from meep.materials import Al
import cmath
import random

resolution = 40             # pixels/um                                                                                                                                                                                     

lambda_min = 0.4            # minimum source wavelength                                                                                                                                                                     
lambda_max = 0.8            # maximum source wavelength                                                                                                                                                                     
fmin = 1/lambda_max         # minimum source frequency                                                                                                                                                                      
fmax = 1/lambda_min         # maximum source frequency                                                                                                                                                                      
fcen = 0.5*(fmin+fmax)      # source frequency center                                                                                                                                                                       
df = fmax-fmin              # source frequency width                                                                                                                                                                        

tABS = lambda_max           # absorber/PML thickness                                                                                                                                                                        
tGLS = 1.0                  # glass thickness                                                                                                                                                                               
tITO = 0.1                  # indium tin oxide thickness                                                                                                                                                                    
tORG = 0.1                  # organic thickness                                                                                                                                                                             
tAl = 0.1                   # aluminum thickness                                                                                                                                                                            
L = 4.0                     # length of OLED                                                                                                                                                                                

# length of computational cell along Z                                                                                                                                                                                      
sz = tABS+tGLS+tITO+tORG+tAl
# length of non-absorbing region of computational cell in X and Y                                                                                                                                                           
sxy = L+2*tABS
cell_size = mp.Vector3(sxy,sxy,sz)

boundary_layers = [mp.Absorber(tABS,direction=mp.X),
                   mp.Absorber(tABS,direction=mp.Y),
                   mp.PML(tABS,direction=mp.Z,side=mp.High)]

ORG = mp.Medium(index=1.75)
ITO = mp.Medium(index=1.80)
GLS = mp.Medium(index=1.45)

geometry = [mp.Block(material=GLS, size=mp.Vector3(mp.inf,mp.inf,tABS+tGLS), center=mp.Vector3(0,0,0.5*sz-0.5*(tABS+tGLS))),
            mp.Block(material=ITO, size=mp.Vector3(mp.inf,mp.inf,tITO), center=mp.Vector3(0,0,0.5*sz-tABS-tGLS-0.5*tITO)),
            mp.Block(material=ORG, size=mp.Vector3(mp.inf,mp.inf,tORG), center=mp.Vector3(0,0,0.5*sz-tABS-tGLS-tITO-0.5*tORG)),
            mp.Block(material=Al, size=mp.Vector3(mp.inf,mp.inf,tAl), center=mp.Vector3(0,0,0.5*sz-tABS-tGLS-tITO-tORG-0.5*tAl))]

src_cmpt = mp.Ez
symmetries = [mp.Mirror(mp.X,+1), mp.Mirror(mp.Y,+1)]

num_src = 10                # number of point sources
# point dipole sources with random phases, distributed in depth within the organic layer
sources = []
for n in range(num_src):
    sources.append(mp.Source(mp.GaussianSource(fcen, fwidth=df), component=src_cmpt,
                             center=mp.Vector3(0,0,0.5*sz-tABS-tGLS-tITO-0.4*tORG-0.2*tORG*n/num_src),
                             amplitude=cmath.exp(2*cmath.pi*random.random()*1j)))

sim = mp.Simulation(resolution=resolution,
                    cell_size=cell_size,
                    boundary_layers=boundary_layers,
                    geometry=geometry,
                    sources=sources,
                    force_complex_fields=True,
                    symmetries=symmetries,
                    split_chunks_evenly=False)

# number of frequency bins for DFT fields                                                                                                                                                                                   
nfreq = 21

# surround source with a six-sided box of flux planes                                                                                                                                                                       
srcbox_width = 0.05
srcbox_top = sim.add_flux(fcen, df, nfreq, mp.FluxRegion(center=mp.Vector3(0,0,0.5*sz-tABS-tGLS), size=mp.Vector3(srcbox_width,srcbox_width,0), direction=mp.Z, weight=+1))
srcbox_bot = sim.add_flux(fcen, df, nfreq, mp.FluxRegion(center=mp.Vector3(0,0,0.5*sz-tABS-tGLS-tITO-0.8*tORG), size=mp.Vector3(srcbox_width,srcbox_width,0), direction=mp.Z, weight=-1))
srcbox_xp = sim.add_flux(fcen, df, nfreq, mp.FluxRegion(center=mp.Vector3(0.5*srcbox_width,0,0.5*sz-tABS-tGLS-0.5*(tITO+0.8*tORG)), size=mp.Vector3(0,srcbox_width,tITO+0.8*tORG), direction=mp.X, weight=+1))
srcbox_xm = sim.add_flux(fcen, df, nfreq, mp.FluxRegion(center=mp.Vector3(-0.5*srcbox_width,0,0.5*sz-tABS-tGLS-0.5*(tITO+0.8*tORG)), size=mp.Vector3(0,srcbox_width,tITO+0.8*tORG), direction=mp.X, weight=-1))
srcbox_yp = sim.add_flux(fcen, df, nfreq, mp.FluxRegion(center=mp.Vector3(0,0.5*srcbox_width,0.5*sz-tABS-tGLS-0.5*(tITO+0.8*tORG)), size=mp.Vector3(srcbox_width,0,tITO+0.8*tORG), direction=mp.Y, weight=+1))
srcbox_ym = sim.add_flux(fcen, df, nfreq, mp.FluxRegion(center=mp.Vector3(0,-0.5*srcbox_width,0.5*sz-tABS-tGLS-0.5*(tITO+0.8*tORG)), size=mp.Vector3(srcbox_width,0,tITO+0.8*tORG), direction=mp.Y, weight=-1))

# padding for flux box to fully capture waveguide mode                                                                                                                                                                      
fluxbox_dpad = 0.05

glass_flux = sim.add_flux(fcen, df, nfreq, mp.FluxRegion(center=mp.Vector3(0,0,0.5*sz-tABS-(tGLS-fluxbox_dpad)), size = mp.Vector3(L,L,0), direction=mp.Z, weight=+1))
wvgbox_xp = sim.add_flux(fcen, df, nfreq, mp.FluxRegion(size=mp.Vector3(0,L,fluxbox_dpad+tITO+tORG+fluxbox_dpad), direction=mp.X, center=mp.Vector3(0.5*L,0,0.5*sz-tABS-tGLS-0.5*(tITO+tORG)), weight=+1))
wvgbox_xm = sim.add_flux(fcen, df, nfreq, mp.FluxRegion(size=mp.Vector3(0,L,fluxbox_dpad+tITO+tORG+fluxbox_dpad), direction=mp.X, center=mp.Vector3(-0.5*L,0,0.5*sz-tABS-tGLS-0.5*(tITO+tORG)), weight=-1))
wvgbox_yp = sim.add_flux(fcen, df, nfreq, mp.FluxRegion(size=mp.Vector3(L,0,fluxbox_dpad+tITO+tORG+fluxbox_dpad), direction=mp.Y, center=mp.Vector3(0,0.5*L,0.5*sz-tABS-tGLS-0.5*(tITO+tORG)), weight=+1))
wvgbox_ym = sim.add_flux(fcen, df, nfreq, mp.FluxRegion(size=mp.Vector3(L,0,fluxbox_dpad+tITO+tORG+fluxbox_dpad), direction=mp.Y, center=mp.Vector3(0,-0.5*L,0.5*sz-tABS-tGLS-0.5*(tITO+tORG)), weight=-1))

sim.run(until=100)

[figure: time-stepping rate fluctuations during background jobs]

stevengj commented 5 years ago

Maybe the operating system is migrating the Meep processes between cores once you start doing other stuff on the machine. You might be able to avoid this by setting the processor affinity (see e.g. https://aciref.org/how-to-gain-hybrid-mpi-openmp-code-performance-without-changing-a-line-of-code-a-k-a-dealing-with-task-affinity/ and https://www.glennklockwood.com/hpc-howtos/process-affinity.html).
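
For reference, one way to experiment with pinning from inside the Python script itself (rather than via mpirun flags) is os.sched_setaffinity. This is only a sketch, assuming a Linux host and that mpi4py is importable alongside the MPI build of Meep; adjust the core list to the machine's topology:

# sketch: pin each MPI rank to its own core (Linux-only); assumes mpi4py is available
import os
from mpi4py import MPI

rank = MPI.COMM_WORLD.Get_rank()
cores = [0, 1, 2, 3]                      # e.g. one hardware thread per physical core
os.sched_setaffinity(0, {cores[rank % len(cores)]})
print("rank {} pinned to CPUs {}".format(rank, os.sched_getaffinity(0)))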

The main simulation writes its output to disk

I don't see any output in the script you posted?

oskooi commented 5 years ago

Hyperthreading is enabled on my machine which means there are in fact just 4 physical cores:

> lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  2
Core(s) per socket:  4
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               158
Model name:          Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz

CPUs 0-3 each correspond to a distinct physical core, as determined by:

> grep -H . /sys/devices/system/cpu/cpu*/topology/thread_siblings_list | sort -n -t ',' -k 2 -u
/sys/devices/system/cpu/cpu0/topology/thread_siblings_list:0,4
/sys/devices/system/cpu/cpu1/topology/thread_siblings_list:1,5
/sys/devices/system/cpu/cpu2/topology/thread_siblings_list:2,6
/sys/devices/system/cpu/cpu3/topology/thread_siblings_list:3,7

Based on this information, the test script is launched using 2 processes on CPUs 0 and 1 with processor affinity via mpirun's --bind-to core option (which is the default for <=2 processes):

mpirun -np 2 --cpu-set 0,1 --bind-to core python3.5 oled_test.py

Next, a serial "background" process on CPU 2 is launched:

mpirun -np 1 --cpu-set 2 --bind-to core python3.5 oled_test.py

However, the slowdown still occurs, resulting in a 50% increase in the time-stepping rate. I also tried replacing --bind-to core with --bind-to hwthread and the results were similar.

oskooi commented 5 years ago

The same behavior is observed on AWS EC2 (Simpetus AMI, Ubuntu 16.04, c4.2xlarge instance) with hyperthreading disabled:

#!/bin/bash

for cpunum in $(cat /sys/devices/system/cpu/cpu*/topology/thread_siblings_list | cut -s -d, -f2- | tr ',' '\n' | sort -un)
do
        echo 0 > /sys/devices/system/cpu/cpu$cpunum/online
done
> lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-3
Off-line CPU(s) list:  4-7
Thread(s) per core:    1
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 63
Model name:            Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz
> lscpu --extended
CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE
0   0    0      0    0:0:0:0       yes
1   0    0      1    1:1:1:0       yes
2   0    0      2    2:2:2:0       yes
3   0    0      3    3:3:3:0       yes
4   -    -      -    :::           no
5   -    -      -    :::           no
6   -    -      -    :::           no
7   -    -      -    :::           no

With only 4 cores now online, the first (main) job is launched on CPUs 0 and 1, followed later by the second (background) job on CPUs 2 and 3:

### main job
mpirun -np 2 --cpu-set 0,1 --bind-to core --report-bindings python3.5 oled_test.py
### background job
mpirun -np 2 --cpu-set 2,3 --bind-to core --report-bindings python3.5 oled_test.py

The main job has a time-stepping rate of ~0.16 s/step when run independently. With the background job running simultaneously, the rate slows down to ~0.27 s/step, a ~70% slowdown (both jobs eventually reach the same rate). As soon as the background job is stopped, the rate reverts to its original value.

oskooi commented 5 years ago

As an additional observation, with hyperthreading disabled and running with 2 processes (on a system with 4 available physical cores), the results (i.e., the time-stepping rate) are the same using --bind-to core and --bind-to none. This seems to suggest that process migration among cores is not the cause of the slowdown.

--bind-to core

> mpirun -np 2 --bind-to core --report-bindings python oled_test.py
> for i in $(pgrep python); do ps -mo pid,tid,psr -p $i;done
  PID   TID PSR
10518     -   -
    - 10518   0
    - 10521   0
    - 10523   0
    - 10525   0
    - 10527   0
    - 10529   0
    - 10531   0
    - 10533   0
    - 10536   0
    - 10537   0
  PID   TID PSR
10519     -   -
    - 10519   1
    - 10520   1
    - 10522   1
    - 10524   1
    - 10526   1
    - 10528   1
    - 10530   1
    - 10532   1
    - 10534   1
    - 10535   1

--bind-to none

> mpirun -np 2 --bind-to none --report-bindings python oled_test.py
> for i in $(pgrep python); do ps -mo pid,tid,psr -p $i;done
  PID   TID PSR
10966     -   -
    - 10966   0
    - 10968   0
    - 10969   2
    - 10970   1
    - 10981   0
    - 10982   3
    - 10983   0
    - 10984   3
    - 10985   0
    - 10986   1
    - 10987   1
    - 10988   0
    - 10989   3
  PID   TID PSR
10967     -   -
    - 10967   1
    - 10971   1
    - 10972   2
    - 10973   0
    - 10974   3
    - 10975   1
    - 10976   0
    - 10977   0
    - 10978   1
    - 10979   2
    - 10980   0
    - 10990   3
    - 10991   2
ChristopherHogan commented 5 years ago

It's strange that each of your processes has so many threads. When I run the same command, each process only has one thread associated with it:

$ mpirun -n 2 python oled_ext_eff.py
$ for i in $(pgrep python); do ps -mo pid,tid,psr -p $i; done
  PID   TID PSR
  974     -   -
    -   974   0
  PID   TID PSR
  975     -   -
    -   975   1

Moreover, I don't see the slowdown that you experience when running multiple jobs. Do you see the same behavior with the conda package? I assume you're building from source.

oskooi commented 5 years ago

The results above were from building from source. However, the same phenomenon is observed using the Conda packages installed on a clean c4.2xlarge instance running Ubuntu 16.04 via conda create -n pmp -c conda-forge pymeep=*=mpi_mpich_*. Hyperthreading is disabled but the MPICH package does not seem to support the --bind-to core option (which, as reported previously, does not make a difference). There are still multiple threads per process (but fewer than previously).

> mpirun -n 2 python -u oled_ext_eff.py
> for i in $(pgrep python); do ps -mo pid,tid,psr -p $i; done
  PID   TID PSR
 2573     -   -
    -  2573   1
    -  2575   1
    -  2578   2
    -  2579   2
  PID   TID PSR
 2574     -   -
    -  2574   3
    -  2576   3
    -  2577   1
    -  2580   3

The time-stepping rate is ~0.10 s/step for a single job which then slows down to ~0.14 s/step when an identical background job is running simultaneously. When the background job is stopped, the time-stepping rate reverts to ~0.10 s/step.

oskooi commented 5 years ago

Actually, the --bind-to core option is supported by MPICH, and it does constrain the job to one thread per process. However, the slowdown effect is still present, and to an even greater extent: ~0.27 s/step, or ~200%.

> mpirun -n 2 --bind-to core python -u oled_ext_eff.py
> for i in $(pgrep python); do ps -mo pid,tid,psr -p $i; done
  PID   TID PSR
 2688     -   -
    -  2688   0
  PID   TID PSR
 2689     -   -
    -  2689   1
stevengj commented 5 years ago

We might just have to accept this. Modern hardware is complicated, and we don't have control over things like cache contention between processes; we probably just have to let the OS (and things like MPI) do the best they can…

ChristopherHogan commented 5 years ago

Of the 11,582 samples I've collected, 399 contain a change from one timestep to the next that's greater than 10%. So we're only seeing a performance degradation in 3.4% of the collected data.
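
For anyone who wants to reproduce this kind of tally from a log file, a rough sketch is below. It assumes the progress lines have the form "on time step N (time=T), X s/step", which may vary between Meep versions:

# sketch: count step-to-step changes in the s/step rate larger than 10%
import re
import sys

rates = []
with open(sys.argv[1]) as f:
    for line in f:
        m = re.search(r"([0-9.eE+-]+) s/step", line)
        if m:
            rates.append(float(m.group(1)))

changes = [abs(b - a)/a for a, b in zip(rates, rates[1:])]
big = sum(c > 0.10 for c in changes)
print("{} of {} samples changed by more than 10%".format(big, len(changes)))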

oskooi commented 5 years ago

There has been some work already (pdf) to quantify the effect of cache contention on application performance. The focus is the last-level cache (e.g., L3, L4, etc.), which tends to be shared among the cores.

It is worth noting that all the tests thus far demonstrating a slowdown in the time-stepping rate have involved single-socket, multi-core machines, which have some form of shared cache. As an additional check, perhaps we should try running the tests on a shared-memory system with a non-shared cache?
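
As a quick way to check which CPUs share the last-level cache on a given machine, something like the following could be used (Linux-only; it reads the same sysfs topology files queried above):

# sketch: report which CPUs share each cache level, via Linux sysfs
import glob

for index in sorted(glob.glob("/sys/devices/system/cpu/cpu0/cache/index*")):
    level = open(index + "/level").read().strip()
    ctype = open(index + "/type").read().strip()
    size = open(index + "/size").read().strip()
    shared = open(index + "/shared_cpu_list").read().strip()
    print("L{} {:12s} {:>8s} shared by CPUs {}".format(level, ctype, size, shared))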

oskooi commented 5 years ago

It seems that the choice of compiler optimization level has an impact on the degree of slowdown: the more aggressive the optimization, the larger the slowdown due to background processes. The following are the ("steady-state") time-stepping rates for a serial job (i.e., no MPI) based on the OLED example above, using processor binding via numactl, for one isolated job and for two simultaneous (identical) jobs (on a 4-core 4.2 GHz Kaby Lake system with hyperthreading disabled), with three different GCC optimization settings.

                  1 job (s/step)    2 jobs (s/step)    slowdown
--enable-debug    0.234191          0.272135           16.2%
CXXFLAGS=-O2      0.149296          0.241989           62.1%
CXXFLAGS=-O3      0.149782          0.237891           58.8%

Thanks to @HomerReid for suggesting this benchmarking comparison.

stevengj commented 5 years ago

However, it is still faster in absolute terms with -O3 than with --enable-debug. The upshot, I think, is simply that when your code is faster, it is harder to parallelize and the remaining performance is easier to disrupt.

jeinarsson commented 4 years ago

I believe the symptoms described could be consistent with total memory bandwidth being saturated. In particular, it's consistent with the fact that slower (debug) simulations co-exist better.

Let's say FDTD reads ~20 doubles per cell per time step; then 100 GB/s of memory bandwidth allows at most 625 million cell updates per second at 100% efficiency. The actual efficiency is bound to be lower, even in the most optimistic case. Once several uncoordinated processes compete for memory reads and cache space, it's not unreasonable to expect slowdowns.
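
To make the arithmetic explicit, here is the back-of-the-envelope estimate as a few lines of Python (the 20 doubles per cell update and 100 GB/s bandwidth are the rough assumptions stated above):

# back-of-the-envelope estimate of a bandwidth-limited update rate
bytes_per_double = 8
doubles_per_cell_update = 20         # rough assumption
bandwidth = 100e9                    # bytes/s, order-of-magnitude figure
traffic = doubles_per_cell_update * bytes_per_double    # 160 bytes per cell update
print("{:.0f} M cell updates/s at 100% efficiency".format(bandwidth / traffic / 1e6))  # ~625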

For simple but large simulations (65M Yee cells, dielectrics only) I have seen Meep do ~220M Yee-cell updates per second on 8 cores. I haven't tried running two of those at the same time on 16 cores, but I would expect both to slow down due to the total memory bandwidth.

Monitoring memory bandwidth seems non-trivial. Intel has some tools for Xeon processors. STREAM seems to be the standard benchmark for measuring it on a specific machine. (The 100 GB/s figure above is only an order-of-magnitude estimate.)

ChristopherHogan commented 4 years ago

You could take a look at Intel's VTune to get some insight into memory characteristics. The Intel compiler and MPI seem to be free now.