oskooi opened this issue 5 years ago
Maybe the operating system is migrating the Meep processes between cores once you start doing other stuff on the machine. You might be able to avoid this by setting the processor affinity (see e.g. https://aciref.org/how-to-gain-hybrid-mpi-openmp-code-performance-without-changing-a-line-of-code-a-k-a-dealing-with-task-affinity/ and https://www.glennklockwood.com/hpc-howtos/process-affinity.html).
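For a serial (non-MPI) process, the pinning can also be done directly from the shell. A minimal sketch, assuming the script name from this thread; the core IDs are just placeholders:
# Pin a serial run to core 0 using taskset ...
taskset -c 0 python oled_test.py
# ... or, equivalently, with numactl
numactl --physcpubind=0 python oled_test.py
For MPI runs, the binding is instead handled by the launcher (e.g., mpirun's --bind-to option, as used below).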
> The main simulation writes its output to disk
I don't see any output in the script you posted?
Hyperthreading is enabled on my machine, which means there are in fact only 4 physical cores:
> lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 158
Model name: Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz
CPUs 0-3 are the first hyperthread of each physical core, as determined from the thread-sibling lists:
> grep -H . /sys/devices/system/cpu/cpu*/topology/thread_siblings_list | sort -n -t ',' -k 2 -u
/sys/devices/system/cpu/cpu0/topology/thread_siblings_list:0,4
/sys/devices/system/cpu/cpu1/topology/thread_siblings_list:1,5
/sys/devices/system/cpu/cpu2/topology/thread_siblings_list:2,6
/sys/devices/system/cpu/cpu3/topology/thread_siblings_list:3,7
Based on this information, the test script is launched using 2 processes on CPUs 0 and 1 with processor affinity via mpirun's --bind-to core option (which is the default for <= 2 processes):
mpirun -np 2 --cpu-set 0,1 --bind-to core python3.5 oled_test.py
Next, a serial "background" process on CPU 2 is launched:
mpirun -np 1 --cpu-set 2 --bind-to core python3.5 oled_test.py
However, the slowdown still occurs, resulting in a ~50% increase in the time per step. I also tried replacing --bind-to core with --bind-to hwthread and the results were similar.
The same behavior is observed on AWS EC2 (via the Simpetus AMI running Ubuntu 16.04) using a c4.2xlarge instance with hyperthreading disabled:
#!/bin/bash
# Take all but the first hyperthread of each physical core offline (requires root).
for cpunum in $(cat /sys/devices/system/cpu/cpu*/topology/thread_siblings_list | cut -s -d, -f2- | tr ',' '\n' | sort -un)
do
    echo 0 > /sys/devices/system/cpu/cpu$cpunum/online
done
> lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-3
Off-line CPU(s) list: 4-7
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 63
Model name: Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz
> lscpu --extended
CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE
0 0 0 0 0:0:0:0 yes
1 0 0 1 1:1:1:0 yes
2 0 0 2 2:2:2:0 yes
3 0 0 3 3:3:3:0 yes
4 - - - ::: no
5 - - - ::: no
6 - - - ::: no
7 - - - ::: no
With the 4 CPUs that remain online, the first (main) job is launched on CPUs 0 and 1, followed later by the second (background) job on CPUs 2 and 3:
### main job
mpirun -np 2 --cpu-set 0,1 --bind-to core --report-bindings python3.5 oled_test.py
### background job
mpirun -np 2 --cpu-set 2,3 --bind-to core --report-bindings python3.5 oled_test.py
The main job has a time-stepping rate of ~0.16 s/step when run independently. With the background job running simultaneously, this slows to ~0.27 s/step, a ~70% increase in the time per step (both jobs eventually reach the same rate). As soon as the background job is stopped, the rate reverts to its original value.
As an additional observation, with hyperthreading disabled and running with 2 processes (on a system with 4 available physical cores), the results (i.e., the time-stepping rate) are the same using --bind-to core and --bind-to none. This seems to suggest that process migration among cores is not the cause of the slowdown.
--bind-to core
> mpirun -np 2 --bind-to core --report-bindings python oled_test.py
> for i in $(pgrep python); do ps -mo pid,tid,psr -p $i;done
PID TID PSR
10518 - -
- 10518 0
- 10521 0
- 10523 0
- 10525 0
- 10527 0
- 10529 0
- 10531 0
- 10533 0
- 10536 0
- 10537 0
PID TID PSR
10519 - -
- 10519 1
- 10520 1
- 10522 1
- 10524 1
- 10526 1
- 10528 1
- 10530 1
- 10532 1
- 10534 1
- 10535 1
--bind-to none
> mpirun -np 2 --bind-to none --report-bindings python oled_test.py
> for i in $(pgrep python); do ps -mo pid,tid,psr -p $i;done
PID TID PSR
10966 - -
- 10966 0
- 10968 0
- 10969 2
- 10970 1
- 10981 0
- 10982 3
- 10983 0
- 10984 3
- 10985 0
- 10986 1
- 10987 1
- 10988 0
- 10989 3
PID TID PSR
10967 - -
- 10967 1
- 10971 1
- 10972 2
- 10973 0
- 10974 3
- 10975 1
- 10976 0
- 10977 0
- 10978 1
- 10979 2
- 10980 0
- 10990 3
- 10991 2
It's strange that each of your processes has so many threads. When I run the same command, each process only has one thread associated with it:
$ mpirun -n 2 python oled_ext_eff.py
$ for i in $(pgrep python); do ps -mo pid,tid,psr -p $i; done
PID TID PSR
974 - -
- 974 0
PID TID PSR
975 - -
- 975 1
Moreover, I don't see the slowdown that you experience when running multiple jobs. Do you see the same behavior with the conda package? I assume you're building from source.
The results above were from building from source. However, the same phenomenon is observed using the Conda packages installed on a clean c4.2xlarge instance running Ubuntu 16.04 via conda create -n pmp -c conda-forge pymeep=*=mpi_mpich_*. Hyperthreading is disabled, but the MPICH package does not seem to support the --bind-to core option (which, as reported previously, does not make a difference). There are still multiple threads per process (though fewer than before).
> mpirun -n 2 python -u oled_ext_eff.py
> for i in $(pgrep python); do ps -mo pid,tid,psr -p $i; done
PID TID PSR
2573 - -
- 2573 1
- 2575 1
- 2578 2
- 2579 2
PID TID PSR
2574 - -
- 2574 3
- 2576 3
- 2577 1
- 2580 3
The time-stepping rate is ~0.10 s/step for a single job which then slows down to ~0.14 s/step when an identical background job is running simultaneously. When the background job is stopped, the time-stepping rate reverts to ~0.10 s/step.
Actually, the --bind-to core option is supported by MPICH, and it does constrain the job to 1 thread per process. However, the slowdown effect is still present, and to an even greater extent: ~0.27 s/step, or ~200%.
> mpirun -n 2 --bind-to core python -u oled_ext_eff.py
> for i in $(pgrep python); do ps -mo pid,tid,psr -p $i; done
PID TID PSR
2688 - -
- 2688 0
PID TID PSR
2689 - -
- 2689 1
We might just have to accept this. Modern hardware is complicated, and we don't have control over things like cache contention between processes; we probably just have to let the OS (and things like MPI) do the best they can…
Of the 11,582 samples I've collected, 399 contain a change from one time step to the next that's greater than 10%. So we're only seeing a performance degradation in 3.4% of the collected data.
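For reference, a sketch of how such a fraction can be computed from a per-step timing log; the file name steps.txt and its one-positive-value-per-line format are assumptions:
# Count steps whose time differs from the previous step by more than 10%
# (assumes steps.txt holds one time-per-step value per line; name is illustrative).
awk 'NR > 1 { d = $1 - prev; if (d < 0) d = -d; if (d / prev > 0.10) n++ }
     { prev = $1 }
     END { printf "%d of %d samples (%.1f%%) changed by >10%%\n", n, NR - 1, 100 * n / (NR - 1) }' steps.txt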
There has been some work already (pdf) to quantify the effect of cache contention on application performance. The focus is the last-level cache (e.g., L3, L4, etc.) which tends to be shared among the cores.
It is worth noting that all the tests thus far demonstrating a slowdown in the time-stepping rate have involved single-socket, multi-core machines which involve some form of shared cache. As an additional check, perhaps we should try running the tests on a shared-memory system with non-shared cache?
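One way to probe this directly might be to sample last-level-cache traffic of the main job while the background job is toggled on and off. A minimal sketch using perf; the event names vary by CPU (check perf list first) and <pid> is a placeholder:
# Sample last-level-cache loads/misses of a running process for 30 s
# (requires linux-tools; event names are Intel-style generics and may need adjusting).
perf stat -e LLC-loads,LLC-load-misses,LLC-stores -p <pid> sleep 30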
It seems that the choice of compiler optimization has an impact on the degree of slowdown: the more aggressive the optimization, the larger the slowdown due to background processes. The following are the ("steady-state") time-stepping rates for a serial job (i.e., no MPI) based on the OLED example from above, using processor binding via numactl (a sketch of the binding commands follows the results below), for one isolated job and two simultaneous identical jobs (on a 4-core 4.2 GHz Kaby Lake system with hyperthreading disabled) and three different GCC optimization flags.
--enable-debug
1 job: 0.234191 s/step
2 jobs: 0.272135 s/step
slowdown: 16.2%
CXXFLAGS=-O2
1 job: 0.149296 s/step
2 jobs: 0.241989 s/step
slowdown: 62.1%
CXXFLAGS=-O3
1 job: 0.149782 s/step
2 jobs: 0.237891 s/step
slowdown: 58.8%
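For the two-job case, the binding was along these lines (a sketch; the core IDs and log-file names are illustrative):
# Two identical serial jobs, each pinned to its own physical core with numactl.
numactl --physcpubind=0 python oled_test.py > job0.log 2>&1 &
numactl --physcpubind=1 python oled_test.py > job1.log 2>&1 &
wait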
Thanks to @HomerReid for suggesting this benchmarking comparison.
However, it is still faster in absolute terms with -O3 than with --enable-debug. The upshot, I think, is simply that when your code is faster, it is harder to parallelize and the remaining performance is easier to disrupt.
I believe the symptoms described could be consistent with total memory bandwidth being saturated. In particular, it's consistent with the fact that slower (debug) simulations co-exist better.
Let's say FDTD reads ~20 doubles per cell per time step; then 100 GB/s of memory bandwidth leads to 625 million cell updates per second at 100% efficiency. The efficiency is bound to be lower, even in the most optimistic case. Once several uncoordinated processes compete for memory reads and cache space, it's not unreasonable to expect slowdowns.
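As a quick check of that arithmetic (the 20-doubles-per-cell and 100 GB/s figures are the assumptions stated above, not measured values):
# Bandwidth-bound upper estimate of cell updates per second.
awk 'BEGIN {
  bytes_per_cell = 20 * 8            # assumed: 20 doubles read per cell per step
  bandwidth = 100e9                  # assumed: 100 GB/s memory bandwidth
  printf "upper bound: %.0f million cell updates/s\n", bandwidth / bytes_per_cell / 1e6
}'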
For simple but large runs (65M Yee cells, dielectrics only) I have seen Meep do ~220M Yee-cell updates per second on 8 cores. I haven't tried running two of those at the same time on 16 cores, but I would expect them both to go slower due to total memory bandwidth.
Monitoring memory bandwidth seems non-trivial. Intel has some tools for Xeon processors. STREAM seems to be the standard benchmark for measuring a specific machine. (The 100 GB/s figure above is only an order-of-magnitude estimate.)
You could take a look at Intel's VTune to get some insight into memory characteristics. The Intel compiler and MPI seem to be free now.
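For a machine-specific bandwidth number, a minimal STREAM run looks roughly like the following; the download URL, array-size macro, and thread count are assumptions (older versions of stream.c use a different size macro, and the array should be several times larger than the last-level cache):
# Build and run STREAM to estimate sustainable memory bandwidth.
wget https://www.cs.virginia.edu/stream/FTP/Code/stream.c
gcc -O3 -fopenmp -DSTREAM_ARRAY_SIZE=80000000 stream.c -o stream
OMP_NUM_THREADS=4 ./stream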
There seems to be a noticeable performance degradation for an individual simulation due to unrelated background jobs on a single, shared-memory, multi-core machine. This is demonstrated by the following 3d example, which is a slightly modified version of a separate demonstration for computing the light-extraction efficiency of an OLED.
The script is run using 3 processes on an 8-core Kaby Lake 4.2 GHz machine. During various stages throughout the time-stepping, background jobs are launched (using the same script) using 1 and 2 processes. No other tasks/jobs are run to ensure that the full system resources (i.e., CPU and memory) are always underutilized. The main simulation writes its output to disk while the background jobs (launched in separate terminals) do not.
The figure below shows the time-stepping rate (s/step) throughout the entire interval: whenever a background job is launched, there is an immediate and significant slowdown in the time-stepping rate (~100% for the 2-process and ~50% for the 1-process job) which persists for the entire duration that the background job is running/active. As soon as the background job is terminated, the time-stepping rate reverts to its original value. Slowdowns also occur, to varying degrees, during other system-level tasks unrelated to Meep, such as opening file folders, application programs, browser windows, etc.
One would naively expect that for the case of an underutilized shared-memory system, the performance of an individual simulation should remain fairly constant and not be affected by background system activity. However, this does not seem to be the case.
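For concreteness, the runs were launched along these lines (a sketch only; the script name is taken from later in this thread, and the output redirection is illustrative):
### main job (3 processes, output logged to disk)
mpirun -np 3 python oled_ext_eff.py > main.log
### background job, started and stopped during the main run (1 or 2 processes, no output kept)
mpirun -np 2 python oled_ext_eff.py > /dev/null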