Gumi-presentation-by-Dzh opened this issue 6 years ago
What should I set when testing MPI programs, apart from the EMUL_LOCAL_PROCESSES environment variable?
We tried the MPI demo "helloworld", but there is a fatal error in MPI_Finalize. Do we need to set something else when testing MPI programs?
Here is the test source.
```c
#include <stdio.h>
#include <mpi.h>

int main (int argc, char **argv)
{
    int myid, numprocs;
    int namelen;
    char processor_name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init (&argc, &argv);
    MPI_Comm_rank (MPI_COMM_WORLD, &myid);
    MPI_Comm_size (MPI_COMM_WORLD, &numprocs);
    MPI_Get_processor_name (processor_name, &namelen);
    fprintf (stderr, "Hello World! Process %d of %d on %s\n",
             myid, numprocs, processor_name);
    MPI_Finalize ();
    return 0;
}
```
Hi @Gumi-presentation-by-Dzh, how many CPUs and cores per CPU do you have? What value of EMUL_LOCAL_PROCESSES are you setting, and how many MPI processes/ranks are you spawning? Which MPI library and version are you using?
The best configuration is to set EMUL_LOCAL_PROCESSES to the same number of processes spawned by the MPI program. The recommendation is to spawn as many processes as there are cores on a single CPU chip. Also try different MPI libraries.
Please set the debug level to 5 and send the entire log.
We have two CPUs and 10 cores per CPU. We spawn 4 MPI processes and set EMUL_LOCAL_PROCESSES to 4 as you suggested. But even though I exported EMUL_LOCAL_PROCESSES=4, the debug log said "No EMUL_LOCAL_PROCESSES variable set, skipping rank setting". Why?
Did you export this variable in the same shell where you execute the MPI program?

$ export EMUL_LOCAL_PROCESSES=4
Yes. I did it again, and it said:
[13869] DEBUG: setting process local rank for 4 local processes
[13869] ERROR: failed to set process local rank
[14224] DEBUG: [14224] INFO: Detected CPU model 'Haswell Xeon'
[14224] DEBUG: Possible NUMA nodes are 1024
[14224] DEBUG: NUMA nodes allowed are 1024
[14224] DEBUG: NUMA configured CPUs are 20
[14224] INFO: 10 CPUs on physical node 0
[14224] INFO: 10 CPUs on physical node 1
[14224] INFO: Loading memory-controller pci topology from /tmp/mc_pci_bus
[14224] INFO: node: 1, pci addr: 7f:14.0
[14224] INFO: node: 1, pci bus: 0x7f
[14224] INFO: node: 1, pci addr: 7f:14.1
[14224] INFO: node: 1, pci addr: 7f:17.0
[14224] INFO: node: 1, pci addr: 7f:17.1
[14224] WARNING: No complete memory-controller pci topology found in /tmp/mc_pci_bus
[14224] INFO: Number of physical nodes 2
[14224] INFO: Number of virtual nodes 1
[14224] INFO: measuring latency: nchains 1, nelems 4096000, elem_sz 64, access_sz 8, from_node_id 0, to_node_id 0
[14224] INFO: measuring latency: latency is 81 ns
[14224] INFO: measuring latency: nchains 1, nelems 4096000, elem_sz 64, access_sz 8, from_node_id 0, to_node_id 1
[14224] INFO: measuring latency: latency is 122 ns
[14224] INFO: Fusing physical nodes 0 1 into virtual node 0
[14224] INFO: Initializing latency model
[14224] DEBUG: setting process local rank for 4 local processes
[14224] ERROR: failed to set process local rank
Please manually remove the "/tmp/emul_lock_file" and "/tmp/emul_process_local_rank" files. The stale lock file prevents a new process from setting its rank. When processes crash, they may leave these files behind, so they need to be cleaned up manually.
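For example (both paths are the ones reported by the emulator in the errors above):

```sh
# remove stale emulator state left behind by crashed processes
rm -f /tmp/emul_lock_file /tmp/emul_process_local_rank
```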
I removed these two files, but it still doesn't work: the MPI program produces no output and never terminates.
The debug message is:

[14601] DEBUG: [14601] INFO: Detected CPU model 'Haswell Xeon'
[14601] DEBUG: Possible NUMA nodes are 1024
[14601] DEBUG: NUMA nodes allowed are 1024
[14601] DEBUG: NUMA configured CPUs are 20
[14601] INFO: 10 CPUs on physical node 0
[14601] INFO: 10 CPUs on physical node 1
[14601] INFO: Loading memory-controller pci topology from /tmp/mc_pci_bus
[14601] INFO: node: 1, pci addr: 7f:14.0
[14601] INFO: node: 1, pci bus: 0x7f
[14601] INFO: node: 1, pci addr: 7f:14.1
[14601] INFO: node: 1, pci addr: 7f:17.0
[14601] INFO: node: 1, pci addr: 7f:17.1
[14601] INFO: Number of physical nodes 2
[14601] INFO: Number of virtual nodes 1
[14601] INFO: measuring latency: nchains 1, nelems 4096000, elem_sz 64, access_sz 8, from_node_id 0, to_node_id 0
[14601] INFO: measuring latency: latency is 87 ns
[14601] INFO: measuring latency: nchains 1, nelems 4096000, elem_sz 64, access_sz 8, from_node_id 0, to_node_id 1
[14601] INFO: measuring latency: latency is 141 ns
[14601] INFO: Fusing physical nodes 0 1 into virtual node 0
[14601] INFO: Initializing latency model
[14601] DEBUG: setting process local rank for 4 local processes
[14599] ERROR: failed to set process local rank
[14601] ERROR: failed to set process local rank
[14600] DEBUG: overhead cycles: 43698481; immediate overhead 43026143; stall cycles: 0; delay cycles: 0
[14600] DEBUG: injecting delay of 0 cycles (0 usec) - discounted overhead
[14600] DEBUG: thread id [14600] last epoch was 9 usec ago
As mentioned above, the MPI program never runs to completion, and the emulator stops producing DEBUG output.
@Gumi-presentation-by-Dzh,
Can you track which process is still running? Use 'ps aux | grep hello'.
Try a different number of MPI processes, and also remove the "EMUL_LOCAL_PROCESSES" variable for now, as shown below.
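A minimal way to do both, assuming the same hello binary, in the shell that launches the MPI program:

```sh
# remove the variable in the launching shell so the emulator skips rank setting
unset EMUL_LOCAL_PROCESSES
# then vary the process count, e.g. 2 instead of 4
mpirun -np 2 /home/liao/hello
```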
Here is the trace:

root 13866 0.0 1.1 590716 546052 ? Ssl 21:17 0:02 /home/liao/hello
root 14049 0.0 1.1 590716 546048 ? Ssl 21:19 0:02 /home/liao/hello
root 14050 0.0 1.1 590716 546052 ? Ssl 21:19 0:02 /home/liao/hello
root 14051 3.3 1.1 590716 546052 ? Ssl 21:19 3:45 /home/liao/hello
root 14052 0.0 1.1 590716 546052 ? Ssl 21:19 0:02 /home/liao/hello
root 14600 0.0 1.1 590716 546056 ? Ssl 21:28 0:02 /home/liao/hello
root 15320 2.8 1.1 590716 546052 ? Ssl 23:12 0:02 /home/liao/hello
root 15321 2.8 1.1 590716 546052 ? Ssl 23:12 0:02 /home/liao/hello
root 15322 6.0 1.1 590716 546052 ? Ssl 23:12 0:06 /home/liao/hello
root 15323 6.0 1.1 590716 546052 ? Ssl 23:12 0:06 /home/liao/hello
root 15429 0.0 0.0 113048 1432 pts/0 S+ 23:13 0:00 sh run.sh mpirun -np 4 /home/liao/hello
root 15457 0.0 0.0 113048 1444 pts/0 S+ 23:13 0:00 /bin/bash /home/ZHduan/HME-q/scripts/runenv.sh mpirun -np 4 /home/liao/hello
root 15492 8.7 1.1 587488 545692 pts/0 Sl+ 23:13 0:03 mpirun -np 4 /home/liao/hello
root 15500 7.2 1.1 590716 546056 ? Ssl 23:13 0:02 /home/liao/hello
liao 15509 0.0 0.0 112588 948 pts/1 S+ 23:13 0:00 grep --color=auto hello
Here is the top output:
15322 root 20 0 590716 546052 1648 S 3.7 1.1 0:09.55 hello
13863 root 20 0 585948 545992 1240 S 3.3 1.1 3:56.19 hydra_pmi_proxy
14045 root 20 0 585948 545520 1232 S 3.3 1.1 3:47.16 hydra_pmi_proxy
14051 root 20 0 590716 546052 1648 S 3.3 1.1 3:49.19 hello
15323 root 20 0 590716 546052 1648 S 3.3 1.1 0:09.61 hello
15492 root 20 0 587488 545692 1364 S 3.3 1.1 0:07.14 mpirun
15494 root 20 0 585948 546004 1240 S 3.3 1.1 0:06.96 hydra_pmi_proxy
14596 root 20 0 585948 545988 1240 S 3.0 1.1 3:30.57 hydra_pmi_proxy
15317 root 20 0 585948 545616 1232 S 3.0 1.1 0:09.07 hydra_pmi_proxy
You can see the hello test program is still running. And the debug message is:

[15500] DEBUG: thread id [15500] reached max epoch duration (1000 usec)
[15500] DEBUG: interrupting thread [15500]
[15500] DEBUG: Handling interrupt thread [15500] pthread: 0x7f3586b4cc00
[15500] DEBUG: thread id [15500] last epoch was 1050 usec ago
[15500] DEBUG: thread id [15500] reached min epoch duration (1000 usec)
[15500] DEBUG: read stall L2 cycles diff 0; llc_hit 0; cycles diff remote_dram 0; local_dram 0
[15500] DEBUG: overhead cycles: 963331; immediate overhead 12227; stall cycles: 0; delay cycles: 0
[15500] DEBUG: injecting delay of 0 cycles (0 usec) - discounted overhead
[15500] DEBUG: thread id [15500] last epoch was 40 usec ago
[15500] DEBUG: thread id [15500] last epoch was 96 usec ago
[15498] ERROR: failed to set process local rank
Yes, we can run this MPI program without the NVM latency emulator, but we want to test the MPI program on emulated NVM. So is there something we should set when running it under the emulator?
It seems many more processes are running than expected. Can you tell why? Kill all processes related to the MPI program, or restart the system. In the case above, I would expect to see process id '15500' in the 'ps' output.
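One way to kill the leftovers by full command line, assuming the hello binary path from your 'ps' output:

```sh
# kill every process whose command line matches the hello binary
pkill -9 -f /home/liao/hello
```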
It seems MPI_Finalize() hangs. You can debug your program and confirm that MPI_Finalize() is called and never returns. Probably some MPI process has not reached MPI_Finalize yet. You need to verify that all expected processes were started and did not crash. If any crashed, we need the stack trace; see the sketch below.
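One quick way to see where a stuck rank is, assuming gdb is installed on the node (the pid here is the one from your 'ps' output):

```sh
# attach to the hung rank, dump all thread call stacks, then detach
gdb -p 15500 -batch -ex "thread apply all bt"
```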
I restarted the system and ran the MPI program again, but something weird happened.
I tracked the processes:
[ZHduan@node11 ~]$ ps aux | grep hello
root 2743 0.0 0.0 113048 1428 pts/0 S+ 10:24 0:00 sh run.sh mpirun -np 4 /home/liao/hello
root 2771 0.0 0.0 113048 1448 pts/0 S+ 10:24 0:00 /bin/bash /home/ZHduan/HME-q/scripts/runenv.sh mpirun -np 4 /home/liao/hello
root 2868 29.9 1.1 587488 545692 pts/0 Rl+ 10:24 1:18 mpirun -np 4 /home/liao/hello
root 2872 1.1 0.0 0 0 ? Zs 10:25 0:02 [hello]
This time it seems only one hello process is running.
And the top output is:
2870 root 20 0 585948 545996 1240 R 97.3 1.1 3:36.59 hydra_pmi_proxy
2691 root 20 0 148364 5704 4204 R 30.9 0.0 1:08.25 sshd
2868 root 20 0 587488 545692 1368 S 26.9 1.1 1:12.74 mpirun
135 root 20 0 0 0 0 S 9.0 0.0 0:20.98 kworker/0:1
2874 root 20 0 590716 546052 1652 S 8.6 1.1 0:21.17 hello
But Quartz's debug output loops forever, so I interrupted it and got these debug messages:

[2870] INFO: Registering thread tid [2870]
[2870] INFO: Binding thread tid [2870] pthread: 0x7fb26124e7c0 on processor 2
[2870] DEBUG: thread id [2870] last epoch was 197051271 usec ago
[2870] DEBUG: thread id [2870] reached max epoch duration (1000 usec)
[2870] DEBUG: interrupting thread [2870]
[2870] DEBUG: Handling interrupt thread [2870] pthread: 0x7fb26124e7c0
[2870] DEBUG: thread id [2870] last epoch was 197051287 usec ago
[2870] DEBUG: thread id [2870] reached min epoch duration (1000 usec)
[2870] DEBUG: read stall L2 cycles diff 105501; llc_hit 591; cycles diff remote_dram 66; local_dram 228
[2870] DEBUG: overhead cycles: 19188; immediate overhead 19188; stall cycles: 24814; delay cycles: 19895
[2870] DEBUG: injecting delay of 707 cycles (0 usec) - discounted overhead
[2870] DEBUG: thread id [2870] last epoch was 34 usec ago
[2870] DEBUG: thread id [2870] last epoch was 84 usec ago
[2870] DEBUG: thread id [2870] last epoch was 134 usec ago
[2870] DEBUG: thread id [2870] last epoch was 184 usec ago
[2870] DEBUG: thread id [2870] last epoch was 234 usec ago
[2870] DEBUG: thread id [2870] last epoch was 284 usec ago
[2870] DEBUG: thread id [2870] last epoch was 334 usec ago
[2870] DEBUG: thread id [2870] last epoch was 384 usec ago
[2870] DEBUG: thread id [2870] last epoch was 434 usec ago
[2870] DEBUG: thread id [2870] last epoch was 484 usec ago
[2870] DEBUG: thread id [2870] last epoch was 534 usec ago
[2870] DEBUG: thread id [2870] last epoch was 584 usec ago
[2870] DEBUG: thread id [2870] last epoch was 633 usec ago
[2870] DEBUG: thread id [2870] last epoch was 683 usec ago
[2870] DEBUG: thread id [2870] last epoch was 733 usec ago
[2870] DEBUG: thread id [2870] last epoch was 783 usec ago
[2870] DEBUG: thread id [2870] last epoch was 833 usec ago
[2870] DEBUG: thread id [2870] last epoch was 883 usec ago
[2870] DEBUG: thread id [2870] last epoch was 932 usec ago
[2870] DEBUG: thread id [2870] last epoch was 982 usec ago
[2870] DEBUG: thread id [2870] last epoch was 1043 usec ago
[2870] DEBUG: thread id [2870] reached max epoch duration (1000 usec)
[2870] DEBUG: interrupting thread [2870]
[2870] DEBUG: Unsetting process local rank
[mpiexec@node11] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
[mpiexec@node11] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec@node11] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:217): launcher returned error waiting for completion
[mpiexec@node11] main (./ui/mpich/mpiexec.c:331): process manager error waiting for completion
[2868] DEBUG: Unsetting process local rank
[2868] ERROR: failed to unset process local rank
Here is the message I got from dmesg:

[ 197.781592] nvmemul: module verification failed: signature and/or required key missing - tainting kernel
[ 197.781966] nvmemul: Loading. Initializing...
[ 197.781972] nvmemul: major is 246
[ 200.663357] nvmemul: ioctl command: 0x8008aa00
[ 200.663422] nvmemul: setcounter counter_id: 0x0 event_id=0x55305a3
[ 200.663430] nvmemul: ioctl command: 0x8008aa00
[ 200.663450] nvmemul: setcounter counter_id: 0x1 event_id=0x5308d2
[ 200.663454] nvmemul: ioctl command: 0x8008aa00
[ 200.663482] nvmemul: setcounter counter_id: 0x2 event_id=0x530cd3
[ 200.663486] nvmemul: ioctl command: 0x8008aa00
[ 200.663506] nvmemul: setcounter counter_id: 0x3 event_id=0x5303d3
[ 202.961989] nvmemul: ioctl command: 0x8008aa00
[ 202.962049] nvmemul: setcounter counter_id: 0x0 event_id=0x55305a3
[ 202.962057] nvmemul: ioctl command: 0x8008aa00
[ 202.962076] nvmemul: setcounter counter_id: 0x1 event_id=0x5308d2
[ 202.962081] nvmemul: ioctl command: 0x8008aa00
[ 202.962100] nvmemul: setcounter counter_id: 0x2 event_id=0x530cd3
[ 202.962105] nvmemul: ioctl command: 0x8008aa00
[ 202.962124] nvmemul: setcounter counter_id: 0x3 event_id=0x5303d3
[ 204.826528] SELinux: initialized (dev tmpfs, type tmpfs), uses transition SIDs
[ 206.158277] nvmemul: ioctl command: 0x8008aa00
[ 206.158339] nvmemul: setcounter counter_id: 0x0 event_id=0x55305a3
[ 206.158347] nvmemul: ioctl command: 0x8008aa00
[ 206.158369] nvmemul: setcounter counter_id: 0x1 event_id=0x5308d2
[ 206.158374] nvmemul: ioctl command: 0x8008aa00
[ 206.158396] nvmemul: setcounter counter_id: 0x2 event_id=0x530cd3
[ 206.158400] nvmemul: ioctl command: 0x8008aa00
[ 206.158422] nvmemul: setcounter counter_id: 0x3 event_id=0x5303d3
[ 206.180462] nvmemul: ioctl command: 0x8008aa00
[ 206.180514] nvmemul: setcounter counter_id: 0x0 event_id=0x55305a3
[ 206.180539] nvmemul: ioctl command: 0x8008aa00
[ 206.180561] nvmemul: setcounter counter_id: 0x1 event_id=0x5308d2
[ 206.180604] nvmemul: ioctl command: 0x8008aa00
[ 206.180624] nvmemul: setcounter counter_id: 0x2 event_id=0x530cd3
[ 206.180644] nvmemul: ioctl command: 0x8008aa00
[ 206.180663] nvmemul: setcounter counter_id: 0x3 event_id=0x5303d3
[ 206.182356] nvmemul: ioctl command: 0x8008aa00
[ 206.182381] nvmemul: setcounter counter_id: 0x0 event_id=0x55305a3
[ 206.182388] nvmemul: ioctl command: 0x8008aa00
[ 206.182408] nvmemul: setcounter counter_id: 0x1 event_id=0x5308d2
[ 206.182413] nvmemul: ioctl command: 0x8008aa00
[ 206.182432] nvmemul: setcounter counter_id: 0x2 event_id=0x530cd3
[ 206.182436] nvmemul: ioctl command: 0x8008aa00
[ 206.182456] nvmemul: setcounter counter_id: 0x3 event_id=0x5303d3
[ 206.185735] nvmemul: ioctl command: 0x8008aa00
[ 206.185759] nvmemul: setcounter counter_id: 0x0 event_id=0x55305a3
[ 206.185766] nvmemul: ioctl command: 0x8008aa00
[ 206.185785] nvmemul: setcounter counter_id: 0x1 event_id=0x5308d2
[ 206.185789] nvmemul: ioctl command: 0x8008aa00
[ 206.185808] nvmemul: setcounter counter_id: 0x2 event_id=0x530cd3
[ 206.185812] nvmemul: ioctl command: 0x8008aa00
[ 206.185830] nvmemul: setcounter counter_id: 0x3 event_id=0x5303d3
So what happens when I run an MPI program under Quartz?
The command I use is:

sh run.sh mpirun -np 4 /home/liao/hello
The run.sh is:

```sh
_INPUT_COMMAN=$@
_LOAD_PATH="/home/ZHduan/HME-q/scripts/setupdev.sh"
_RUN_PATH="/home/ZHduan/HME-q/scripts/runenv.sh"
_ROOT="sudo"
_LOAD="load"
_RELOAD="reload"

echo ${_INPUT_COMMAN}

Init() {
    ${_ROOT} ${_LOAD_PATH} ${_RELOAD}
    ${_ROOT} ${_LOAD_PATH} ${_LOAD}
}

Start() {
    ${_RUN_PATH} ${_INPUT_COMMAN}
}

Init
Start
```
The hello test is:
```c
#include <stdio.h>
#include <mpi.h>

int main (int argc, char **argv)
{
    int myid, numprocs;
    int namelen;
    char processor_name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init (&argc, &argv);
    MPI_Comm_rank (MPI_COMM_WORLD, &myid);
    MPI_Comm_size (MPI_COMM_WORLD, &numprocs);
    MPI_Get_processor_name (processor_name, &namelen);
    fprintf (stderr, "Hello World! Process %d of %d on %s\n",
             myid, numprocs, processor_name);
    MPI_Finalize ();
    return 0;
}
```
The dmesg output seems correct; I can see 6 processes attempting to enable the required performance counters.
In the 'nvmemul.ini' file, do you have 'bandwidth.enable' set to false? Please make sure bandwidth modeling is disabled.
When you started the MPI program, it spawned 4 processes as expected (pids 2872, 2873, 2874 and 2875). You can verify the process tree by running 'ps axjf'. It will show the main MPI process and its child processes.
Attempt to run the MPI program with just one process and let's see what happens (see the example below). Also attempt to run the MPI program with both bandwidth and latency emulation turned off in the nvmemul.ini.
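For the single-process run, using the same run.sh wrapper from this thread:

```sh
sh run.sh mpirun -np 1 /home/liao/hello
```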
When you run the MPI program with the NVM emulator, some process hangs. If this process doesn't return, the main MPI process cannot finish. We need to understand which process hangs and why; that will give us the direction to follow.
The files "/tmp/emul_lock_file" and "/tmp/emul_process_local_rank" should be automatically deleted when the program terminates unless some process is hung or crashes. Delete these files and do not set the variable EMUL_LOCAL_PROCESSES for now.
The NVM emulator turns CPU counters on and sets thread CPU affinity. The threads are interrupted and delay is injected. I can see the interruptions are handled properly. I can't yet see why the emulator would cause processes to hang indefinitely.
Also, please run any other type of application under the NVM emulator; see the example below. Let's see if the problem is only related to MPI.
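Any non-MPI program exercises the same emulator path; for example, a plain system utility under the same wrapper:

```sh
# run a simple non-MPI command under the emulator
sh run.sh /bin/ls -l /tmp
```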
That's a lot of advice, so let us go through it below:
In the 'nvmemul.ini' file, do you have 'bandwidth.enable' set to false? Yes, we used this configuration to get the messages above:
latency:
{
enable = true;
inject_delay = true;
read = 200 ;
write = 700;
max_epoch_duration_us = 1000 ;
min_epoch_duration_us = 1000 ;
calibration = false;
};
bandwidth:
{
enable = false;
model = "/tmp/bandwidth_model";
read = 500;
write = 500;
};
topology:
{
mc_pci = "/tmp/mc_pci_bus";
physical_nodes = "0,1";
hyperthreading = true; # do not use multiple hardware threads per core
};
statistics:
{
enable = true;
file = "/tmp/statistics";
};
debug:
{
# debugging level
level = 5;
verbose = 0;
# modules set to True produce debugging output
module:
{
all = true;
};
};
So the bandwidth modeling is disabled.
I run "sh run.sh mpirun -np 4 /home/liao/hello" again , and this time I got the end:
[19078] DEBUG: thread id [19078] reached min epoch duration (1000 usec)
[19078] DEBUG: read stall L2 cycles diff 5996; llc_hit 910; cycles diff remote_dram 0; local_dram 0
[19078] DEBUG: overhead cycles: 3972744; immediate overhead 11990; stall cycles: 0; delay cycles: 0
[19078] DEBUG: injecting delay of 0 cycles (0 usec) - discounted overhead
[19078] DEBUG: thread id [19078] last epoch was 40 usec ago
[19078] DEBUG: thread id [19078] last epoch was 96 usec ago
[19078] DEBUG: thread id [19078] last epoch was 152 usec ago
[19078] DEBUG: thread id [19078] last epoch was 207 usec ago
[19078] DEBUG: thread id [19078] last epoch was 263 usec ago
[19078] DEBUG: thread id [19078] last epoch was 318 usec ago
[19078] DEBUG: thread id [19078] last epoch was 373 usec ago
[19078] DEBUG: thread id [19078] last epoch was 429 usec ago
[19078] DEBUG: thread id [19078] last epoch was 484 usec ago
[19078] DEBUG: thread id [19078] last epoch was 540 usec ago
[19078] DEBUG: thread id [19078] last epoch was 595 usec ago
[19078] DEBUG: thread id [19078] last epoch was 651 usec ago
[19078] DEBUG: thread id [19078] last epoch was 706 usec ago
[19078] DEBUG: thread id [19078] last epoch was 762 usec ago
[19078] DEBUG: thread id [19078] last epoch was 817 usec ago
[19078] DEBUG: thread id [19078] last epoch was 873 usec ago
[19078] DEBUG: thread id [19078] last epoch was 928 usec ago
[19078] DEBUG: thread id [19078] last epoch was 984 usec ago
[19078] DEBUG: thread id [19078] last epoch was 1039 usec ago
[19079] DEBUG: thread id [19079] last epoch was 24 usec ago
[19079] DEBUG: thread id [19079] last epoch was 80 usec ago
[19079] DEBUG: thread id [19079] last epoch was 86 usec ago
[19079] DEBUG: thread id [19079] last epoch was 141 usec ago
[19079] DEBUG: thread id [19079] last epoch was 197 usec ago
[19079] DEBUG: thread id [19079] last epoch was 253 usec ago
[19079] DEBUG: thread id [19079] last epoch was 311 usec ago
[19079] DEBUG: thread id [19079] last epoch was 367 usec ago
[19079] INFO: Registering thread tid [19079]
[19079] INFO: Binding thread tid [19079] pthread: 0x7f26d794ec00 on processor 1
[19079] DEBUG: thread id [19079] last epoch was 110088716323 usec ago
[19079] DEBUG: thread id [19079] reached min epoch duration (1000 usec)
[19079] DEBUG: thread id [19079] last epoch was 110088716331 usec ago
[19079] DEBUG: thread id [19079] reached min epoch duration (1000 usec)
[19079] DEBUG: read stall L2 cycles diff 2818664; llc_hit 86660; cycles diff remote_dram 4098; local_dram 7960
[19079] DEBUG: overhead cycles: 7461678; immediate overhead 14855; stall cycles: 623736; delay cycles: 273725
[19079] DEBUG: injecting delay of 0 cycles (0 usec) - discounted overhead
[19079] DEBUG: thread id [19079] last epoch was 6 usec ago
[19079] DEBUG: thread id [19079] last epoch was 6 usec ago
hello-in init done
Hello World! Process 0 of 4 on node11
Fatal error in MPI_Finalize: Other MPI error, error stack:
MPI_Finalize(293).....................: MPI_Finalize failed
MPI_Finalize(213).....................:
MPID_Finalize(118)....................:
MPIDI_CH3U_VC_WaitForClose(382).......: an error occurred while the device was waiting for all open connections to close
MPIDI_CH3I_Progress(309)..............:
MPIDI_CH3U_Check_for_failed_procs(470): PMI_KVS_Get failed
hello-in init done
Hello World! Process 1 of 4 on node11
Fatal error in MPI_Finalize: Other MPI error, error stack:
MPI_Finalize(293).....................: MPI_Finalize failed
MPI_Finalize(213).....................:
MPID_Finalize(118)....................:
MPIDI_CH3U_VC_WaitForClose(382).......: an error occurred while the device was waiting for all open connections to close
MPIDI_CH3I_Progress(309)..............:
MPIDI_CH3U_Check_for_failed_procs(470): PMI_KVS_Get failed
hello-in init done
Hello World! Process 2 of 4 on node11
Fatal error in MPI_Finalize: Other MPI error, error stack:
MPI_Finalize(293).....................: MPI_Finalize failed
MPI_Finalize(213).....................:
MPID_Finalize(118)....................:
MPIDI_CH3U_VC_WaitForClose(382).......: an error occurred while the device was waiting for all open connections to close
MPIDI_CH3I_Progress(309)..............:
MPIDI_CH3U_Check_for_failed_procs(470): PMI_KVS_Get failed
hello-in init done
Hello World! Process 3 of 4 on node11
Fatal error in MPI_Finalize: Other MPI error, error stack:
MPI_Finalize(293).....................: MPI_Finalize failed
MPI_Finalize(213).....................:
MPID_Finalize(118)....................:
MPIDI_CH3U_VC_WaitForClose(382).......: an error occurred while the device was waiting for all open connections to close
MPIDI_CH3I_Progress(309)..............:
MPIDI_CH3U_Check_for_failed_procs(470): PMI_KVS_Get failed
0 usec) - discounted overhead
[19077] DEBUG: thread id [19077] last epoch was 39 usec ago
[19077] INFO: Registering thread tid [19077]
[19077] INFO: Binding thread tid [19077] pthread: 0x7f66c3434c00 on processor 1
[19077] DEBUG: thread id [19077] last epoch was 110088714187 usec ago
[19077] DEBUG: thread id [19077] reached min epoch duration (1000 usec)
[19077] DEBUG: thread id [19077] last epoch was 110088714189 usec ago
[19077] DEBUG: thread id [19077] reached max epoch duration (1000 usec)
[19077] DEBUG: interrupting thread [19077]
[19077] DEBUG: thread id [19077] last epoch was 110088714196 usec ago
[19077] DEBUG: thread id [19077] reached min epoch duration (1000 usec)
[19077] DEBUG: read stall L2 cycles diff 1021637; llc_hit 68414; cycles diff remote_dram 32; local_dram 2506
[19077] DEBUG: overhead cycles: 10560146; immediate overhead 57889; stall cycles: 4240; delay cycles: 1817
[19077] DEBUG: injecting delay of 0 cycles (0 usec) - discounted overhead
[19077] DEBUG: Handling interrupt thread [19077] pthread: 0x7f66c3434c00
[19077] DEBUG: thread id [19077] last epoch was 6 usec ago
[19077] DEBUG: thread id [19077] last epoch was 17 usec ago
[19077] DEBUG: thread id [19077] last epoch was 25 usec ago
[19078] DEBUG: thread id [19078] reached max epoch duration (1000 usec)
[19078] DEBUG: interrupting thread [19078]
[19080] DEBUG: thread id [19080] last epoch was 59339 usec ago
[19080] DEBUG: thread id [19080] reached max epoch duration (1000 usec)
[19080] DEBUG: interrupting thread [19080]
[19080] DEBUG: Handling interrupt thread [19080] pthread: 0x7fac9237ec00
[19080] DEBUG: thread id [19080] last epoch was 59517 usec ago
[19080] DEBUG: thread id [19080] reached min epoch duration (1000 usec)
[19080] DEBUG: read stall L2 cycles diff 865884; llc_hit 51120; cycles diff remote_dram 0; local_dram 3948
[19080] DEBUG: overhead cycles: 8390972; immediate overhead 14748; stall cycles: 0; delay cycles: 0
[19080] DEBUG: injecting delay of 0 cycles (0 usec) - discounted overhead
[19080] DEBUG: thread id [19080] last epoch was 34 usec ago
[19080] INFO: Registering thread tid [19080]
[19080] INFO: Binding thread tid [19080] pthread: 0x7fac9237ec00 on processor 1
[19080] DEBUG: thread id [19080] last epoch was 110088714354 usec ago
[19080] DEBUG: thread id [19080] reached min epoch duration (1000 usec)
[19080] DEBUG: thread id [19080] last epoch was 110088714357 usec ago
[19080] DEBUG: thread id [19080] last epoch was 110088714364 usec ago
[19080] DEBUG: thread id [19080] reached min epoch duration (1000 usec)
[19080] DEBUG: thread id [19080] reached max epoch duration (1000 usec)
[19080] DEBUG: read stall L2 cycles diff 1144928; llc_hit 69135; cycles diff remote_dram 253; local_dram 2994
[19080] DEBUG: interrupting thread [19080]
[19080] DEBUG: overhead cycles: 8411362; immediate overhead 20390; stall cycles: 33569; delay cycles: 14386
[19080] DEBUG: injecting delay of 0 cycles (0 usec) - discounted overhead
[19080] DEBUG: Handling interrupt thread [19080] pthread: 0x7fac9237ec00
[19080] DEBUG: thread id [19080] last epoch was 4 usec ago
[19080] DEBUG: thread id [19080] last epoch was 14 usec ago
You can see there is some MPI output in the middle of the debug messages; we can clearly see that MPI_Finalize failed: "MPI_Finalize(293).....................: MPI_Finalize failed".
This time we did not set EMUL_LOCAL_PROCESSES to 4, and this is what we got.
Then we tried setting EMUL_LOCAL_PROCESSES to 4 and ran the program again. This time the program hung.
The debug message is:
```
[19218] DEBUG: thread id [19218] last epoch was 457 usec ago
[19219] DEBUG: [19219] INFO: Detected CPU model 'Haswell Xeon'
[19219] DEBUG: Possible NUMA nodes are 1024
[19219] DEBUG: NUMA nodes allowed are 1024
[19219] DEBUG: NUMA configured CPUs are 20
[19219] INFO: 10 CPUs on physical node 0
[19219] INFO: 10 CPUs on physical node 1
[19219] INFO: Loading memory-controller pci topology from /tmp/mc_pci_bus
[19219] INFO: node: 1, pci addr: 7f:14.0
[19219] INFO: node: 1, pci bus: 0x7f
[19219] INFO: node: 1, pci addr: 7f:14.1
[19219] INFO: node: 1, pci addr: 7f:17.0
[19219] INFO: node: 1, pci addr: 7f:17.1
[19219] INFO: Number of physical nodes 2
[19219] INFO: Number of virtual nodes 1
[19219] INFO: measuring latency: nchains 1, nelems 4096000, elem_sz 64, access_sz 8, from_node_id 0, to_node_id 0
[19219] INFO: measuring latency: latency is 87 ns
[19219] INFO: measuring latency: nchains 1, nelems 4096000, elem_sz 64, access_sz 8, from_node_id 0, to_node_id 1
[19219] INFO: measuring latency: latency is 141 ns
[19219] INFO: Fusing physical nodes 0 1 into virtual node 0
[19219] INFO: Initializing latency model
[19219] DEBUG: setting process local rank for 4 local processes
[19220] DEBUG: [19220] INFO: Detected CPU model 'Haswell Xeon'
[19220] DEBUG: Possible NUMA nodes are 1024
[19220] DEBUG: NUMA nodes allowed are 1024
[19220] DEBUG: NUMA configured CPUs are 20
[19220] INFO: 10 CPUs on physical node 0
[19220] INFO: 10 CPUs on physical node 1
[19220] INFO: Loading memory-controller pci topology from /tmp/mc_pci_bus
[19220] INFO: node: 1, pci addr: 7f:14.0
[19220] INFO: node: 1, pci bus: 0x7f
[19220] INFO: node: 1, pci addr: 7f:14.1
[19220] INFO: node: 1, pci addr: 7f:17.0
[19220] INFO: node: 1, pci addr: 7f:17.1
[19220] INFO: Number of physical nodes 2
[19220] INFO: Number of virtual nodes 1
[19220] INFO: measuring latency: nchains 1, nelems 4096000, elem_sz 64, access_sz 8, from_node_id 0, to_node_id 0
[19220] INFO: measuring latency: latency is 89 ns
[19220] INFO: measuring latency: nchains 1, nelems 4096000, elem_sz 64, access_sz 8, from_node_id 0, to_node_id 1
[19220] INFO: measuring latency: latency is 140 ns
[19220] INFO: Fusing physical nodes 0 1 into virtual node 0
[19220] INFO: Initializing latency model
[19220] DEBUG: setting process local rank for 4 local processes
[19219] ERROR: failed to set process local rank
[19220] ERROR: failed to set process local rank
[19218] DEBUG: thread id [19218] last epoch was 19690 usec ago
[19218] DEBUG: thread id [19218] reached max epoch duration (1000 usec)
[19218] DEBUG: interrupting thread [19218]
[19218] DEBUG: Handling interrupt thread [19218] pthread: 0x7fd70dc90c00
[19218] DEBUG: thread id [19218] last epoch was 19708 usec ago
[19218] DEBUG: thread id [19218] reached min epoch duration (1000 usec)
[19218] DEBUG: read stall L2 cycles diff 0; llc_hit 0; cycles diff remote_dram 0; local_dram 0
[19218] DEBUG: overhead cycles: 586877; immediate overhead 17048; stall cycles: 0; delay cycles: 0
[19218] DEBUG: injecting delay of 0 cycles (0 usec) - discounted overhead
[19218] DEBUG: thread id [19218] last epoch was 33 usec ago
[19218] DEBUG: thread id [19218] last epoch was 89 usec ago
[19218] DEBUG: thread id [19218] last epoch was 144 usec ago
[19218] DEBUG: thread id [19218] last epoch was 200 usec ago
[19218] DEBUG: thread id [19218] last epoch was 256 usec ago
[19218] DEBUG: thread id [19218] last epoch was 311 usec ago
[19218] DEBUG: thread id [19218] last epoch was 366 usec ago
[19218] DEBUG: thread id [19218] last epoch was 422 usec ago
[19218] DEBUG: thread id [19218] last epoch was 477 usec ago
[19218] DEBUG: thread id [19218] last epoch was 533 usec ago
```

Then we ran 'ps axjf' and got this:
1 1311 1311 1311 ? -1 Ss 0 0:00 /usr/sbin/sshd -D
1311 17056 17056 17056 ? -1 Ss 0 0:00 \_ sshd: ZHduan [priv]
17056 17060 17056 17056 ? -1 S 1001 0:00 | \_ sshd: ZHduan@pts/0
17060 17061 17061 17061 pts/0 19149 Ss 1001 0:00 | \_ -bash
17061 17145 17145 17061 pts/0 19149 S 0 0:00 | \_ su root
17145 17151 17151 17061 pts/0 19149 S 0 0:00 | \_ bash
17151 19149 19149 17061 pts/0 19149 S+ 0 0:00 | \_ sh run.sh mpirun -np 4 /home/liao/hello
19149 19177 19149 17061 pts/0 19149 S+ 0 0:00 | \_ /bin/bash /home/ZHduan/HME-q/scripts/runenv.sh mpi
19177 19212 19149 17061 pts/0 19149 Sl+ 0 0:03 | \_ mpirun -np 4 /home/liao/hello
19212 19214 19214 19214 ? -1 Ssl 0 0:03 | \_ /usr/lib64/mpich/bin/hydra_pmi_proxy --con
19214 19218 19218 19218 ? -1 Ssl 0 0:02 | \_ /home/liao/hello
1311 19225 19225 19225 ? -1 Ss 0 0:00 \_ sshd: ZHduan [priv]
19225 19229 19225 19225 ? -1 S 1001 0:00 \_ sshd: ZHduan@pts/1
19229 19230 19230 19230 pts/1 19325 Ss 1001 0:00 \_ -bash
19230 19283 19283 19230 pts/1 19325 S 0 0:00 \_ su root
19283 19289 19289 19230 pts/1 19325 S 0 0:00 \_ bash
19289 19325 19325 19230 pts/1 19325 R+ 0 0:00 \_ ps axjf
1 1312 1312 1312 ? -1 Ssl 0 0:00 /usr/sbin/rsyslogd -n
1 1326 1326 1326 ? -1 Ss 0 0:00 /usr/sbin/atd -f
1 1330 1330 1330 ? -1 Ss 0 0:01 /usr/sbin/crond -n
1 2022 2022 2022 ? -1 Ss 0 0:00 /usr/libexec/postfix/master -w
2022 2045 2022 2022 ? -1 S 89 0:00 \_ qmgr -l -t unix -u
2022 18841 2022 2022 ? -1 S 89 0:00 \_ pickup -l -t unix -u
1 2490 2490 2490 ? -1 Ss 0 0:01 login -- liao
2490 2572 2572 2572 tty1 2643 Ss 1000 0:00 \_ -bash
2572 2637 2637 2572 tty1 2643 S 0 0:00 \_ su
2637 2643 2643 2572 tty1 2643 S+ 0 0:00 \_ bash
1 19252 1005 1005 ? -1 Sl 0 0:00 /usr/sbin/abrt-dbus -t133
1 19284 19284 19284 ? -1 Ssl 0 0:00 /usr/libexec/fprintd
The dmesg output looks the same as what we mentioned above. The Quartz debug output was:
[19751] DEBUG: Handling interrupt thread [19751] pthread: 0x7f82c483c7c0
[19751] DEBUG: thread id [19751] last epoch was 110967543308 usec ago
[19751] DEBUG: thread id [19751] reached min epoch duration (1000 usec)
[19751] DEBUG: read stall L2 cycles diff 123392; llc_hit 714; cycles diff remote_dram 83; local_dram 255
[19751] DEBUG: overhead cycles: 31541; immediate overhead 31541; stall cycles: 31359; delay cycles: 20906
[19751] DEBUG: injecting delay of 0 cycles (0 usec) - discounted overhead
[19751] DEBUG: thread id [19751] last epoch was 24 usec ago
[19751] DEBUG: thread id [19751] last epoch was 81 usec ago
[19751] DEBUG: thread id [19751] last epoch was 138 usec ago
[19751] DEBUG: thread id [19751] last epoch was 195 usec ago
[19751] DEBUG: thread id [19751] last epoch was 252 usec ago
[19751] DEBUG: thread id [19751] last epoch was 309 usec ago
[19751] DEBUG: thread id [19751] last epoch was 365 usec ago
[19751] DEBUG: thread id [19751] last epoch was 421 usec ago
[19751] DEBUG: thread id [19751] last epoch was 478 usec ago
[19751] DEBUG: thread id [19751] last epoch was 535 usec ago
[19751] DEBUG: thread id [19751] last epoch was 591 usec ago
[19751] DEBUG: thread id [19751] last epoch was 648 usec ago
[19751] DEBUG: thread id [19751] last epoch was 705 usec ago
[19751] DEBUG: thread id [19751] last epoch was 761 usec ago
[19751] DEBUG: thread id [19751] last epoch was 818 usec ago
[19751] DEBUG: thread id [19751] last epoch was 874 usec ago
[19751] DEBUG: thread id [19751] last epoch was 930 usec ago
[19751] DEBUG: thread id [19751] last epoch was 987 usec ago
[19751] DEBUG: thread id [19751] last epoch was 1043 usec ago
[19751] DEBUG: thread id [19751] reached max epoch duration (1000 usec)
[19751] DEBUG: interrupting thread [19751]
[19753] WARNING: No complete memory-controller pci topology found in /tmp/mc_pci_bus
[19753] WARNING: No EMUL_LOCAL_PROCESSES variable set, skipping rank setting
And the program hung. So we set EMUL_LOCAL_PROCESSES to 1, and this time it works. The debug message is:
Turbo Boost disabled for all CPUs
/home/ZHduan/HME-q/scripts/../build/src/lib/libnvmemul.so
/home/ZHduan/HME-q/scripts/../nvmemul.ini
[19940] DEBUG: [19940] INFO: Detected CPU model 'Haswell Xeon'
[19940] DEBUG: Possible NUMA nodes are 1024
[19940] DEBUG: NUMA nodes allowed are 1024
[19940] DEBUG: NUMA configured CPUs are 20
[19940] INFO: 10 CPUs on physical node 0
[19940] INFO: 10 CPUs on physical node 1
[19940] INFO: Loading memory-controller pci topology from /tmp/mc_pci_bus
[19940] INFO: node: 1, pci addr: 7f:14.0
[19940] INFO: node: 1, pci bus: 0x7f
[19940] INFO: node: 1, pci addr: 7f:14.1
[19940] INFO: node: 1, pci addr: 7f:17.0
[19940] INFO: node: 1, pci addr: 7f:17.1
[19940] WARNING: No complete memory-controller pci topology found in /tmp/mc_pci_bus
[19940] INFO: Number of physical nodes 2
[19940] INFO: Number of virtual nodes 1
[19940] INFO: measuring latency: nchains 1, nelems 4096000, elem_sz 64, access_sz 8, from_node_id 0, to_node_id 0
[19940] INFO: measuring latency: latency is 81 ns
[19940] INFO: measuring latency: nchains 1, nelems 4096000, elem_sz 64, access_sz 8, from_node_id 0, to_node_id 1
[19940] INFO: measuring latency: latency is 122 ns
[19940] INFO: Fusing physical nodes 0 1 into virtual node 0
[19940] INFO: Initializing latency model
[19940] WARNING: EMUL_PROCESSES_PER_SYSTEM value is 1, skipping rank setting
[19940] INFO: Registering thread tid [19940]
[19940] INFO: Binding thread tid [19940] pthread: 0x7fa3c9c6c7c0 on processor 0
[19940] DEBUG: thread id [19940] last epoch was 111177225411 usec ago
[19940] DEBUG: thread id [19940] reached max epoch duration (1000 usec)
[19940] DEBUG: interrupting thread [19940]
[19940] DEBUG: Handling interrupt thread [19940] pthread: 0x7fa3c9c6c7c0
[19940] DEBUG: thread id [19940] last epoch was 111177225471 usec ago
[19940] DEBUG: thread id [19940] reached min epoch duration (1000 usec)
[19940] DEBUG: read stall L2 cycles diff 129025; llc_hit 1025; cycles diff remote_dram 61; local_dram 237
[19940] DEBUG: overhead cycles: 30621; immediate overhead 30621; stall cycles: 24169; delay cycles: 15452
[19940] DEBUG: injecting delay of 0 cycles (0 usec) - discounted overhead
[19940] DEBUG: thread id [19940] last epoch was 27 usec ago
[19940] DEBUG: thread id [19940] last epoch was 84 usec ago
[19940] DEBUG: thread id [19940] last epoch was 140 usec ago
[19940] DEBUG: thread id [19940] last epoch was 197 usec ago
[19940] DEBUG: thread id [19940] last epoch was 253 usec ago
[19940] DEBUG: thread id [19940] last epoch was 309 usec ago
[19940] DEBUG: thread id [19940] last epoch was 365 usec ago
[19940] DEBUG: thread id [19940] last epoch was 421 usec ago
[19940] DEBUG: thread id [19940] last epoch was 478 usec ago
[19940] DEBUG: thread id [19940] last epoch was 537 usec ago
[19940] DEBUG: thread id [19940] last epoch was 593 usec ago
[19940] DEBUG: thread id [19940] last epoch was 712 usec ago
[19940] DEBUG: thread id [19940] last epoch was 770 usec ago
[19940] DEBUG: thread id [19940] last epoch was 827 usec ago
[19940] DEBUG: thread id [19940] last epoch was 883 usec ago
[19940] DEBUG: thread id [19940] last epoch was 939 usec ago
[19940] DEBUG: thread id [19940] last epoch was 995 usec ago
[19940] DEBUG: thread id [19940] last epoch was 1052 usec ago
[19940] DEBUG: thread id [19940] reached max epoch duration (1000 usec)
[19940] DEBUG: interrupting thread [19940]
[19942] WARNING: No complete memory-controller pci topology found in /tmp/mc_pci_bus
[19942] WARNING: EMUL_PROCESSES_PER_SYSTEM value is 1, skipping rank setting
[19945] WARNING: No complete memory-controller pci topology found in /tmp/mc_pci_bus
[19945] WARNING: EMUL_PROCESSES_PER_SYSTEM value is 1, skipping rank setting
hello-in process
[19945] DEBUG: [19945] INFO: Detected CPU model 'Haswell Xeon'
[19945] DEBUG: Possible NUMA nodes are 1024
[19945] DEBUG: NUMA nodes allowed are 1024
[19945] DEBUG: NUMA configured CPUs are 20
[19945] INFO: 10 CPUs on physical node 0
[19945] INFO: 10 CPUs on physical node 1
[19945] INFO: Loading memory-controller pci topology from /tmp/mc_pci_bus
[19945] INFO: node: 1, pci addr: 7f:14.0
[19945] INFO: node: 1, pci bus: 0x7f
[19945] INFO: node: 1, pci addr: 7f:14.1
[19945] INFO: node: 1, pci addr: 7f:17.0
[19945] INFO: node: 1, pci addr: 7f:17.1
[19945] INFO: Number of physical nodes 2
[19945] INFO: Number of virtual nodes 1
[19945] INFO: measuring latency: nchains 1, nelems 4096000, elem_sz 64, access_sz 8, from_node_id 0, to_node_id 0
[19945] INFO: measuring latency: latency is 81 ns
[19945] INFO: measuring latency: nchains 1, nelems 4096000, elem_sz 64, access_sz 8, from_node_id 0, to_node_id 1
[19945] INFO: measuring latency: latency is 123 ns
[19945] INFO: Fusing physical nodes 0 1 into virtual node 0
[19945] INFO: Initializing latency model
[19945] INFO: Registering thread tid [19945]
[19945] INFO: Binding thread tid [19945] pthread: 0x7f0efa5b3c00 on processor 0
[19945] DEBUG: thread id [19945] last epoch was 111181988436 usec ago
[19945] DEBUG: thread id [19945] reached max epoch duration (1000 usec)
[19945] DEBUG: interrupting thread [19945]
[19945] DEBUG: Handling interrupt thread [19945] pthread: 0x7f0efa5b3c00
[19945] DEBUG: thread id [19945] last epoch was 111181988476 usec ago
[19945] DEBUG: thread id [19945] reached min epoch duration (1000 usec)
[19945] DEBUG: read stall L2 cycles diff 254791; llc_hit 1436; cycles diff remote_dram 163; local_dram 811
[19945] DEBUG: overhead cycles: 21212; immediate overhead 21212; stall cycles: 49213; delay cycles: 30808
[19945] DEBUG: injecting delay of 9596 cycles (4 usec) - discounted overhead
[19945] DEBUG: thread id [19945] last epoch was 20 usec ago
[19945] DEBUG: thread id [19945] last epoch was 76 usec ago
[19945] DEBUG: thread id [19945] last epoch was 131 usec ago
[19945] DEBUG: thread id [19945] last epoch was 160 usec ago
[19945] DEBUG: thread id [19945] last epoch was 186 usec ago
[19945] DEBUG: thread id [19945] last epoch was 320 usec ago
[19945] DEBUG: thread id [19945] last epoch was 380 usec ago
[19945] DEBUG: thread id [19945] last epoch was 436 usec ago
[19945] DEBUG: thread id [19945] last epoch was 491 usec ago
[19945] DEBUG: thread id [19945] last epoch was 547 usec ago
[19945] DEBUG: thread id [19945] last epoch was 607 usec ago
[19945] DEBUG: thread id [19945] last epoch was 668 usec ago
[19945] DEBUG: thread id [19945] last epoch was 725 usec ago
[19945] DEBUG: thread id [19945] last epoch was 782 usec ago
[19945] DEBUG: thread id [19945] last epoch was 838 usec ago
[19945] DEBUG: thread id [19945] last epoch was 894 usec ago
[19945] DEBUG: thread id [19945] last epoch was 949 usec ago
[19945] DEBUG: thread id [19945] last epoch was 1005 usec ago
[19945] DEBUG: thread id [19945] reached max epoch duration (1000 usec)
[19945] DEBUG: interrupting thread [19945]
[19945] DEBUG: Handling interrupt thread [19945] pthread: 0x7f0efa5b3c00
[19945] DEBUG: thread id [19945] last epoch was 1079 usec ago
[19945] DEBUG: thread id [19945] reached min epoch duration (1000 usec)
[19945] DEBUG: read stall L2 cycles diff 326444; llc_hit 3851; cycles diff remote_dram 25; local_dram 1304
[19945] DEBUG: overhead cycles: 141910; immediate overhead 141910; stall cycles: 6531; delay cycles: 4088
[19945] DEBUG: injecting delay of 0 cycles (0 usec) - discounted overhead
[19945] DEBUG: thread id [19945] last epoch was 46 usec ago
[19945] DEBUG: thread id [19945] last epoch was 103 usec ago
[19945] DEBUG: thread id [19945] last epoch was 160 usec ago
[19945] DEBUG: thread id [19945] last epoch was 217 usec ago
[19945] DEBUG: thread id [19945] last epoch was 274 usec ago
[19945] DEBUG: thread id [19945] last epoch was 331 usec ago
[19945] DEBUG: thread id [19945] last epoch was 394 usec ago
[19945] DEBUG: thread id [19945] last epoch was 450 usec ago
[19945] DEBUG: thread id [19945] last epoch was 507 usec ago
[19945] DEBUG: thread id [19945] last epoch was 564 usec ago
[19945] DEBUG: thread id [19945] last epoch was 621 usec ago
[19945] DEBUG: thread id [19945] last epoch was 679 usec ago
[19945] DEBUG: thread id [19945] last epoch was 736 usec ago
[19945] DEBUG: thread id [19945] last epoch was 792 usec ago
[19945] DEBUG: thread id [19945] last epoch was 848 usec ago
[19945] DEBUG: thread id [19945] last epoch was 903 usec ago
[19945] DEBUG: thread id [19945] last epoch was 959 usec ago
[19945] DEBUG: thread id [19945] last epoch was 1016 usec ago
[19945] DEBUG: thread id [19945] reached max epoch duration (1000 usec)
[19945] DEBUG: interrupting thread [19945]
[19945] DEBUG: Handling interrupt thread [19945] pthread: 0x7f0efa5b3c00
[19945] DEBUG: thread id [19945] last epoch was 1068 usec ago
[19945] DEBUG: thread id [19945] reached min epoch duration (1000 usec)
[19945] DEBUG: read stall L2 cycles diff 585597; llc_hit 8919; cycles diff remote_dram 478; local_dram 983
[19945] DEBUG: overhead cycles: 277821; immediate overhead 139999; stall cycles: 132865; delay cycles: 83175
[19945] DEBUG: injecting delay of 0 cycles (0 usec) - discounted overhead
[19945] DEBUG: thread id [19945] last epoch was 5 usec ago
[19945] DEBUG: thread id [19945] last epoch was 16 usec ago
hello-in init done
Hello World! Process 0 of 1 on node11
[19945] DEBUG: thread id [19945] last epoch was 85 usec ago
[19945] DEBUG: thread id [19945] last epoch was 142 usec ago
[19945] INFO: Registering thread tid [19945]
[19945] INFO: Binding thread tid [19945] pthread: 0x7f0efa5b3c00 on processor 1
[19945] DEBUG: thread id [19945] last epoch was 111181991204 usec ago
[19945] DEBUG: thread id [19945] reached min epoch duration (1000 usec)
[19945] DEBUG: thread id [19945] last epoch was 111181991206 usec ago
[19945] DEBUG: thread id [19945] last epoch was 111181991215 usec ago
[19945] DEBUG: thread id [19945] reached min epoch duration (1000 usec)
[19945] DEBUG: thread id [19945] reached max epoch duration (1000 usec)
[19945] DEBUG: read stall L2 cycles diff 384127; llc_hit 3951; cycles diff remote_dram 345; local_dram 631
[19945] DEBUG: interrupting thread [19945]
[19945] DEBUG: overhead cycles: 217299; immediate overhead 22653; stall cycles: 110403; delay cycles: 69114
[19945] DEBUG: injecting delay of 0 cycles (0 usec) - discounted overhead
[19945] DEBUG: Handling interrupt thread [19945] pthread: 0x7f0efa5b3c00
[19945] DEBUG: thread id [19945] last epoch was 4 usec ago
[19945] DEBUG: thread id [19945] last epoch was 15 usec ago
[19942] DEBUG: [19942] INFO: Detected CPU model 'Haswell Xeon'
[19942] DEBUG: Possible NUMA nodes are 1024
[19942] DEBUG: NUMA nodes allowed are 1024
[19942] DEBUG: NUMA configured CPUs are 20
[19942] INFO: 10 CPUs on physical node 0
[19942] INFO: 10 CPUs on physical node 1
[19942] INFO: Loading memory-controller pci topology from /tmp/mc_pci_bus
[19942] INFO: node: 1, pci addr: 7f:14.0
[19942] INFO: node: 1, pci bus: 0x7f
[19942] INFO: node: 1, pci addr: 7f:14.1
[19942] INFO: node: 1, pci addr: 7f:17.0
[19942] INFO: node: 1, pci addr: 7f:17.1
[19942] INFO: Number of physical nodes 2
[19942] INFO: Number of virtual nodes 1
[19942] INFO: measuring latency: nchains 1, nelems 4096000, elem_sz 64, access_sz 8, from_node_id 0, to_node_id 0
[19942] INFO: measuring latency: latency is 81 ns
[19942] INFO: measuring latency: nchains 1, nelems 4096000, elem_sz 64, access_sz 8, from_node_id 0, to_node_id 1
[19942] INFO: measuring latency: latency is 133 ns
[19942] INFO: Fusing physical nodes 0 1 into virtual node 0
[19942] INFO: Initializing latency model
[19942] INFO: Registering thread tid [19942]
[19942] INFO: Binding thread tid [19942] pthread: 0x7fc4574de7c0 on processor 0
[19942] DEBUG: thread id [19942] last epoch was 111179658945 usec ago
[19942] DEBUG: thread id [19942] reached max epoch duration (1000 usec)
[19942] DEBUG: interrupting thread [19942]
[19942] DEBUG: Handling interrupt thread [19942] pthread: 0x7fc4574de7c0
[19942] DEBUG: thread id [19942] last epoch was 111179658961 usec ago
[19942] DEBUG: thread id [19942] reached min epoch duration (1000 usec)
[19942] DEBUG: read stall L2 cycles diff 239224; llc_hit 1238; cycles diff remote_dram 146; local_dram 690
[19942] DEBUG: overhead cycles: 18026; immediate overhead 18026; stall cycles: 50912; delay cycles: 25647
[19942] DEBUG: injecting delay of 7621 cycles (3 usec) - discounted overhead
[19942] DEBUG: thread id [19942] last epoch was 37 usec ago
[19942] DEBUG: thread id [19942] last epoch was 92 usec ago
[19942] DEBUG: thread id [19942] last epoch was 147 usec ago
[19942] DEBUG: thread id [19942] last epoch was 205 usec ago
[19942] DEBUG: thread id [19942] last epoch was 229 usec ago
[19942] DEBUG: thread id [19942] last epoch was 287 usec ago
[19942] DEBUG: thread id [19942] last epoch was 341 usec ago
[19942] DEBUG: thread id [19942] last epoch was 396 usec ago
[19942] DEBUG: thread id [19942] last epoch was 450 usec ago
[19942] DEBUG: thread id [19942] last epoch was 504 usec ago
[19942] DEBUG: thread id [19942] last epoch was 559 usec ago
[19942] DEBUG: thread id [19942] last epoch was 613 usec ago
[19942] DEBUG: thread id [19942] last epoch was 668 usec ago
[19942] DEBUG: thread id [19942] last epoch was 722 usec ago
[19942] DEBUG: thread id [19942] last epoch was 776 usec ago
[19942] DEBUG: thread id [19942] last epoch was 831 usec ago
[19942] DEBUG: thread id [19942] last epoch was 885 usec ago
[19942] DEBUG: thread id [19942] last epoch was 939 usec ago
[19942] DEBUG: thread id [19942] last epoch was 994 usec ago
[19942] DEBUG: thread id [19942] last epoch was 1048 usec ago
[19942] DEBUG: thread id [19942] reached max epoch duration (1000 usec)
[19942] DEBUG: interrupting thread [19942]
Then we tried it with 4 processes. It looped again, so I interrupted it:
[20205] DEBUG: thread id [20205] last epoch was 965 usec ago
[20205] DEBUG: thread id [20205] last epoch was 1019 usec ago
[20205] DEBUG: thread id [20205] reached max epoch duration (1000 usec)
[20205] DEBUG: interrupting thread [20205]
[20205] DEBUG: Unsetting process local rank
[mpiexec@node11] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
[mpiexec@node11] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec@node11] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:217): launcher returned error waiting for completion
[mpiexec@node11] main (./ui/mpich/mpiexec.c:331): process manager error waiting for completion
[20203] DEBUG: Unsetting process local rank
[20203] ERROR: failed to unset process local rank
So what is wrong between Quartz and this basic MPI demo?
You can see the helloworld MPI test code above; we think it is just a simple MPI program, so we did not expect such a problem at all. Our intention was to use Quartz to test a program we wrote with MPI, so we first ran it with this demo, and now we have this problem. Is there something wrong with this MPI demo, or did I use Quartz in a wrong way or with a wrong configuration?
If there are other ways to solve this problem, or to find out where it lies, we are willing to try them, because we want to use Quartz to study MPI-related workloads.
I have seen many complaints about this error when using MVAPICH. I would try OpenMPI. You are running mpirun itself under the emulator. Let's change this strategy: compile the MPI program (hello) linking against the NVM emulator library, and do not use the NVM_EMUL/script/runenv.sh script (via your run.sh). You may need to set some environment variables such as NVMEMUL_INI, but do not set LD_PRELOAD. After linking the 'hello' program with the emulator library, execute it with mpirun as any other MPI program; a sketch is below.
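A minimal sketch of that approach, assuming the library and config paths seen earlier in this thread (adjust them to your checkout):

```sh
# link hello directly against the emulator library instead of preloading it
mpicc hello.c -o hello \
  -L/home/ZHduan/HME-q/build/src/lib -lnvmemul \
  -Wl,-rpath,/home/ZHduan/HME-q/build/src/lib

# tell the library where its config lives; note: no LD_PRELOAD, no runenv.sh
export NVMEMUL_INI=/home/ZHduan/HME-q/nvmemul.ini
mpirun -np 4 ./hello
```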
@Gumi-presentation-by-Dzh I have a question: why are you running your application with this command
sh run.sh mpirun -np 4 /home/liao/hello
instead of this:

mpirun -np 4 run.sh /home/liao/hello
Please correct me if I am wrong, but the former will run the emulator on one node and the application on 4 nodes (assuming you give -host), whereas the latter will run the emulator on all 4 nodes along with the application.
@Gumi-presentation-by-Dzh Also, I tried the sample code you posted here with the emulator, and in my case it worked without hanging. But I am facing a similar problem when running my application PowerGraph with the Quartz emulator using mpirun on multiple nodes. Can you please provide any suggestions if you were able to resolve your issue?
@hkundnani Sorry for the late reply.
The command I use is:

sh run.sh mpirun -np 4 /home/liao/hello
The run.sh is:

```sh
_INPUT_COMMAN=$@
_LOAD_PATH="/home/ZHduan/HME-q/scripts/setupdev.sh"
_RUN_PATH="/home/ZHduan/HME-q/scripts/runenv.sh"
_ROOT="sudo"
_LOAD="load"
_RELOAD="reload"

echo ${_INPUT_COMMAN}

Init() {
    ${_ROOT} ${_LOAD_PATH} ${_RELOAD}
    ${_ROOT} ${_LOAD_PATH} ${_LOAD}
}

Start() {
    ${_RUN_PATH} ${_INPUT_COMMAN}
}

Init
Start
```
You can find more detail in the README and in "setupdev.sh". When the emulator is running, every application (or command) is emulated on NVM or hybrid memory. We use

sh run.sh mpirun -np 4 /home/liao/hello

to test the application running with multiple processes. So, as you said, it runs the application on 4 nodes under emulation.
I am skeptical about whether 'mpirun -np 4 run.sh /home/liao/hello' would create 4 Quartz emulators. Based on my own understanding, it does not create 4 emulators running the test program at the same time. Some ps logs would tell us the actual situation.
@Gumi-presentation-by-Dzh Thank you for your reply.
I have attached the log for your sample code using both commands. The log was huge, so I have attached just the starting lines to convey my point.
For sh run.sh mpirun -np 4 -host inv08,inv09,inv11,inv14 /home/liao/hello
I have added hosts because on my machines I ran it this way:
scripts/../build/src/lib/libnvmemul.so
scripts/../nvmemul.ini
[109346] DEBUG: [109346] INFO: Detected CPU model 'Haswell Xeon'
[109346] DEBUG: Possible NUMA nodes are 1024
[109346] DEBUG: NUMA nodes allowed are 1024
[109346] DEBUG: NUMA configured CPUs are 40
[109346] INFO: 19 CPUs on physical node 0
[109346] INFO: 20 CPUs on physical node 1
[109346] INFO: Loading memory-controller pci topology from /tmp/mc_pci_bus
[109346] INFO: node: 0, pci addr: 7f:14.0
[109346] INFO: node: 0, pci bus: 0x7f
[109346] INFO: node: 0, pci addr: 7f:14.1
[109346] INFO: node: 0, pci addr: 7f:17.0
[109346] INFO: node: 0, pci addr: 7f:17.1
[109346] INFO: node: 1, pci addr: ff:14.0
[109346] INFO: node: 1, pci bus: 0xff
[109346] INFO: node: 1, pci addr: ff:14.1
[109346] INFO: node: 1, pci addr: ff:17.0
[109346] INFO: node: 1, pci addr: ff:17.1
[109346] INFO: Number of physical nodes 2
[109346] INFO: Number of virtual nodes 1
[109346] INFO: measuring latency: nchains 1, nelems 4096000, elem_sz 64, access_sz 8, from_node_id 0, to_node_id 0
[109346] INFO: measuring latency: latency is 88 ns
[109346] INFO: measuring latency: nchains 1, nelems 4096000, elem_sz 64, access_sz 8, from_node_id 0, to_node_id 1
[109346] INFO: measuring latency: latency is 128 ns
[109346] INFO: Fusing physical nodes 0 1 into virtual node 0
[109346] INFO: Initializing latency model
[109346] DEBUG: setting process local rank for 10 local processes
[109346] DEBUG: process local rank is 0 on system inv08
[109346] DEBUG: number of cpus is 19
[109346] DEBUG: no partitioning of CPUs, set next CPU to vnode 0 and cpu 0
[109346] INFO: Registering thread tid [109346]
[109346] INFO: Binding thread tid [109346] pthread: 0x7f7b4e2267c0 on processor 2
[109346] DEBUG: thread id [109346] last epoch was 85 usec ago
[109346] DEBUG: thread id [109346] last epoch was 150 usec ago
[109346] DEBUG: thread id [109346] last epoch was 177 usec ago
[109346] DEBUG: thread id [109346] last epoch was 180 usec ago
------------------------------------------------------------------------------------------------------------------------
[109346] DEBUG: thread id [109346] last epoch was 9872 usec ago
[109346] DEBUG: thread id [109348] last epoch was 7984 usec ago
[109346] DEBUG: thread id [109346] last epoch was 9926 usec ago
[10934[109349] DEBUG: [109349] INFO: Detected CPU model 'Haswell Xeon'
[109349] DEBUG: Possible NUMA nodes are 1024
[109349] DEBUG: NUMA nodes allowed are 1024
[109349] DEBUG: NUMA configured CPUs are 40
[109349] INFO: 19 CPUs on physical node 0
[109349] INFO: 20 CPUs on physical node 1
[109349] INFO: Loading memory-controller pci topology from /tmp/mc_pci_bus
[109349] INFO: node: 0, pci addr: 7f:14.0
[109349] INFO: node: 0, pci bus: 0x7f
[109349] INFO: node: 0, pci addr: 7f:14.1
[109349] INFO: node: 0, pci addr: 7f:17.0
[109349] INFO: node: 0, pci addr: 7f:17.1
[109349] INFO: node: 1, pci addr: ff:14.0
[109349] INFO: node: 1, pci bus: 0xff
[109349] INFO: node: 1, pci addr: ff:14.1
[109349] INFO: node: 1, pci addr: ff:17.0
[109349] INFO: node: 1, pci addr: ff:17.1
[109349] INFO: Number of physical nodes 2
[109349] INFO: Number of virtual nodes 1
[109349] INFO: measuring latency: nchains 1, nelems 4096000, elem_sz 64, access_sz 8, from_node_id 0, to_node_id 0
[109349] INFO: measuring latency: latency is 90 ns
[109349] INFO: measuring latency: nchains 1, nelems 4096000, elem_sz 64, access_sz 8, from_node_id 0, to_node_id 1
[109349] INFO: measuring latency: latency is 133 ns
[109349] INFO: Fusing physical nodes 0 1 into virtual node 0
[109349] INFO: Initializing latency model
[109349] DEBUG: setting process local rank for 10 local processes
[109349] DEBUG: read from file current max rank 1
[109349] DEBUG: write to file new max rank 2
[109349] DEBUG: process local rank is 1 on system inv08
[109349] DEBUG: number of cpus is 19
[109349] DEBUG: no partitioning of CPUs, set next CPU to vnode 0 and cpu 2
[109349] INFO: Registering thread tid [109349]
[109349] INFO: Binding thread tid [109349] pthread: 0x7ff2f6d778c0 on processor 4
[109349] DEBUG: thread id [109349] last epoch was 371 usec ago
[109349] DEBUG: thread id [109349] last epoch was 435 usec ago
[109349] DEBUG: thread id [109349] last epoch was 493 usec ago
-------------------------------------------------------------------------------------------------------------------------
[109346] DEBUG: thread id [109346] last epoch was 9690 usec ago
[109346] DEBUG: thread id [109348] last epoch was 7782 usec ago
[109346] DEBUG: thread id [109346] last epoch was 9744 u[109350] DEBUG: [109350] INFO: Detected CPU model 'Haswell Xeon'
[109350] DEBUG: Possible NUMA nodes are 1024
[109350] DEBUG: NUMA nodes allowed are 1024
[109350] DEBUG: NUMA configured CPUs are 40
[109350] INFO: 19 CPUs on physical node 0
[109350] INFO: 20 CPUs on physical node 1
[109350] INFO: Loading memory-controller pci topology from /tmp/mc_pci_bus
[109350] INFO: node: 0, pci addr: 7f:14.0
[109350] INFO: node: 0, pci bus: 0x7f
[109350] INFO: node: 0, pci addr: 7f:14.1
[109350] INFO: node: 0, pci addr: 7f:17.0
[109350] INFO: node: 0, pci addr: 7f:17.1
[109350] INFO: node: 1, pci addr: ff:14.0
[109350] INFO: node: 1, pci bus: 0xff
[109350] INFO: node: 1, pci addr: ff:14.1
[109350] INFO: node: 1, pci addr: ff:17.0
[109350] INFO: node: 1, pci addr: ff:17.1
[109350] INFO: Number of physical nodes 2
[109350] INFO: Number of virtual nodes 1
[109350] INFO: measuring latency: nchains 1, nelems 4096000, elem_sz 64, access_sz 8, from_node_id 0, to_node_id 0
[109350] INFO: measuring latency: latency is 89 ns
[109350] INFO: measuring latency: nchains 1, nelems 4096000, elem_sz 64, access_sz 8, from_node_id 0, to_node_id 1
[109350] INFO: measuring latency: latency is 135 ns
[109350] INFO: Fusing physical nodes 0 1 into virtual node 0
[109350] INFO: Initializing latency model
[109350] DEBUG: setting process local rank for 10 local processes
[109350] DEBUG: read from file current max rank 2
[109350] DEBUG: write to file new max rank 3
[109350] DEBUG: process local rank is 2 on system inv08
[109350] DEBUG: number of cpus is 19
[109350] DEBUG: no partitioning of CPUs, set next CPU to vnode 0 and cpu 4
[109350] INFO: Registering thread tid [109350]
[109350] INFO: Binding thread tid [109350] pthread: 0x7f9104f048c0 on processor 6
[109350] DEBUG: thread id [109350] last epoch was 65 usec ago
[109350] DEBUG: thread id [109350] last epoch was 125 usec ago
[109350] DEBUG: thread id [109350] last epoch was 182 usec ago
----------------------------------------------------------------------------------------------------------------------------
[109346] DEBUG: thread id [109346] last epoch was 6781 usec ago
[109346] DEBUG: thread id [109346] last epoch was 6782 usec ago
[109353] DEBUG: [109353] INFO: Detected CPU model 'Haswell Xeon'
[109353] DEBUG: Possible NUMA nodes are 1024
[109353] DEBUG: NUMA nodes allowed are 1024
[109353] DEBUG: NUMA configured CPUs are 40
[109353] INFO: 19 CPUs on physical node 0
[109353] INFO: 20 CPUs on physical node 1
[109353] INFO: Loading memory-controller pci topology from /tmp/mc_pci_bus
[109353] INFO: node: 0, pci bus: 0x7f
[109353] INFO: node: 0, pci addr: 7f:14.1
[109353] INFO: node: 0, pci addr: 7f:17.0
[109353] INFO: node: 0, pci addr: 7f:17.1
[109353] INFO: node: 1, pci addr: ff:14.0
[109353] INFO: node: 1, pci bus: 0xff
[109353] INFO: node: 1, pci addr: ff:14.1
[109353] INFO: node: 1, pci addr: ff:17.0
[109353] INFO: node: 1, pci addr: ff:17.1
[109353] INFO: Number of physical nodes 2
[109353] INFO: Number of virtual nodes 1
[109353] INFO: measuring latency: nchains 1, nelems 4096000, elem_sz 64, access_sz 8, from_node_id 0, to_node_id 0
[109346] DEBUG: thread id [109346] last epoch was 7094 usec ago
[109346] DEBUG: thread id [109346] last epoch was 7103 usec ago
[109346] DEBUG: thread id [109346] last epoch was 7103 usec ago
[109346] DEBUG: thread id [109348] last epoch was 7060 usec ago
The output for the above command is shown below, and I think it is the intended output. But if you look at the log, it detects four 'Haswell Xeon' processors and the emulation also happened on each node. I can also see four process ids in the log (109346, 109349, 109350, 109353).
Hello World! Process 0 of 4 on inv08
Hello World! Process 1 of 4 on inv09
Hello World! Process 2 of 4 on inv11
Hello World! Process 3 of 4 on inv14
On the other hand, for the command mpirun -np 4 -host inv08,inv09,inv11,inv14 run.sh /home/liao/hello, the output is:
scripts/../build/src/lib/libnvmemul.so
scripts/../nvmemul.ini
scripts/../build/src/lib/libnvmemul.so
scripts/../nvmemul.ini
scripts/../build/src/lib/libnvmemul.so
scripts/../nvmemul.ini
scripts/../build/src/lib/libnvmemul.so
scripts/../nvmemul.ini
[125020] DEBUG: [125020] INFO: Detected CPU model 'Haswell Xeon'
[125020] DEBUG: Possible NUMA nodes are 1024
[125020] DEBUG: NUMA nodes allowed are 1024
[125020] DEBUG: NUMA configured CPUs are 40
[125020] INFO: 20 CPUs on physical node 0
[125020] INFO: 20 CPUs on physical node 1
[125020] INFO: Loading memory-controller pci topology from /tmp/mc_pci_bus
[125020] INFO: node: 0, pci addr: 7f:14.0
[125020] INFO: node: 0, pci bus: 0x7f
[125020] INFO: node: 0, pci addr: 7f:14.1
[125020] INFO: node: 0, pci addr: 7f:17.0
[125020] INFO: node: 0, pci addr: 7f:17.1
[125020] INFO: Number of physical nodes 2
[125020] INFO: Number of virtual nodes 1
[125020] INFO: measuring latency: nchains 1, nelems 4096000, elem_sz 64, access_sz 8, from_node_id 0, to_node_id 0
[10121] DEBUG: [10121] INFO: Detected CPU model 'Haswell Xeon'
[10121] DEBUG: Possible NUMA nodes are 1024
[10121] DEBUG: NUMA nodes allowed are 1024
[10121] DEBUG: NUMA configured CPUs are 40
[26005] DEBUG: [26005] INFO: Detected CPU model 'Haswell Xeon'
[10121] INFO: 20 CPUs on physical node 0
[26005] DEBUG: Possible NUMA nodes are 1024
[26005] DEBUG: NUMA nodes allowed are 1024
[26005] DEBUG: NUMA configured CPUs are 40
[10121] INFO: 20 CPUs on physical node 1
[26005] INFO: 20 CPUs on physical node 0
[10121] INFO: Loading memory-controller pci topology from /tmp/mc_pci_bus
[26005] INFO: 20 CPUs on physical node 1
[10121] INFO: node: 0, pci addr: 7f:14.0
[10121] INFO: node: 0, pci bus: 0x7f
[10121] INFO: node: 0, pci addr: 7f:14.1
[10121] INFO: node: 0, pci addr: 7f:17.0
[10121] INFO: node: 0, pci addr: 7f:17.1
[10121] INFO: Number of physical nodes 2
[10121] INFO: Number of virtual nodes 1
[26005] INFO: Loading memory-controller pci topology from /tmp/mc_pci_bus
[26005] INFO: Number of physical nodes 2
[26005] INFO: Number of virtual nodes 1
[10121] INFO: measuring latency: nchains 1, nelems 4096000, elem_sz 64, access_sz 8, from_node_id 0, to_node_id 0
[26005] INFO: measuring latency: nchains 1, nelems 4096000, elem_sz 64, access_sz 8, from_node_id 0, to_node_id 0
[109372] DEBUG: [109372] INFO: Detected CPU model 'Haswell Xeon'
[109372] DEBUG: Possible NUMA nodes are 1024
[109372] DEBUG: NUMA nodes allowed are 1024
[109372] DEBUG: NUMA configured CPUs are 40
[109372] INFO: 19 CPUs on physical node 0
[109372] INFO: 20 CPUs on physical node 1
[109372] INFO: Loading memory-controller pci topology from /tmp/mc_pci_bus
[109372] INFO: node: 0, pci addr: 7f:14.0
[109372] INFO: node: 0, pci bus: 0x7f
[109372] INFO: node: 0, pci addr: 7f:14.1
[109372] INFO: node: 0, pci addr: 7f:17.0
[109372] INFO: node: 0, pci addr: 7f:17.1
[109372] INFO: node: 1, pci addr: ff:14.0
[109372] INFO: node: 1, pci bus: 0xff
[109372] INFO: node: 1, pci addr: ff:14.1
[109372] INFO: node: 1, pci addr: ff:17.0
[109372] INFO: node: 1, pci addr: ff:17.1
[109372] INFO: Number of physical nodes 2
[109372] INFO: Number of virtual nodes 1
[109372] INFO: measuring latency: nchains 1, nelems 4096000, elem_sz 64, access_sz 8, from_node_id 0, to_node_id 0
[125020] INFO: measuring latency: latency is 82 ns
[125020] INFO: measuring latency: nchains 1, nelems 4096000, elem_sz 64, access_sz 8, from_node_id 0, to_node_id 1
[26005] INFO: measuring latency: latency is 83 ns
[26005] INFO: measuring latency: nchains 1, nelems 4096000, elem_sz 64, access_sz 8, from_node_id 0, to_node_id 1
[10121] INFO: measuring latency: latency is 83 ns
[10121] INFO: measuring latency: nchains 1, nelems 4096000, elem_sz 64, access_sz 8, from_node_id 0, to_node_id 1
[109372] INFO: measuring latency: latency is 90 ns
[109372] INFO: measuring latency: nchains 1, nelems 4096000, elem_sz 64, access_sz 8, from_node_id 0, to_node_id 1
[125020] INFO: measuring latency: latency is 125 ns
[125020] INFO: Fusing physical nodes 0 1 into virtual node 0
[125020] INFO: Initializing latency model
[125020] DEBUG: setting process local rank for 10 local processes
[125020] DEBUG: process local rank is 0 on system inv09
[125020] DEBUG: number of cpus is 20
[125020] DEBUG: partitioning CPUS, this process has CPUs from 0 and 1
[125020] DEBUG: disabling CPU 4
[125020] DEBUG: disabling CPU 6
[125020] DEBUG: disabling CPU 8
[125020] DEBUG: disabling CPU 10
[125020] DEBUG: disabling CPU 12
[125020] DEBUG: disabling CPU 14
[125020] DEBUG: disabling CPU 16
[125020] DEBUG: disabling CPU 18
[125020] DEBUG: disabling CPU 20
[125020] DEBUG: disabling CPU 22
[125020] DEBUG: disabling CPU 24
[125020] DEBUG: disabling CPU 26
[125020] DEBUG: disabling CPU 28
[125020] DEBUG: disabling CPU 30
[125020] DEBUG: disabling CPU 32
[125020] DEBUG: disabling CPU 34
[125020] DEBUG: disabling CPU 36
[125020] DEBUG: disabling CPU 38
[125020] INFO: Registering thread tid [125020]
[125020] INFO: Binding thread tid [125020] pthread: 0x7f709f5d9840 on processor 0
[125020] DEBUG: thread id [125020] last epoch was 109 usec ago
[125020] DEBUG: thread id [125020] last epoch was 176 usec ago
[125020] DEBUG: thread id [125020] last epoch was 234 usec ago
[125020] DEBUG: thread id [125020] last epoch was 292 usec ago
[125020] DEBUG: thread id [125020] last epoch was 349 usec ago
[125020] DEBUG: thread id [125020] last epoch was 407 usec ago
[125020] DEBUG: thread id [125020] last epoch was 464 usec ago
The output for the above command is shown below and is the same as the previous output. If you look at the log, it detects four 'Haswell Xeon' processors on four different machines and the emulation ran on all four nodes; I can also see four different processes in this run, with process ids 125020, 26005, 10121, and 109372.
Hello World! Process 0 of 4 on inv08
Hello World! Process 1 of 4 on inv09
Hello World! Process 2 of 4 on inv11
Hello World! Process 3 of 4 on inv14
So based on the first and second logs, the only difference I see is that libnvmemul.so and nvmemul.ini are loaded once in the first case and four times in the second case. Can you please share your observations based on these logs? Also, what do you think is the difference between the two cases?
@hkundnani
Overall, the results are not much different. As you can see in my 'run.sh', we use './scripts/runenv.sh' to run the emulator. In 'runenv.sh', lines 45 and 46 set Quartz's environment variables, and lines 53 and 54 print them; that is why you see scripts/../build/src/lib/libnvmemul.so and scripts/../nvmemul.ini four times.
In my 'run.sh', we first load Quartz's library and then run 'runenv.sh'. In the logs you provided, the emulator initialization appears before the test runs, so we guess that in the end there is only one emulator at work.
In Quartz's README file, you can see how to emulate your program without './scripts/runenv.sh'. You just need to load Quartz's library and set the environment variables mentioned in './scripts/runenv.sh'. Then all the commands you run will execute in the emulation environment.
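For example, here is a minimal sketch of that manual setup, based on the library and config paths printed in the logs above. QUARTZ_DIR is a placeholder for your Quartz checkout, and NVMEMUL_INI is my assumption for the config variable name; please verify both against 'scripts/runenv.sh':
# Assumed variable names; check scripts/runenv.sh for the exact ones it exports.
$ export EMUL_LOCAL_PROCESSES=4
$ export LD_PRELOAD=$QUARTZ_DIR/build/src/lib/libnvmemul.so
$ export NVMEMUL_INI=$QUARTZ_DIR/nvmemul.ini
$ mpirun -np 4 /home/liao/hello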
@Gumi-presentation-by-Dzh
I saw the way to emulate the program without using './scripts/runenv.sh' and was able to run your sample code without any issue. I tried a few other applications like GraphMat and Gemini and was able to run those too without any issues on the emulator. Only PowerGraph hangs at certain code that calls MPI_Barrier inside a loop. Maybe I will need to dig deeper into the code.
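To see where it gets stuck, one generic option (a debugging sketch, not specific to Quartz; <pid-of-hung-rank> is a placeholder) is to attach a debugger to one of the hung ranks and dump all thread backtraces, which shows whether the rank is blocked inside MPI_Barrier itself or inside the preloaded emulator library:
$ gdb -p <pid-of-hung-rank> -batch -ex 'thread apply all bt'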
Thank you for replying to such an old thread.
@hkundnani If you find more information about MPI_Barrier or solve this problem, let us know.
I export the EMUL_LOCAL_PROCESSES environment variable with the number of emulated processes on the host. I also tried both the NVM model and the DRAM+NVM model to run the MPI programs and the PARSEC benchmarks as multiple programs, but I cannot get any results.
There is a fatal error in MPI_Finalize, and I don't know how to use this setup to test multiple programs. How should I deal with this problem?
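For reference, here is the pattern I am using (a sketch; './parsec_app' is a placeholder for the actual benchmark binary):
# Exported in the same shell that launches the job, matching the number of
# emulated processes spawned on this host.
$ export EMUL_LOCAL_PROCESSES=4
$ mpirun -np 4 ./parsec_app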