HewlettPackard / quartz

Quartz: A DRAM-based performance emulator for NVM
https://github.com/HewlettPackard/quartz

Failed in Multiple emulated processes and MPI programs #20

Open Gumi-presentation-by-Dzh opened 6 years ago

Gumi-presentation-by-Dzh commented 6 years ago

I export the EMUL_LOCAL_PROCESSES environment variable with the number of emulated processes on the host. I also chose the NVM model and the DRAM+NVM model to run MPI programs and the PARSEC benchmark as multiple processes, but I can't get any results.

There is a fatal error in MPI_Finalize, and I don't know how to use the emulator to test multiple programs. How should I deal with this problem?

Gumi-presentation-by-Dzh commented 6 years ago

What should I set when I test MPI programs, besides the EMUL_LOCAL_PROCESSES environment variable?

Gumi-presentation-by-Dzh commented 6 years ago

We tried the MPI "hello world" demo, but there is a fatal error in MPI_Finalize. Do we need to set anything else when testing MPI programs?

Gumi-presentation-by-Dzh commented 6 years ago

Here is the test source.

include "mpi.h"

include

include

int main (int argc, char **argv) { int myid, numprocs; int namelen; char processor_name[MPI_MAX_PROCESSOR_NAME];

MPI_Init (&argc, &argv); MPI_Comm_rank (MPI_COMM_WORLD, &myid); MPI_Comm_size (MPI_COMM_WORLD, &numprocs); MPI_Get_processor_name (processor_name, &namelen); fprintf (stderr, "Hello World! Process %d of %d on %s\n", myid, numprocs, processor_name); MPI_Finalize (); return 0; }

guimagalhaes commented 6 years ago

Hi @Gumi-presentation-by-Dzh, how many CPUs and cores per CPU do you have? What value of EMUL_LOCAL_PROCESSES are you setting, and how many MPI processes/ranks are you spawning? Which MPI library and version are you using?

The best configuration is to set EMUL_LOCAL_PROCESSES to the same number of processes spawned by the MPI program. The recommended configuration for the emulator is to spawn the same number of processes as there are cores on a single CPU chip. Also try different MPI libraries.

Please set the debug level to 5 and send the entire log.
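
For reference, a minimal sketch of that setup, assuming Quartz's scripts/runenv.sh launcher and a 4-rank job; paths are illustrative:

```sh
# set the rank-count hint in the same shell that launches the job
export EMUL_LOCAL_PROCESSES=4

# debug.level = 5 goes in nvmemul.ini; then launch the MPI job under the emulator
scripts/runenv.sh mpirun -np 4 ./hello
```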

Gumi-presentation-by-Dzh commented 6 years ago

We have two CPUs and 10 cores per CPU. We spawn 4 MPI processes and set EMUL_LOCAL_PROCESSES to 4, as you suggested. But even though I exported EMUL_LOCAL_PROCESSES=4, the debug output said "No EMUL_LOCAL_PROCESSES variable set, skipping rank setting". Why?

guimagalhaes commented 6 years ago

Did you export this variable in the same shell where you execute the MPI program? $ export EMUL_LOCAL_PROCESSES=4

Gumi-presentation-by-Dzh commented 6 years ago

Yes. I did it again, and it said:

```
[13869] DEBUG: setting process local rank for 4 local processes
[13869] ERROR: failed to set process local rank
```

Gumi-presentation-by-Dzh commented 6 years ago

```
[14224] DEBUG: [14224] INFO: Detected CPU model 'Haswell Xeon'
[14224] DEBUG: Possible NUMA nodes are 1024
[14224] DEBUG: NUMA nodes allowed are 1024
[14224] DEBUG: NUMA configured CPUs are 20
[14224] INFO: 10 CPUs on physical node 0
[14224] INFO: 10 CPUs on physical node 1
[14224] INFO: Loading memory-controller pci topology from /tmp/mc_pci_bus
[14224] INFO: node: 1, pci addr: 7f:14.0
[14224] INFO: node: 1, pci bus: 0x7f
[14224] INFO: node: 1, pci addr: 7f:14.1
[14224] INFO: node: 1, pci addr: 7f:17.0
[14224] INFO: node: 1, pci addr: 7f:17.1
[14224] WARNING: No complete memory-controller pci topology found in /tmp/mc_pci_bus
[14224] INFO: Number of physical nodes 2
[14224] INFO: Number of virtual nodes 1
[14224] INFO: measuring latency: nchains 1, nelems 4096000, elem_sz 64, access_sz 8, from_node_id 0, to_node_id 0
[14224] INFO: measuring latency: latency is 81 ns
[14224] INFO: measuring latency: nchains 1, nelems 4096000, elem_sz 64, access_sz 8, from_node_id 0, to_node_id 1
[14224] INFO: measuring latency: latency is 122 ns
[14224] INFO: Fusing physical nodes 0 1 into virtual node 0
[14224] INFO: Initializing latency model
[14224] DEBUG: setting process local rank for 4 local processes
[14224] ERROR: failed to set process local rank
```

guimagalhaes commented 6 years ago

Please manually remove the "/tmp/emul_lock_file" and "/tmp/emul_process_local_rank" files. The lock file is preventing the new process from setting its rank. When processes crash, they may leave these files behind, so they need to be cleaned up by hand.
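
For example:

```sh
# stale rank-coordination files left behind by crashed emulated processes
rm -f /tmp/emul_lock_file /tmp/emul_process_local_rank
```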

Gumi-presentation-by-Dzh commented 6 years ago

I removed these two files, but it still doesn't work: the MPI program produces no output and never terminates.

The debug message is:

```
[14601] DEBUG: [14601] INFO: Detected CPU model 'Haswell Xeon'
[14601] DEBUG: Possible NUMA nodes are 1024
[14601] DEBUG: NUMA nodes allowed are 1024
[14601] DEBUG: NUMA configured CPUs are 20
[14601] INFO: 10 CPUs on physical node 0
[14601] INFO: 10 CPUs on physical node 1
[14601] INFO: Loading memory-controller pci topology from /tmp/mc_pci_bus
[14601] INFO: node: 1, pci addr: 7f:14.0
[14601] INFO: node: 1, pci bus: 0x7f
[14601] INFO: node: 1, pci addr: 7f:14.1
[14601] INFO: node: 1, pci addr: 7f:17.0
[14601] INFO: node: 1, pci addr: 7f:17.1
[14601] INFO: Number of physical nodes 2
[14601] INFO: Number of virtual nodes 1
[14601] INFO: measuring latency: nchains 1, nelems 4096000, elem_sz 64, access_sz 8, from_node_id 0, to_node_id 0
[14601] INFO: measuring latency: latency is 87 ns
[14601] INFO: measuring latency: nchains 1, nelems 4096000, elem_sz 64, access_sz 8, from_node_id 0, to_node_id 1
[14601] INFO: measuring latency: latency is 141 ns
[14601] INFO: Fusing physical nodes 0 1 into virtual node 0
[14601] INFO: Initializing latency model
[14601] DEBUG: setting process local rank for 4 local processes
[14599] ERROR: failed to set process local rank
[14601] ERROR: failed to set process local rank
[14600] DEBUG: overhead cycles: 43698481; immediate overhead 43026143; stall cycles: 0; delay cycles: 0
[14600] DEBUG: injecting delay of 0 cycles (0 usec) - discounted overhead
[14600] DEBUG: thread id [14600] last epoch was 9 usec ago
```

Gumi-presentation-by-Dzh commented 6 years ago

As mentioned above, the MPI program no longer makes any progress, and the emulator stops printing DEBUG output.

guimagalhaes commented 6 years ago

@Gumi-presentation-by-Dzh, can you track which processes are still running? Use 'ps aux | grep <program name>' to track them. Was MPI_Finalize() called? I assume the MPI program works when not running with the NVM latency emulator.

Try a different number of MPI processes, and also remove the EMUL_LOCAL_PROCESSES variable for now.
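
A sketch of those checks, using the program and wrapper names that appear later in this thread:

```sh
# see which ranks are still alive (a rank stuck in MPI_Finalize shows up here)
ps aux | grep hello

# retry with a different rank count and without the rank-count hint
unset EMUL_LOCAL_PROCESSES
sh run.sh mpirun -np 2 /home/liao/hello
```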

Gumi-presentation-by-Dzh commented 6 years ago

Here is what I tracked:

```
root 13866 0.0 1.1 590716 546052 ? Ssl 21:17 0:02 /home/liao/hello
root 14049 0.0 1.1 590716 546048 ? Ssl 21:19 0:02 /home/liao/hello
root 14050 0.0 1.1 590716 546052 ? Ssl 21:19 0:02 /home/liao/hello
root 14051 3.3 1.1 590716 546052 ? Ssl 21:19 3:45 /home/liao/hello
root 14052 0.0 1.1 590716 546052 ? Ssl 21:19 0:02 /home/liao/hello
root 14600 0.0 1.1 590716 546056 ? Ssl 21:28 0:02 /home/liao/hello
root 15320 2.8 1.1 590716 546052 ? Ssl 23:12 0:02 /home/liao/hello
root 15321 2.8 1.1 590716 546052 ? Ssl 23:12 0:02 /home/liao/hello
root 15322 6.0 1.1 590716 546052 ? Ssl 23:12 0:06 /home/liao/hello
root 15323 6.0 1.1 590716 546052 ? Ssl 23:12 0:06 /home/liao/hello
root 15429 0.0 0.0 113048 1432 pts/0 S+ 23:13 0:00 sh run.sh mpirun -np 4 /home/liao/hello
root 15457 0.0 0.0 113048 1444 pts/0 S+ 23:13 0:00 /bin/bash /home/ZHduan/HME-q/scripts/runenv.sh mpirun -np 4 /home/liao/hello
root 15492 8.7 1.1 587488 545692 pts/0 Sl+ 23:13 0:03 mpirun -np 4 /home/liao/hello
root 15500 7.2 1.1 590716 546056 ? Ssl 23:13 0:02 /home/liao/hello
liao 15509 0.0 0.0 112588 948 pts/1 S+ 23:13 0:00 grep --color=auto hello
```

Here is the top output:

```
15322 root 20 0 590716 546052 1648 S 3.7 1.1 0:09.55 hello
13863 root 20 0 585948 545992 1240 S 3.3 1.1 3:56.19 hydra_pmi_proxy
14045 root 20 0 585948 545520 1232 S 3.3 1.1 3:47.16 hydra_pmi_proxy
14051 root 20 0 590716 546052 1648 S 3.3 1.1 3:49.19 hello
15323 root 20 0 590716 546052 1648 S 3.3 1.1 0:09.61 hello
15492 root 20 0 587488 545692 1364 S 3.3 1.1 0:07.14 mpirun
15494 root 20 0 585948 546004 1240 S 3.3 1.1 0:06.96 hydra_pmi_proxy
14596 root 20 0 585948 545988 1240 S 3.0 1.1 3:30.57 hydra_pmi_proxy
15317 root 20 0 585948 545616 1232 S 3.0 1.1 0:09.07 hydra_pmi_proxy
```

You can see the hello test program is still running. And the debug message is:

```
[15500] DEBUG: thread id [15500] reached max epoch duration (1000 usec)
[15500] DEBUG: interrupting thread [15500]
[15500] DEBUG: Handling interrupt thread [15500] pthread: 0x7f3586b4cc00
[15500] DEBUG: thread id [15500] last epoch was 1050 usec ago
[15500] DEBUG: thread id [15500] reached min epoch duration (1000 usec)
[15500] DEBUG: read stall L2 cycles diff 0; llc_hit 0; cycles diff remote_dram 0; local_dram 0
[15500] DEBUG: overhead cycles: 963331; immediate overhead 12227; stall cycles: 0; delay cycles: 0
[15500] DEBUG: injecting delay of 0 cycles (0 usec) - discounted overhead
[15500] DEBUG: thread id [15500] last epoch was 40 usec ago
[15500] DEBUG: thread id [15500] last epoch was 96 usec ago
[15498] ERROR: failed to set process local rank
```

Yes, we can run this MPI program without the NVM latency emulator, but we want to test the MPI program under NVM emulation. Is there anything else that should be set when we run the emulator?

guimagalhaes commented 6 years ago

It seems many more processes are running than the expected number. Can you tell why? Kill all processes related to the MPI program or restart the system. In the case above, I would expect to see process id '15500' in the 'ps' output.

It seems MPI_Finalize() is hanging. You can debug your program and confirm that the MPI_Finalize() function is called and never returns. Probably some MPI process has not reached MPI_Finalize yet. You need to verify that all expected processes were started and did not crash. If any crashed, we need to get the stack trace.
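
One way to confirm this, assuming gdb is available on the node (the PID is whichever hello rank is still alive in ps):

```sh
# dump all thread backtraces of a hung rank; MPI_Finalize (or a PMI wait)
# near the top of a stack would confirm where it is stuck
gdb -p <pid-of-hung-rank> -batch -ex "thread apply all bt"
```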

Gumi-presentation-by-Dzh commented 6 years ago

I restarted the system and ran the MPI program again, but something weird happened.

I tracked the processes:

```
[ZHduan@node11 ~]$ ps aux | grep hello
root 2743 0.0 0.0 113048 1428 pts/0 S+ 10:24 0:00 sh run.sh mpirun -np 4 /home/liao/hello
root 2771 0.0 0.0 113048 1448 pts/0 S+ 10:24 0:00 /bin/bash /home/ZHduan/HME-q/scripts/runenv.sh mpirun -np 4 /home/liao/hello
root 2868 29.9 1.1 587488 545692 pts/0 Rl+ 10:24 1:18 mpirun -np 4 /home/liao/hello
root 2872 1.1 0.0 0 0 ? Zs 10:25 0:02 [hello]
root 2873 1.1 0.0 0 0 ? Zs 10:25 0:02 [hello]
root 2874 9.1 1.1 590716 546052 ? Ssl 10:25 0:23 /home/liao/hello
root 2875 1.1 0.0 0 0 ? Zs 10:25 0:02 [hello]
ZHduan 3022 0.0 0.0 112588 948 pts/1 S+ 10:29 0:00 grep --color=auto hello
```

This time it seems like only one hello process is still running.

And the top message is:

```
2870 root 20 0 585948 545996 1240 R 97.3 1.1 3:36.59 hydra_pmi_proxy
2691 root 20 0 148364 5704 4204 R 30.9 0.0 1:08.25 sshd
2868 root 20 0 587488 545692 1368 S 26.9 1.1 1:12.74 mpirun
135 root 20 0 0 0 0 S 9.0 0.0 0:20.98 kworker/0:1
2874 root 20 0 590716 546052 1652 S 8.6 1.1 0:21.17 hello
```

But Quartz's debug output keeps looping forever, so I interrupted it and got these debug messages:

```
[2870] INFO: Registering thread tid [2870]
[2870] INFO: Binding thread tid [2870] pthread: 0x7fb26124e7c0 on processor 2
[2870] DEBUG: thread id [2870] last epoch was 197051271 usec ago
[2870] DEBUG: thread id [2870] reached max epoch duration (1000 usec)
[2870] DEBUG: interrupting thread [2870]
[2870] DEBUG: Handling interrupt thread [2870] pthread: 0x7fb26124e7c0
[2870] DEBUG: thread id [2870] last epoch was 197051287 usec ago
[2870] DEBUG: thread id [2870] reached min epoch duration (1000 usec)
[2870] DEBUG: read stall L2 cycles diff 105501; llc_hit 591; cycles diff remote_dram 66; local_dram 228
[2870] DEBUG: overhead cycles: 19188; immediate overhead 19188; stall cycles: 24814; delay cycles: 19895
[2870] DEBUG: injecting delay of 707 cycles (0 usec) - discounted overhead
[2870] DEBUG: thread id [2870] last epoch was 34 usec ago
[2870] DEBUG: thread id [2870] last epoch was 84 usec ago
[2870] DEBUG: thread id [2870] last epoch was 134 usec ago
[2870] DEBUG: thread id [2870] last epoch was 184 usec ago
[2870] DEBUG: thread id [2870] last epoch was 234 usec ago
[2870] DEBUG: thread id [2870] last epoch was 284 usec ago
[2870] DEBUG: thread id [2870] last epoch was 334 usec ago
[2870] DEBUG: thread id [2870] last epoch was 384 usec ago
[2870] DEBUG: thread id [2870] last epoch was 434 usec ago
[2870] DEBUG: thread id [2870] last epoch was 484 usec ago
[2870] DEBUG: thread id [2870] last epoch was 534 usec ago
[2870] DEBUG: thread id [2870] last epoch was 584 usec ago
[2870] DEBUG: thread id [2870] last epoch was 633 usec ago
[2870] DEBUG: thread id [2870] last epoch was 683 usec ago
[2870] DEBUG: thread id [2870] last epoch was 733 usec ago
[2870] DEBUG: thread id [2870] last epoch was 783 usec ago
[2870] DEBUG: thread id [2870] last epoch was 833 usec ago
[2870] DEBUG: thread id [2870] last epoch was 883 usec ago
[2870] DEBUG: thread id [2870] last epoch was 932 usec ago
[2870] DEBUG: thread id [2870] last epoch was 982 usec ago
[2870] DEBUG: thread id [2870] last epoch was 1043 usec ago
[2870] DEBUG: thread id [2870] reached max epoch duration (1000 usec)
[2870] DEBUG: interrupting thread [2870]
[2870] DEBUG: Unsetting process local rank
[mpiexec@node11] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
[mpiexec@node11] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec@node11] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:217): launcher returned error waiting for completion
[mpiexec@node11] main (./ui/mpich/mpiexec.c:331): process manager error waiting for completion
[2868] DEBUG: Unsetting process local rank
[2868] ERROR: failed to unset process local rank
```

Here is the message I got from dmesg:

```
[ 197.781592] nvmemul: module verification failed: signature and/or required key missing - tainting kernel
[ 197.781966] nvmemul: Loading. Initializing...
[ 197.781972] nvmemul: major is 246
[ 200.663357] nvmemul: ioctl command: 0x8008aa00
[ 200.663422] nvmemul: setcounter counter_id: 0x0 event_id=0x55305a3
[ 200.663430] nvmemul: ioctl command: 0x8008aa00
[ 200.663450] nvmemul: setcounter counter_id: 0x1 event_id=0x5308d2
[ 200.663454] nvmemul: ioctl command: 0x8008aa00
[ 200.663482] nvmemul: setcounter counter_id: 0x2 event_id=0x530cd3
[ 200.663486] nvmemul: ioctl command: 0x8008aa00
[ 200.663506] nvmemul: setcounter counter_id: 0x3 event_id=0x5303d3
[ 202.961989] nvmemul: ioctl command: 0x8008aa00
[ 202.962049] nvmemul: setcounter counter_id: 0x0 event_id=0x55305a3
[ 202.962057] nvmemul: ioctl command: 0x8008aa00
[ 202.962076] nvmemul: setcounter counter_id: 0x1 event_id=0x5308d2
[ 202.962081] nvmemul: ioctl command: 0x8008aa00
[ 202.962100] nvmemul: setcounter counter_id: 0x2 event_id=0x530cd3
[ 202.962105] nvmemul: ioctl command: 0x8008aa00
[ 202.962124] nvmemul: setcounter counter_id: 0x3 event_id=0x5303d3
[ 204.826528] SELinux: initialized (dev tmpfs, type tmpfs), uses transition SIDs
[ 206.158277] nvmemul: ioctl command: 0x8008aa00
[ 206.158339] nvmemul: setcounter counter_id: 0x0 event_id=0x55305a3
[ 206.158347] nvmemul: ioctl command: 0x8008aa00
[ 206.158369] nvmemul: setcounter counter_id: 0x1 event_id=0x5308d2
[ 206.158374] nvmemul: ioctl command: 0x8008aa00
[ 206.158396] nvmemul: setcounter counter_id: 0x2 event_id=0x530cd3
[ 206.158400] nvmemul: ioctl command: 0x8008aa00
[ 206.158422] nvmemul: setcounter counter_id: 0x3 event_id=0x5303d3
[ 206.180462] nvmemul: ioctl command: 0x8008aa00
[ 206.180514] nvmemul: setcounter counter_id: 0x0 event_id=0x55305a3
[ 206.180539] nvmemul: ioctl command: 0x8008aa00
[ 206.180561] nvmemul: setcounter counter_id: 0x1 event_id=0x5308d2
[ 206.180604] nvmemul: ioctl command: 0x8008aa00
[ 206.180624] nvmemul: setcounter counter_id: 0x2 event_id=0x530cd3
[ 206.180644] nvmemul: ioctl command: 0x8008aa00
[ 206.180663] nvmemul: setcounter counter_id: 0x3 event_id=0x5303d3
[ 206.182356] nvmemul: ioctl command: 0x8008aa00
[ 206.182381] nvmemul: setcounter counter_id: 0x0 event_id=0x55305a3
[ 206.182388] nvmemul: ioctl command: 0x8008aa00
[ 206.182408] nvmemul: setcounter counter_id: 0x1 event_id=0x5308d2
[ 206.182413] nvmemul: ioctl command: 0x8008aa00
[ 206.182432] nvmemul: setcounter counter_id: 0x2 event_id=0x530cd3
[ 206.182436] nvmemul: ioctl command: 0x8008aa00
[ 206.182456] nvmemul: setcounter counter_id: 0x3 event_id=0x5303d3
[ 206.185735] nvmemul: ioctl command: 0x8008aa00
[ 206.185759] nvmemul: setcounter counter_id: 0x0 event_id=0x55305a3
[ 206.185766] nvmemul: ioctl command: 0x8008aa00
[ 206.185785] nvmemul: setcounter counter_id: 0x1 event_id=0x5308d2
[ 206.185789] nvmemul: ioctl command: 0x8008aa00
[ 206.185808] nvmemul: setcounter counter_id: 0x2 event_id=0x530cd3
[ 206.185812] nvmemul: ioctl command: 0x8008aa00
[ 206.185830] nvmemul: setcounter counter_id: 0x3 event_id=0x5303d3
```

So what is happening when I run an MPI program under Quartz?

The command I use is: sh run.sh mpirun -np 4 /home/liao/hello

The run.sh is:

```sh
_INPUT_COMMAN=$@
_LOAD_PATH="/home/ZHduan/HME-q/scripts/setupdev.sh"
_RUN_PATH="/home/ZHduan/HME-q/scripts/runenv.sh"
_ROOT="sudo"
_LOAD="load"
_RELOAD="reload"

echo ${_INPUT_COMMAN}

Init() {
    ${_ROOT} ${_LOAD_PATH} ${_RELOAD}
    ${_ROOT} ${_LOAD_PATH} ${_LOAD}
}

Start() {
    ${_RUN_PATH} ${_INPUT_COMMAN}
}

Init
Start
```

The hello test is:

include "mpi.h"

include

include

int main (int argc, char **argv) { int myid, numprocs; int namelen; char processor_name[MPI_MAX_PROCESSOR_NAME];

MPI_Init (&argc, &argv); MPI_Comm_rank (MPI_COMM_WORLD, &myid); MPI_Comm_size (MPI_COMM_WORLD, &numprocs); MPI_Get_processor_name (processor_name, &namelen); fprintf (stderr, "Hello World! Process %d of %d on %s\n", myid, numprocs, processor_name); MPI_Finalize (); return 0; }

guimagalhaes commented 6 years ago

The dmesg output seems correct, I can see 6 processes attempting to enable the required performance counters.

In the 'nvmemul.ini' file do you have 'bandwidth.enable' set to false? Please make sure the bandwidth modeling is disabled.

When you started the MPI program, it spawned 4 processes as expected (PIDs 2872, 2873, 2874 and 2875). You can verify the process tree by running 'ps axjf'. It will show the main MPI process and its child processes.

Attempt to run the MPI program with just one process and let's see what happens. Also attempt to run the MPI program with both bandwidth and latency emulation turned off in nvmemul.ini.
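
A sketch of those two checks, reusing the command from this thread (latency.enable and bandwidth.enable are the flags shown in the nvmemul.ini posted below):

```sh
# 1) single-rank run under the emulator
sh run.sh mpirun -np 1 /home/liao/hello

# 2) set latency.enable = false and bandwidth.enable = false in nvmemul.ini,
#    then repeat the 4-rank run; the library loads but injects no delays
sh run.sh mpirun -np 4 /home/liao/hello
```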

When you run the MPI program with the NVM emulator, some process hangs. If that process doesn't return, the main MPI process never finishes. We need to understand which process is hung and why; that will give us the direction to follow.

The files "/tmp/emul_lock_file" and "/tmp/emul_process_local_rank" should be automatically deleted when the program terminates unless some process is hung or crashes. Delete these files and do not set the variable EMUL_LOCAL_PROCESSES for now.

The NVM emulator turns CPU counters on and sets thread CPU affinity. The threads are interrupted and delay is injected. I can see the interrupts are handled properly. I can't yet see why the emulator would cause processes to hang indefinitely.

Also please run any other type of application using the NVM emulator. Let's see if the problem is only related to MPI.
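
For example, a sketch of a non-MPI sanity run under the same wrapper (the benchmark path is only a placeholder):

```sh
# if a plain single-process workload completes normally under the emulator,
# the hang is specific to the MPI launch path
sh run.sh /path/to/some_non_mpi_benchmark
```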

Gumi-presentation-by-Dzh commented 6 years ago

That's a lot of advice, so let us answer it point by point below:

  1. In the 'nvmemul.ini' file, do you have 'bandwidth.enable' set to false? Yes, we used the following configuration to get the messages above:

    latency:
    {
    enable = true;
    inject_delay = true;
    read = 200 ;
    write = 700;
    max_epoch_duration_us = 1000 ;
    min_epoch_duration_us = 1000 ;
    calibration = false;
    };
    bandwidth:
    {
    enable = false;
    model = "/tmp/bandwidth_model";
    read = 500;
    write = 500;
    };
    topology:
    {
    mc_pci = "/tmp/mc_pci_bus";
    physical_nodes = "0,1";
    hyperthreading = true; # do not use multiple hardware threads per core
    };
    statistics:
    {
    enable = true;
    file = "/tmp/statistics";
    };
    debug:
    {
    # debugging level
    level = 5;
    verbose = 0;
    
    # modules set to True produce debugging output
    module:
    {
        all = true;
    };
    };

    So the bandwidth modeling is disabled.

  2. I run "sh run.sh mpirun -np 4 /home/liao/hello" again , and this time I got the end:

    
    [19078] DEBUG: thread id [19078] reached min epoch duration (1000 usec)
    [19078] DEBUG: read stall L2 cycles diff 5996; llc_hit 910; cycles diff remote_dram 0; local_dram 0
    [19078] DEBUG: overhead cycles: 3972744; immediate overhead 11990; stall cycles: 0; delay cycles: 0
    [19078] DEBUG: injecting delay of 0 cycles (0 usec) - discounted overhead
    [19078] DEBUG: thread id [19078] last epoch was 40 usec ago
    [19078] DEBUG: thread id [19078] last epoch was 96 usec ago
    [19078] DEBUG: thread id [19078] last epoch was 152 usec ago
    [19078] DEBUG: thread id [19078] last epoch was 207 usec ago
    [19078] DEBUG: thread id [19078] last epoch was 263 usec ago
    [19078] DEBUG: thread id [19078] last epoch was 318 usec ago
    [19078] DEBUG: thread id [19078] last epoch was 373 usec ago
    [19078] DEBUG: thread id [19078] last epoch was 429 usec ago
    [19078] DEBUG: thread id [19078] last epoch was 484 usec ago
    [19078] DEBUG: thread id [19078] last epoch was 540 usec ago
    [19078] DEBUG: thread id [19078] last epoch was 595 usec ago
    [19078] DEBUG: thread id [19078] last epoch was 651 usec ago
    [19078] DEBUG: thread id [19078] last epoch was 706 usec ago
    [19078] DEBUG: thread id [19078] last epoch was 762 usec ago
    [19078] DEBUG: thread id [19078] last epoch was 817 usec ago
    [19078] DEBUG: thread id [19078] last epoch was 873 usec ago
    [19078] DEBUG: thread id [19078] last epoch was 928 usec ago
    [19078] DEBUG: thread id [19078] last epoch was 984 usec ago
    [19078] DEBUG: thread id [19078] last epoch was 1039 usec ago
    [19079] DEBUG: thread id [19079] last epoch was 24 usec ago
    [19079] DEBUG: thread id [19079] last epoch was 80 usec ago
    [19079] DEBUG: thread id [19079] last epoch was 86 usec ago
    [19079] DEBUG: thread id [19079] last epoch was 141 usec ago
    [19079] DEBUG: thread id [19079] last epoch was 197 usec ago
    [19079] DEBUG: thread id [19079] last epoch was 253 usec ago
    [19079] DEBUG: thread id [19079] last epoch was 311 usec ago
    [19079] DEBUG: thread id [19079] last epoch was 367 usec ago
    [19079] INFO: Registering thread tid [19079]
    [19079] INFO: Binding thread tid [19079] pthread: 0x7f26d794ec00 on processor 1
    [19079] DEBUG: thread id [19079] last epoch was 110088716323 usec ago
    [19079] DEBUG: thread id [19079] reached min epoch duration (1000 usec)
    [19079] DEBUG: thread id [19079] last epoch was 110088716331 usec ago
    [19079] DEBUG: thread id [19079] reached min epoch duration (1000 usec)
    [19079] DEBUG: read stall L2 cycles diff 2818664; llc_hit 86660; cycles diff remote_dram 4098; local_dram 7960
    [19079] DEBUG: overhead cycles: 7461678; immediate overhead 14855; stall cycles: 623736; delay cycles: 273725
    [19079] DEBUG: injecting delay of 0 cycles (0 usec) - discounted overhead
    [19079] DEBUG: thread id [19079] last epoch was 6 usec ago
    [19079] DEBUG: thread id [19079] last epoch was 6 usec ago
    hello-in init done
    Hello World! Process 0 of 4 on node11
    Fatal error in MPI_Finalize: Other MPI error, error stack:
    MPI_Finalize(293).....................: MPI_Finalize failed
    MPI_Finalize(213).....................: 
    MPID_Finalize(118)....................: 
    MPIDI_CH3U_VC_WaitForClose(382).......: an error occurred while the device was waiting for all open connections to close
    MPIDI_CH3I_Progress(309)..............: 
    MPIDI_CH3U_Check_for_failed_procs(470): PMI_KVS_Get failed
    hello-in init done
    Hello World! Process 1 of 4 on node11
    Fatal error in MPI_Finalize: Other MPI error, error stack:
    MPI_Finalize(293).....................: MPI_Finalize failed
    MPI_Finalize(213).....................: 
    MPID_Finalize(118)....................: 
    MPIDI_CH3U_VC_WaitForClose(382).......: an error occurred while the device was waiting for all open connections to close
    MPIDI_CH3I_Progress(309)..............: 
    MPIDI_CH3U_Check_for_failed_procs(470): PMI_KVS_Get failed
    hello-in init done
    Hello World! Process 2 of 4 on node11
    Fatal error in MPI_Finalize: Other MPI error, error stack:
    MPI_Finalize(293).....................: MPI_Finalize failed
    MPI_Finalize(213).....................: 
    MPID_Finalize(118)....................: 
    MPIDI_CH3U_VC_WaitForClose(382).......: an error occurred while the device was waiting for all open connections to close
    MPIDI_CH3I_Progress(309)..............: 
    MPIDI_CH3U_Check_for_failed_procs(470): PMI_KVS_Get failed
    hello-in init done
    Hello World! Process 3 of 4 on node11
    Fatal error in MPI_Finalize: Other MPI error, error stack:
    MPI_Finalize(293).....................: MPI_Finalize failed
    MPI_Finalize(213).....................: 
    MPID_Finalize(118)....................: 
    MPIDI_CH3U_VC_WaitForClose(382).......: an error occurred while the device was waiting for all open connections to close
    MPIDI_CH3I_Progress(309)..............: 
    MPIDI_CH3U_Check_for_failed_procs(470): PMI_KVS_Get failed
    0 usec) - discounted overhead
    [19077] DEBUG: thread id [19077] last epoch was 39 usec ago
    [19077] INFO: Registering thread tid [19077]
    [19077] INFO: Binding thread tid [19077] pthread: 0x7f66c3434c00 on processor 1
    [19077] DEBUG: thread id [19077] last epoch was 110088714187 usec ago
    [19077] DEBUG: thread id [19077] reached min epoch duration (1000 usec)
    [19077] DEBUG: thread id [19077] last epoch was 110088714189 usec ago
    [19077] DEBUG: thread id [19077] reached max epoch duration (1000 usec)
    [19077] DEBUG: interrupting thread [19077]
    [19077] DEBUG: thread id [19077] last epoch was 110088714196 usec ago
    [19077] DEBUG: thread id [19077] reached min epoch duration (1000 usec)
    [19077] DEBUG: read stall L2 cycles diff 1021637; llc_hit 68414; cycles diff remote_dram 32; local_dram 2506
    [19077] DEBUG: overhead cycles: 10560146; immediate overhead 57889; stall cycles: 4240; delay cycles: 1817
    [19077] DEBUG: injecting delay of 0 cycles (0 usec) - discounted overhead
    [19077] DEBUG: Handling interrupt thread [19077] pthread: 0x7f66c3434c00
    [19077] DEBUG: thread id [19077] last epoch was 6 usec ago
    [19077] DEBUG: thread id [19077] last epoch was 17 usec ago
    [19077] DEBUG: thread id [19077] last epoch was 25 usec ago
    [19078] DEBUG: thread id [19078] reached max epoch duration (1000 usec)
    [19078] DEBUG: interrupting thread [19078]
    [19080] DEBUG: thread id [19080] last epoch was 59339 usec ago
    [19080] DEBUG: thread id [19080] reached max epoch duration (1000 usec)
    [19080] DEBUG: interrupting thread [19080]
    [19080] DEBUG: Handling interrupt thread [19080] pthread: 0x7fac9237ec00
    [19080] DEBUG: thread id [19080] last epoch was 59517 usec ago
    [19080] DEBUG: thread id [19080] reached min epoch duration (1000 usec)
    [19080] DEBUG: read stall L2 cycles diff 865884; llc_hit 51120; cycles diff remote_dram 0; local_dram 3948
    [19080] DEBUG: overhead cycles: 8390972; immediate overhead 14748; stall cycles: 0; delay cycles: 0
    [19080] DEBUG: injecting delay of 0 cycles (0 usec) - discounted overhead
    [19080] DEBUG: thread id [19080] last epoch was 34 usec ago
    [19080] INFO: Registering thread tid [19080]
    [19080] INFO: Binding thread tid [19080] pthread: 0x7fac9237ec00 on processor 1
    [19080] DEBUG: thread id [19080] last epoch was 110088714354 usec ago
    [19080] DEBUG: thread id [19080] reached min epoch duration (1000 usec)
    [19080] DEBUG: thread id [19080] last epoch was 110088714357 usec ago
    [19080] DEBUG: thread id [19080] last epoch was 110088714364 usec ago
    [19080] DEBUG: thread id [19080] reached min epoch duration (1000 usec)
    [19080] DEBUG: thread id [19080] reached max epoch duration (1000 usec)
    [19080] DEBUG: read stall L2 cycles diff 1144928; llc_hit 69135; cycles diff remote_dram 253; local_dram 2994
    [19080] DEBUG: interrupting thread [19080]
    [19080] DEBUG: overhead cycles: 8411362; immediate overhead 20390; stall cycles: 33569; delay cycles: 14386
    [19080] DEBUG: injecting delay of 0 cycles (0 usec) - discounted overhead
    [19080] DEBUG: Handling interrupt thread [19080] pthread: 0x7fac9237ec00
    [19080] DEBUG: thread id [19080] last epoch was 4 usec ago
    [19080] DEBUG: thread id [19080] last epoch was 14 usec ago

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   EXIT CODE: 1
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES


You can see there is some MPI output in the middle of the debug messages; we can clearly see MPI_Finalize(293): MPI_Finalize failed.
This time we did not set EMUL_LOCAL_PROCESSES to 4, and this is what we got.

Then we tried setting EMUL_LOCAL_PROCESSES to 4 and running the program again, and the program hung.
The debug message is:
```
[19218] DEBUG: thread id [19218] last epoch was 457 usec ago
[19219] DEBUG: [19219] INFO: Detected CPU model 'Haswell Xeon'
[19219] DEBUG: Possible NUMA nodes are 1024
[19219] DEBUG: NUMA nodes allowed are 1024
[19219] DEBUG: NUMA configured CPUs are 20
[19219] INFO: 10 CPUs on physical node 0
[19219] INFO: 10 CPUs on physical node 1
[19219] INFO: Loading memory-controller pci topology from /tmp/mc_pci_bus
[19219] INFO: node: 1, pci addr: 7f:14.0
[19219] INFO: node: 1, pci bus: 0x7f
[19219] INFO: node: 1, pci addr: 7f:14.1
[19219] INFO: node: 1, pci addr: 7f:17.0
[19219] INFO: node: 1, pci addr: 7f:17.1
[19219] INFO: Number of physical nodes 2
[19219] INFO: Number of virtual nodes 1
[19219] INFO: measuring latency: nchains 1, nelems 4096000, elem_sz 64, access_sz 8, from_node_id 0, to_node_id 0
[19219] INFO: measuring latency: latency is 87 ns
[19219] INFO: measuring latency: nchains 1, nelems 4096000, elem_sz 64, access_sz 8, from_node_id 0, to_node_id 1
[19219] INFO: measuring latency: latency is 141 ns
[19219] INFO: Fusing physical nodes 0 1 into virtual node 0
[19219] INFO: Initializing latency model
[19219] DEBUG: setting process local rank for 4 local processes
[19220] DEBUG: [19220] INFO: Detected CPU model 'Haswell Xeon'
[19220] DEBUG: Possible NUMA nodes are 1024
[19220] DEBUG: NUMA nodes allowed are 1024
[19220] DEBUG: NUMA configured CPUs are 20
[19220] INFO: 10 CPUs on physical node 0
[19220] INFO: 10 CPUs on physical node 1
[19220] INFO: Loading memory-controller pci topology from /tmp/mc_pci_bus
[19220] INFO: node: 1, pci addr: 7f:14.0
[19220] INFO: node: 1, pci bus: 0x7f
[19220] INFO: node: 1, pci addr: 7f:14.1
[19220] INFO: node: 1, pci addr: 7f:17.0
[19220] INFO: node: 1, pci addr: 7f:17.1
[19220] INFO: Number of physical nodes 2
[19220] INFO: Number of virtual nodes 1
[19220] INFO: measuring latency: nchains 1, nelems 4096000, elem_sz 64, access_sz 8, from_node_id 0, to_node_id 0
[19220] INFO: measuring latency: latency is 89 ns
[19220] INFO: measuring latency: nchains 1, nelems 4096000, elem_sz 64, access_sz 8, from_node_id 0, to_node_id 1
[19220] INFO: measuring latency: latency is 140 ns
[19220] INFO: Fusing physical nodes 0 1 into virtual node 0
[19220] INFO: Initializing latency model
[19220] DEBUG: setting process local rank for 4 local processes
[19219] ERROR: failed to set process local rank
[19220] ERROR: failed to set process local rank
[19218] DEBUG: thread id [19218] last epoch was 19690 usec ago
[19218] DEBUG: thread id [19218] reached max epoch duration (1000 usec)
[19218] DEBUG: interrupting thread [19218]
[19218] DEBUG: Handling interrupt thread [19218] pthread: 0x7fd70dc90c00
[19218] DEBUG: thread id [19218] last epoch was 19708 usec ago
[19218] DEBUG: thread id [19218] reached min epoch duration (1000 usec)
[19218] DEBUG: read stall L2 cycles diff 0; llc_hit 0; cycles diff remote_dram 0; local_dram 0
[19218] DEBUG: overhead cycles: 586877; immediate overhead 17048; stall cycles: 0; delay cycles: 0
[19218] DEBUG: injecting delay of 0 cycles (0 usec) - discounted overhead
[19218] DEBUG: thread id [19218] last epoch was 33 usec ago
[19218] DEBUG: thread id [19218] last epoch was 89 usec ago
[19218] DEBUG: thread id [19218] last epoch was 144 usec ago
[19218] DEBUG: thread id [19218] last epoch was 200 usec ago
[19218] DEBUG: thread id [19218] last epoch was 256 usec ago
[19218] DEBUG: thread id [19218] last epoch was 311 usec ago
[19218] DEBUG: thread id [19218] last epoch was 366 usec ago
[19218] DEBUG: thread id [19218] last epoch was 422 usec ago
[19218] DEBUG: thread id [19218] last epoch was 477 usec ago
[19218] DEBUG: thread id [19218] last epoch was 533 usec ago
```

Then we ran ps axjf and got this:

    1  1311  1311  1311 ?           -1 Ss       0   0:00 /usr/sbin/sshd -D
 1311 17056 17056 17056 ?           -1 Ss       0   0:00  \_ sshd: ZHduan [priv]
17056 17060 17056 17056 ?           -1 S     1001   0:00  |   \_ sshd: ZHduan@pts/0
17060 17061 17061 17061 pts/0    19149 Ss    1001   0:00  |       \_ -bash
17061 17145 17145 17061 pts/0    19149 S        0   0:00  |           \_ su root
17145 17151 17151 17061 pts/0    19149 S        0   0:00  |               \_ bash
17151 19149 19149 17061 pts/0    19149 S+       0   0:00  |                   \_ sh run.sh mpirun -np 4 /home/liao/hello
19149 19177 19149 17061 pts/0    19149 S+       0   0:00  |                       \_ /bin/bash /home/ZHduan/HME-q/scripts/runenv.sh mpi
19177 19212 19149 17061 pts/0    19149 Sl+      0   0:03  |                           \_ mpirun -np 4 /home/liao/hello
19212 19214 19214 19214 ?           -1 Ssl      0   0:03  |                               \_ /usr/lib64/mpich/bin/hydra_pmi_proxy --con
19214 19218 19218 19218 ?           -1 Ssl      0   0:02  |                                   \_ /home/liao/hello
 1311 19225 19225 19225 ?           -1 Ss       0   0:00  \_ sshd: ZHduan [priv]
19225 19229 19225 19225 ?           -1 S     1001   0:00      \_ sshd: ZHduan@pts/1
19229 19230 19230 19230 pts/1    19325 Ss    1001   0:00          \_ -bash
19230 19283 19283 19230 pts/1    19325 S        0   0:00              \_ su root
19283 19289 19289 19230 pts/1    19325 S        0   0:00                  \_ bash
19289 19325 19325 19230 pts/1    19325 R+       0   0:00                      \_ ps axjf
    1  1312  1312  1312 ?           -1 Ssl      0   0:00 /usr/sbin/rsyslogd -n
    1  1326  1326  1326 ?           -1 Ss       0   0:00 /usr/sbin/atd -f
    1  1330  1330  1330 ?           -1 Ss       0   0:01 /usr/sbin/crond -n
    1  2022  2022  2022 ?           -1 Ss       0   0:00 /usr/libexec/postfix/master -w
 2022  2045  2022  2022 ?           -1 S       89   0:00  \_ qmgr -l -t unix -u
 2022 18841  2022  2022 ?           -1 S       89   0:00  \_ pickup -l -t unix -u
    1  2490  2490  2490 ?           -1 Ss       0   0:01 login -- liao
 2490  2572  2572  2572 tty1      2643 Ss    1000   0:00  \_ -bash
 2572  2637  2637  2572 tty1      2643 S        0   0:00      \_ su
 2637  2643  2643  2572 tty1      2643 S+       0   0:00          \_ bash
    1 19252  1005  1005 ?           -1 Sl       0   0:00 /usr/sbin/abrt-dbus -t133
    1 19284 19284 19284 ?           -1 Ssl      0   0:00 /usr/libexec/fprintd

The dmesg output looks the same as the one we posted above.

  3. Attempt to run the MPI program with just one process and see what happens. So we did: I ran "sh run.sh mpirun -np 1 /home/liao/hello" without setting EMUL_LOCAL_PROCESSES, and the debug message said:
    [19751] DEBUG: Handling interrupt thread [19751] pthread: 0x7f82c483c7c0
    [19751] DEBUG: thread id [19751] last epoch was 110967543308 usec ago
    [19751] DEBUG: thread id [19751] reached min epoch duration (1000 usec)
    [19751] DEBUG: read stall L2 cycles diff 123392; llc_hit 714; cycles diff remote_dram 83; local_dram 255
    [19751] DEBUG: overhead cycles: 31541; immediate overhead 31541; stall cycles: 31359; delay cycles: 20906
    [19751] DEBUG: injecting delay of 0 cycles (0 usec) - discounted overhead
    [19751] DEBUG: thread id [19751] last epoch was 24 usec ago
    [19751] DEBUG: thread id [19751] last epoch was 81 usec ago
    [19751] DEBUG: thread id [19751] last epoch was 138 usec ago
    [19751] DEBUG: thread id [19751] last epoch was 195 usec ago
    [19751] DEBUG: thread id [19751] last epoch was 252 usec ago
    [19751] DEBUG: thread id [19751] last epoch was 309 usec ago
    [19751] DEBUG: thread id [19751] last epoch was 365 usec ago
    [19751] DEBUG: thread id [19751] last epoch was 421 usec ago
    [19751] DEBUG: thread id [19751] last epoch was 478 usec ago
    [19751] DEBUG: thread id [19751] last epoch was 535 usec ago
    [19751] DEBUG: thread id [19751] last epoch was 591 usec ago
    [19751] DEBUG: thread id [19751] last epoch was 648 usec ago
    [19751] DEBUG: thread id [19751] last epoch was 705 usec ago
    [19751] DEBUG: thread id [19751] last epoch was 761 usec ago
    [19751] DEBUG: thread id [19751] last epoch was 818 usec ago
    [19751] DEBUG: thread id [19751] last epoch was 874 usec ago
    [19751] DEBUG: thread id [19751] last epoch was 930 usec ago
    [19751] DEBUG: thread id [19751] last epoch was 987 usec ago
    [19751] DEBUG: thread id [19751] last epoch was 1043 usec ago
    [19751] DEBUG: thread id [19751] reached max epoch duration (1000 usec)
    [19751] DEBUG: interrupting thread [19751]
    [19753] WARNING: No complete memory-controller pci topology found in /tmp/mc_pci_bus
    [19753] WARNING: No EMUL_LOCAL_PROCESSES variable set, skipping rank setting

    And the program hung, so we set EMUL_LOCAL_PROCESSES to 1. This time it works. The debug message is:

Turbo Boost disabled for all CPUs
/home/ZHduan/HME-q/scripts/../build/src/lib/libnvmemul.so
/home/ZHduan/HME-q/scripts/../nvmemul.ini
[19940] DEBUG: [19940] INFO: Detected CPU model 'Haswell Xeon'
[19940] DEBUG: Possible NUMA nodes are 1024
[19940] DEBUG: NUMA nodes allowed are 1024
[19940] DEBUG: NUMA configured CPUs are 20
[19940] INFO: 10 CPUs on physical node 0
[19940] INFO: 10 CPUs on physical node 1
[19940] INFO: Loading memory-controller pci topology from /tmp/mc_pci_bus
[19940] INFO: node: 1, pci addr: 7f:14.0
[19940] INFO: node: 1, pci bus: 0x7f
[19940] INFO: node: 1, pci addr: 7f:14.1
[19940] INFO: node: 1, pci addr: 7f:17.0
[19940] INFO: node: 1, pci addr: 7f:17.1
[19940] WARNING: No complete memory-controller pci topology found in /tmp/mc_pci_bus
[19940] INFO: Number of physical nodes 2
[19940] INFO: Number of virtual nodes 1
[19940] INFO: measuring latency: nchains 1, nelems 4096000, elem_sz 64, access_sz 8, from_node_id 0, to_node_id 0
[19940] INFO: measuring latency: latency is 81 ns
[19940] INFO: measuring latency: nchains 1, nelems 4096000, elem_sz 64, access_sz 8, from_node_id 0, to_node_id 1
[19940] INFO: measuring latency: latency is 122 ns
[19940] INFO: Fusing physical nodes 0 1 into virtual node 0
[19940] INFO: Initializing latency model
[19940] WARNING: EMUL_PROCESSES_PER_SYSTEM value is 1, skipping rank setting
[19940] INFO: Registering thread tid [19940]
[19940] INFO: Binding thread tid [19940] pthread: 0x7fa3c9c6c7c0 on processor 0
[19940] DEBUG: thread id [19940] last epoch was 111177225411 usec ago
[19940] DEBUG: thread id [19940] reached max epoch duration (1000 usec)
[19940] DEBUG: interrupting thread [19940]
[19940] DEBUG: Handling interrupt thread [19940] pthread: 0x7fa3c9c6c7c0
[19940] DEBUG: thread id [19940] last epoch was 111177225471 usec ago
[19940] DEBUG: thread id [19940] reached min epoch duration (1000 usec)
[19940] DEBUG: read stall L2 cycles diff 129025; llc_hit 1025; cycles diff remote_dram 61; local_dram 237
[19940] DEBUG: overhead cycles: 30621; immediate overhead 30621; stall cycles: 24169; delay cycles: 15452
[19940] DEBUG: injecting delay of 0 cycles (0 usec) - discounted overhead
[19940] DEBUG: thread id [19940] last epoch was 27 usec ago
[19940] DEBUG: thread id [19940] last epoch was 84 usec ago
[19940] DEBUG: thread id [19940] last epoch was 140 usec ago
[19940] DEBUG: thread id [19940] last epoch was 197 usec ago
[19940] DEBUG: thread id [19940] last epoch was 253 usec ago
[19940] DEBUG: thread id [19940] last epoch was 309 usec ago
[19940] DEBUG: thread id [19940] last epoch was 365 usec ago
[19940] DEBUG: thread id [19940] last epoch was 421 usec ago
[19940] DEBUG: thread id [19940] last epoch was 478 usec ago
[19940] DEBUG: thread id [19940] last epoch was 537 usec ago
[19940] DEBUG: thread id [19940] last epoch was 593 usec ago
[19940] DEBUG: thread id [19940] last epoch was 712 usec ago
[19940] DEBUG: thread id [19940] last epoch was 770 usec ago
[19940] DEBUG: thread id [19940] last epoch was 827 usec ago
[19940] DEBUG: thread id [19940] last epoch was 883 usec ago
[19940] DEBUG: thread id [19940] last epoch was 939 usec ago
[19940] DEBUG: thread id [19940] last epoch was 995 usec ago
[19940] DEBUG: thread id [19940] last epoch was 1052 usec ago
[19940] DEBUG: thread id [19940] reached max epoch duration (1000 usec)
[19940] DEBUG: interrupting thread [19940]

[19942] WARNING: No complete memory-controller pci topology found in /tmp/mc_pci_bus
[19942] WARNING: EMUL_PROCESSES_PER_SYSTEM value is 1, skipping rank setting
[19945] WARNING: No complete memory-controller pci topology found in /tmp/mc_pci_bus
[19945] WARNING: EMUL_PROCESSES_PER_SYSTEM value is 1, skipping rank setting
hello-in process
[19945] DEBUG: [19945] INFO: Detected CPU model 'Haswell Xeon'
[19945] DEBUG: Possible NUMA nodes are 1024
[19945] DEBUG: NUMA nodes allowed are 1024
[19945] DEBUG: NUMA configured CPUs are 20
[19945] INFO: 10 CPUs on physical node 0
[19945] INFO: 10 CPUs on physical node 1
[19945] INFO: Loading memory-controller pci topology from /tmp/mc_pci_bus
[19945] INFO: node: 1, pci addr: 7f:14.0
[19945] INFO: node: 1, pci bus: 0x7f
[19945] INFO: node: 1, pci addr: 7f:14.1
[19945] INFO: node: 1, pci addr: 7f:17.0
[19945] INFO: node: 1, pci addr: 7f:17.1
[19945] INFO: Number of physical nodes 2
[19945] INFO: Number of virtual nodes 1
[19945] INFO: measuring latency: nchains 1, nelems 4096000, elem_sz 64, access_sz 8, from_node_id 0, to_node_id 0
[19945] INFO: measuring latency: latency is 81 ns
[19945] INFO: measuring latency: nchains 1, nelems 4096000, elem_sz 64, access_sz 8, from_node_id 0, to_node_id 1
[19945] INFO: measuring latency: latency is 123 ns
[19945] INFO: Fusing physical nodes 0 1 into virtual node 0
[19945] INFO: Initializing latency model
[19945] INFO: Registering thread tid [19945]
[19945] INFO: Binding thread tid [19945] pthread: 0x7f0efa5b3c00 on processor 0
[19945] DEBUG: thread id [19945] last epoch was 111181988436 usec ago
[19945] DEBUG: thread id [19945] reached max epoch duration (1000 usec)
[19945] DEBUG: interrupting thread [19945]
[19945] DEBUG: Handling interrupt thread [19945] pthread: 0x7f0efa5b3c00
[19945] DEBUG: thread id [19945] last epoch was 111181988476 usec ago
[19945] DEBUG: thread id [19945] reached min epoch duration (1000 usec)
[19945] DEBUG: read stall L2 cycles diff 254791; llc_hit 1436; cycles diff remote_dram 163; local_dram 811
[19945] DEBUG: overhead cycles: 21212; immediate overhead 21212; stall cycles: 49213; delay cycles: 30808
[19945] DEBUG: injecting delay of 9596 cycles (4 usec) - discounted overhead
[19945] DEBUG: thread id [19945] last epoch was 20 usec ago
[19945] DEBUG: thread id [19945] last epoch was 76 usec ago
[19945] DEBUG: thread id [19945] last epoch was 131 usec ago
[19945] DEBUG: thread id [19945] last epoch was 160 usec ago
[19945] DEBUG: thread id [19945] last epoch was 186 usec ago
[19945] DEBUG: thread id [19945] last epoch was 320 usec ago
[19945] DEBUG: thread id [19945] last epoch was 380 usec ago
[19945] DEBUG: thread id [19945] last epoch was 436 usec ago
[19945] DEBUG: thread id [19945] last epoch was 491 usec ago
[19945] DEBUG: thread id [19945] last epoch was 547 usec ago
[19945] DEBUG: thread id [19945] last epoch was 607 usec ago
[19945] DEBUG: thread id [19945] last epoch was 668 usec ago
[19945] DEBUG: thread id [19945] last epoch was 725 usec ago
[19945] DEBUG: thread id [19945] last epoch was 782 usec ago
[19945] DEBUG: thread id [19945] last epoch was 838 usec ago
[19945] DEBUG: thread id [19945] last epoch was 894 usec ago
[19945] DEBUG: thread id [19945] last epoch was 949 usec ago
[19945] DEBUG: thread id [19945] last epoch was 1005 usec ago
[19945] DEBUG: thread id [19945] reached max epoch duration (1000 usec)
[19945] DEBUG: interrupting thread [19945]
[19945] DEBUG: Handling interrupt thread [19945] pthread: 0x7f0efa5b3c00
[19945] DEBUG: thread id [19945] last epoch was 1079 usec ago
[19945] DEBUG: thread id [19945] reached min epoch duration (1000 usec)
[19945] DEBUG: read stall L2 cycles diff 326444; llc_hit 3851; cycles diff remote_dram 25; local_dram 1304
[19945] DEBUG: overhead cycles: 141910; immediate overhead 141910; stall cycles: 6531; delay cycles: 4088
[19945] DEBUG: injecting delay of 0 cycles (0 usec) - discounted overhead
[19945] DEBUG: thread id [19945] last epoch was 46 usec ago
[19945] DEBUG: thread id [19945] last epoch was 103 usec ago
[19945] DEBUG: thread id [19945] last epoch was 160 usec ago
[19945] DEBUG: thread id [19945] last epoch was 217 usec ago
[19945] DEBUG: thread id [19945] last epoch was 274 usec ago
[19945] DEBUG: thread id [19945] last epoch was 331 usec ago
[19945] DEBUG: thread id [19945] last epoch was 394 usec ago
[19945] DEBUG: thread id [19945] last epoch was 450 usec ago
[19945] DEBUG: thread id [19945] last epoch was 507 usec ago
[19945] DEBUG: thread id [19945] last epoch was 564 usec ago
[19945] DEBUG: thread id [19945] last epoch was 621 usec ago
[19945] DEBUG: thread id [19945] last epoch was 679 usec ago
[19945] DEBUG: thread id [19945] last epoch was 736 usec ago
[19945] DEBUG: thread id [19945] last epoch was 792 usec ago
[19945] DEBUG: thread id [19945] last epoch was 848 usec ago
[19945] DEBUG: thread id [19945] last epoch was 903 usec ago
[19945] DEBUG: thread id [19945] last epoch was 959 usec ago
[19945] DEBUG: thread id [19945] last epoch was 1016 usec ago
[19945] DEBUG: thread id [19945] reached max epoch duration (1000 usec)
[19945] DEBUG: interrupting thread [19945]
[19945] DEBUG: Handling interrupt thread [19945] pthread: 0x7f0efa5b3c00
[19945] DEBUG: thread id [19945] last epoch was 1068 usec ago
[19945] DEBUG: thread id [19945] reached min epoch duration (1000 usec)
[19945] DEBUG: read stall L2 cycles diff 585597; llc_hit 8919; cycles diff remote_dram 478; local_dram 983
[19945] DEBUG: overhead cycles: 277821; immediate overhead 139999; stall cycles: 132865; delay cycles: 83175
[19945] DEBUG: injecting delay of 0 cycles (0 usec) - discounted overhead
[19945] DEBUG: thread id [19945] last epoch was 5 usec ago
[19945] DEBUG: thread id [19945] last epoch was 16 usec ago
hello-in init done
Hello World! Process 0 of 1 on node11
[19945] DEBUG: thread id [19945] last epoch was 85 usec ago
[19945] DEBUG: thread id [19945] last epoch was 142 usec ago
[19945] INFO: Registering thread tid [19945]
[19945] INFO: Binding thread tid [19945] pthread: 0x7f0efa5b3c00 on processor 1
[19945] DEBUG: thread id [19945] last epoch was 111181991204 usec ago
[19945] DEBUG: thread id [19945] reached min epoch duration (1000 usec)
[19945] DEBUG: thread id [19945] last epoch was 111181991206 usec ago
[19945] DEBUG: thread id [19945] last epoch was 111181991215 usec ago
[19945] DEBUG: thread id [19945] reached min epoch duration (1000 usec)
[19945] DEBUG: thread id [19945] reached max epoch duration (1000 usec)
[19945] DEBUG: read stall L2 cycles diff 384127; llc_hit 3951; cycles diff remote_dram 345; local_dram 631
[19945] DEBUG: interrupting thread [19945]
[19945] DEBUG: overhead cycles: 217299; immediate overhead 22653; stall cycles: 110403; delay cycles: 69114
[19945] DEBUG: injecting delay of 0 cycles (0 usec) - discounted overhead
[19945] DEBUG: Handling interrupt thread [19945] pthread: 0x7f0efa5b3c00
[19945] DEBUG: thread id [19945] last epoch was 4 usec ago
[19945] DEBUG: thread id [19945] last epoch was 15 usec ago
[19942] DEBUG: [19942] INFO: Detected CPU model 'Haswell Xeon'
[19942] DEBUG: Possible NUMA nodes are 1024
[19942] DEBUG: NUMA nodes allowed are 1024
[19942] DEBUG: NUMA configured CPUs are 20
[19942] INFO: 10 CPUs on physical node 0
[19942] INFO: 10 CPUs on physical node 1
[19942] INFO: Loading memory-controller pci topology from /tmp/mc_pci_bus
[19942] INFO: node: 1, pci addr: 7f:14.0
[19942] INFO: node: 1, pci bus: 0x7f
[19942] INFO: node: 1, pci addr: 7f:14.1
[19942] INFO: node: 1, pci addr: 7f:17.0
[19942] INFO: node: 1, pci addr: 7f:17.1
[19942] INFO: Number of physical nodes 2
[19942] INFO: Number of virtual nodes 1
[19942] INFO: measuring latency: nchains 1, nelems 4096000, elem_sz 64, access_sz 8, from_node_id 0, to_node_id 0
[19942] INFO: measuring latency: latency is 81 ns
[19942] INFO: measuring latency: nchains 1, nelems 4096000, elem_sz 64, access_sz 8, from_node_id 0, to_node_id 1
[19942] INFO: measuring latency: latency is 133 ns
[19942] INFO: Fusing physical nodes 0 1 into virtual node 0
[19942] INFO: Initializing latency model
[19942] INFO: Registering thread tid [19942]
[19942] INFO: Binding thread tid [19942] pthread: 0x7fc4574de7c0 on processor 0
[19942] DEBUG: thread id [19942] last epoch was 111179658945 usec ago
[19942] DEBUG: thread id [19942] reached max epoch duration (1000 usec)
[19942] DEBUG: interrupting thread [19942]
[19942] DEBUG: Handling interrupt thread [19942] pthread: 0x7fc4574de7c0
[19942] DEBUG: thread id [19942] last epoch was 111179658961 usec ago
[19942] DEBUG: thread id [19942] reached min epoch duration (1000 usec)
[19942] DEBUG: read stall L2 cycles diff 239224; llc_hit 1238; cycles diff remote_dram 146; local_dram 690
[19942] DEBUG: overhead cycles: 18026; immediate overhead 18026; stall cycles: 50912; delay cycles: 25647
[19942] DEBUG: injecting delay of 7621 cycles (3 usec) - discounted overhead
[19942] DEBUG: thread id [19942] last epoch was 37 usec ago
[19942] DEBUG: thread id [19942] last epoch was 92 usec ago
[19942] DEBUG: thread id [19942] last epoch was 147 usec ago
[19942] DEBUG: thread id [19942] last epoch was 205 usec ago
[19942] DEBUG: thread id [19942] last epoch was 229 usec ago
[19942] DEBUG: thread id [19942] last epoch was 287 usec ago
[19942] DEBUG: thread id [19942] last epoch was 341 usec ago
[19942] DEBUG: thread id [19942] last epoch was 396 usec ago
[19942] DEBUG: thread id [19942] last epoch was 450 usec ago
[19942] DEBUG: thread id [19942] last epoch was 504 usec ago
[19942] DEBUG: thread id [19942] last epoch was 559 usec ago
[19942] DEBUG: thread id [19942] last epoch was 613 usec ago
[19942] DEBUG: thread id [19942] last epoch was 668 usec ago
[19942] DEBUG: thread id [19942] last epoch was 722 usec ago
[19942] DEBUG: thread id [19942] last epoch was 776 usec ago
[19942] DEBUG: thread id [19942] last epoch was 831 usec ago
[19942] DEBUG: thread id [19942] last epoch was 885 usec ago
[19942] DEBUG: thread id [19942] last epoch was 939 usec ago
[19942] DEBUG: thread id [19942] last epoch was 994 usec ago
[19942] DEBUG: thread id [19942] last epoch was 1048 usec ago
[19942] DEBUG: thread id [19942] reached max epoch duration (1000 usec)
[19942] DEBUG: interrupting thread [19942]

Then we tried it with 4 processes. It loops again, so I interrupted it. Here is the output:

```
[20205] DEBUG: thread id [20205] last epoch was 965 usec ago
[20205] DEBUG: thread id [20205] last epoch was 1019 usec ago
[20205] DEBUG: thread id [20205] reached max epoch duration (1000 usec)
[20205] DEBUG: interrupting thread [20205]
[20205] DEBUG: Unsetting process local rank
[mpiexec@node11] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
[mpiexec@node11] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec@node11] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:217): launcher returned error waiting for completion
[mpiexec@node11] main (./ui/mpich/mpiexec.c:331): process manager error waiting for completion
[20203] DEBUG: Unsetting process local rank
[20203] ERROR: failed to unset process local rank
```

So what is going wrong between Quartz and this basic MPI demo?

Gumi-presentation-by-Dzh commented 6 years ago

You can see the hello world MPI test code above; we think it is just a simple MPI program, so we did not expect such a problem at all. Our intention was to use Quartz to test a program we wrote with MPI, so we first ran it with this demo, but now we have this problem. Is there something wrong with this MPI demo, or did I use Quartz with the wrong method or configuration?

Gumi-presentation-by-Dzh commented 6 years ago

If there are other ways to solve this problem, or to find out where it lies, we are willing to try them, because we want to use Quartz to study MPI-related workloads.

guimagalhaes commented 6 years ago

I have seen many complaints about this error when using MVAPICH. I would try with OpenMPI. You are running mpirun itself under the emulator; let's change this strategy: compile the MPI program (hello) linking it against the NVM emulator library, and do not use the NVM_EMUL/scripts/runenv.sh script (via your run.sh). You may need to set some environment variables such as NVMEMUL_INI, but do not set LD_PRELOAD. After linking the 'hello' program with the emulator library, execute it with mpirun like any other MPI program.
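
A rough sketch of that strategy, using the paths that appear in this thread; the exact compile and link flags are assumptions, not a verified recipe:

```sh
# link hello directly against the emulator library instead of preloading it
QUARTZ=/home/ZHduan/HME-q
mpicc hello.c -o hello \
    -L$QUARTZ/build/src/lib -lnvmemul -Wl,-rpath,$QUARTZ/build/src/lib

# point the library at its configuration (do not set LD_PRELOAD)
export NVMEMUL_INI=$QUARTZ/nvmemul.ini

# run it like any other MPI program, without run.sh / runenv.sh
mpirun -np 4 ./hello
```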

hkundnani commented 5 years ago

@Gumi-presentation-by-Dzh I have a question: why are you running your application with sh run.sh mpirun -np 4 /home/liao/hello instead of mpirun -np 4 run.sh /home/liao/hello?

Please correct me if I am wrong, but the former will run the emulator on one node and the application on 4 nodes (assuming you give -host), while the latter will run the emulator on 4 nodes along with the application.
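
For clarity, the two launch styles being compared (both commands are taken from this thread; the host list is only an example):

```sh
# (a) the emulator wrapper runs mpirun on the launch node
sh run.sh mpirun -np 4 /home/liao/hello

# (b) mpirun starts the wrapper on every node, so each rank runs inside
#     its own emulator environment
mpirun -np 4 -host inv08,inv09,inv11,inv14 run.sh /home/liao/hello
```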

hkundnani commented 5 years ago

@Gumi-presentation-by-Dzh Also, I tried the sample code you posted here with the emulator, and in my case it worked without hanging. But I am facing a similar problem when running my application, PowerGraph, with the Quartz emulator using mpirun on multiple nodes. Can you please provide any suggestions if you were able to resolve your issue?

Gumi-presentation-by-Dzh commented 5 years ago

@hkundnani Sorry for the late answer.

  1. In "run.sh", we just to package the Quartz's interface which through the "setupdev.sh" to load the module and set the environmental variable, the code as we mentioned above:

The command I use is: sh run.sh mpirun -np 4 /home/liao/hello

The run.sh is:

```sh
_INPUT_COMMAN=$@
_LOAD_PATH="/home/ZHduan/HME-q/scripts/setupdev.sh"
_RUN_PATH="/home/ZHduan/HME-q/scripts/runenv.sh"
_ROOT="sudo"
_LOAD="load"
_RELOAD="reload"

echo ${_INPUT_COMMAN}

Init() {
    ${_ROOT} ${_LOAD_PATH} ${_RELOAD}
    ${_ROOT} ${_LOAD_PATH} ${_LOAD}
}

Start() {
    ${_RUN_PATH} ${_INPUT_COMMAN}
}

Init
Start
```

You can find more detail in the README and in setupdev.sh. When the emulator is running, every application (or command) launched under it is emulated on NVM or hybrid memory. We use sh run.sh mpirun -np 4 /home/liao/hello to test the application running with multiple processes. So, as you said, this runs the application on 4 ranks under the emulator.

I am very skeptical about whether 'mpirun -np 4 run.sh /home/liao/hello' would create 4 Quartz emulators. Based on my own understanding, it does not create 4 emulators to run the test program at the same time. Some ps logs would show what actually happens.

  2. I am not familiar with PowerGraph, and we have not solved the mpirun problem, so there are no suggestions we can offer.

hkundnani commented 5 years ago

@Gumi-presentation-by-Dzh Thank you for your reply. I have attached the log for your sample code using both commands. The log was huge, so I have only attached the starting lines to convey my point. For sh run.sh mpirun -np 4 -host inv08,inv09,inv11,inv14 /home/liao/hello (I added hosts because on my machines I ran it this way):

scripts/../build/src/lib/libnvmemul.so
scripts/../nvmemul.ini
[109346] DEBUG: [109346] INFO: Detected CPU model 'Haswell Xeon'
[109346] DEBUG: Possible NUMA nodes are 1024
[109346] DEBUG: NUMA nodes allowed are 1024
[109346] DEBUG: NUMA configured CPUs are 40
[109346] INFO: 19 CPUs on physical node 0
[109346] INFO: 20 CPUs on physical node 1
[109346] INFO: Loading memory-controller pci topology from /tmp/mc_pci_bus
[109346] INFO: node: 0, pci addr: 7f:14.0
[109346] INFO: node: 0, pci bus: 0x7f
[109346] INFO: node: 0, pci addr: 7f:14.1
[109346] INFO: node: 0, pci addr: 7f:17.0
[109346] INFO: node: 0, pci addr: 7f:17.1
[109346] INFO: node: 1, pci addr: ff:14.0
[109346] INFO: node: 1, pci bus: 0xff
[109346] INFO: node: 1, pci addr: ff:14.1
[109346] INFO: node: 1, pci addr: ff:17.0
[109346] INFO: node: 1, pci addr: ff:17.1
[109346] INFO: Number of physical nodes 2
[109346] INFO: Number of virtual nodes 1
[109346] INFO: measuring latency: nchains 1, nelems 4096000, elem_sz 64, access_sz 8, from_node_id 0, to_node_id 0
[109346] INFO: measuring latency: latency is 88 ns
[109346] INFO: measuring latency: nchains 1, nelems 4096000, elem_sz 64, access_sz 8, from_node_id 0, to_node_id 1
[109346] INFO: measuring latency: latency is 128 ns
[109346] INFO: Fusing physical nodes 0 1 into virtual node 0
[109346] INFO: Initializing latency model
[109346] DEBUG: setting process local rank for 10 local processes
[109346] DEBUG: process local rank is 0 on system inv08
[109346] DEBUG: number of cpus is 19
[109346] DEBUG: no partitioning of CPUs, set next CPU to vnode 0 and cpu 0
[109346] INFO: Registering thread tid [109346]
[109346] INFO: Binding thread tid [109346] pthread: 0x7f7b4e2267c0 on processor 2
[109346] DEBUG: thread id [109346] last epoch was 85 usec ago
[109346] DEBUG: thread id [109346] last epoch was 150 usec ago
[109346] DEBUG: thread id [109346] last epoch was 177 usec ago
[109346] DEBUG: thread id [109346] last epoch was 180 usec ago
------------------------------------------------------------------------------------------------------------------------
[109346] DEBUG: thread id [109346] last epoch was 9872 usec ago
[109346] DEBUG: thread id [109348] last epoch was 7984 usec ago
[109346] DEBUG: thread id [109346] last epoch was 9926 usec ago
[10934[109349] DEBUG: [109349] INFO: Detected CPU model 'Haswell Xeon'
[109349] DEBUG: Possible NUMA nodes are 1024
[109349] DEBUG: NUMA nodes allowed are 1024
[109349] DEBUG: NUMA configured CPUs are 40
[109349] INFO: 19 CPUs on physical node 0
[109349] INFO: 20 CPUs on physical node 1
[109349] INFO: Loading memory-controller pci topology from /tmp/mc_pci_bus
[109349] INFO: node: 0, pci addr: 7f:14.0
[109349] INFO: node: 0, pci bus: 0x7f
[109349] INFO: node: 0, pci addr: 7f:14.1
[109349] INFO: node: 0, pci addr: 7f:17.0
[109349] INFO: node: 0, pci addr: 7f:17.1
[109349] INFO: node: 1, pci addr: ff:14.0
[109349] INFO: node: 1, pci bus: 0xff
[109349] INFO: node: 1, pci addr: ff:14.1
[109349] INFO: node: 1, pci addr: ff:17.0
[109349] INFO: node: 1, pci addr: ff:17.1
[109349] INFO: Number of physical nodes 2
[109349] INFO: Number of virtual nodes 1
[109349] INFO: measuring latency: nchains 1, nelems 4096000, elem_sz 64, access_sz 8, from_node_id 0, to_node_id 0
[109349] INFO: measuring latency: latency is 90 ns
[109349] INFO: measuring latency: nchains 1, nelems 4096000, elem_sz 64, access_sz 8, from_node_id 0, to_node_id 1
[109349] INFO: measuring latency: latency is 133 ns
[109349] INFO: Fusing physical nodes 0 1 into virtual node 0
[109349] INFO: Initializing latency model
[109349] DEBUG: setting process local rank for 10 local processes
[109349] DEBUG: read from file current max rank 1
[109349] DEBUG: write to file new max rank 2
[109349] DEBUG: process local rank is 1 on system inv08
[109349] DEBUG: number of cpus is 19
[109349] DEBUG: no partitioning of CPUs, set next CPU to vnode 0 and cpu 2
[109349] INFO: Registering thread tid [109349]
[109349] INFO: Binding thread tid [109349] pthread: 0x7ff2f6d778c0 on processor 4
[109349] DEBUG: thread id [109349] last epoch was 371 usec ago
[109349] DEBUG: thread id [109349] last epoch was 435 usec ago
[109349] DEBUG: thread id [109349] last epoch was 493 usec ago
-------------------------------------------------------------------------------------------------------------------------
[109346] DEBUG: thread id [109346] last epoch was 9690 usec ago
[109346] DEBUG: thread id [109348] last epoch was 7782 usec ago
[109346] DEBUG: thread id [109346] last epoch was 9744 u[109350] DEBUG: [109350] INFO: Detected CPU model 'Haswell Xeon'
[109350] DEBUG: Possible NUMA nodes are 1024
[109350] DEBUG: NUMA nodes allowed are 1024
[109350] DEBUG: NUMA configured CPUs are 40
[109350] INFO: 19 CPUs on physical node 0
[109350] INFO: 20 CPUs on physical node 1
[109350] INFO: Loading memory-controller pci topology from /tmp/mc_pci_bus
[109350] INFO: node: 0, pci addr: 7f:14.0
[109350] INFO: node: 0, pci bus: 0x7f
[109350] INFO: node: 0, pci addr: 7f:14.1
[109350] INFO: node: 0, pci addr: 7f:17.0
[109350] INFO: node: 0, pci addr: 7f:17.1
[109350] INFO: node: 1, pci addr: ff:14.0
[109350] INFO: node: 1, pci bus: 0xff
[109350] INFO: node: 1, pci addr: ff:14.1
[109350] INFO: node: 1, pci addr: ff:17.0
[109350] INFO: node: 1, pci addr: ff:17.1
[109350] INFO: Number of physical nodes 2
[109350] INFO: Number of virtual nodes 1
[109350] INFO: measuring latency: nchains 1, nelems 4096000, elem_sz 64, access_sz 8, from_node_id 0, to_node_id 0
[109350] INFO: measuring latency: latency is 89 ns
[109350] INFO: measuring latency: nchains 1, nelems 4096000, elem_sz 64, access_sz 8, from_node_id 0, to_node_id 1
[109350] INFO: measuring latency: latency is 135 ns
[109350] INFO: Fusing physical nodes 0 1 into virtual node 0
[109350] INFO: Initializing latency model
[109350] DEBUG: setting process local rank for 10 local processes
[109350] DEBUG: read from file current max rank 2
[109350] DEBUG: write to file new max rank 3
[109350] DEBUG: process local rank is 2 on system inv08
[109350] DEBUG: number of cpus is 19
[109350] DEBUG: no partitioning of CPUs, set next CPU to vnode 0 and cpu 4
[109350] INFO: Registering thread tid [109350]
[109350] INFO: Binding thread tid [109350] pthread: 0x7f9104f048c0 on processor 6
[109350] DEBUG: thread id [109350] last epoch was 65 usec ago
[109350] DEBUG: thread id [109350] last epoch was 125 usec ago
[109350] DEBUG: thread id [109350] last epoch was 182 usec ago
----------------------------------------------------------------------------------------------------------------------------
[109346] DEBUG: thread id [109346] last epoch was 6781 usec ago
[109346] DEBUG: thread id [109346] last epoch was 6782 usec ago
[109346] DEBUG: thread i[109353] DEBUG: [109353] INFO: Detected CPU model 'Haswell Xeon'
[109353] DEBUG: Possible NUMA nodes are 1024
[109353] DEBUG: NUMA nodes allowed are 1024
[109353] DEBUG: NUMA configured CPUs are 40
[109353] INFO: 19 CPUs on physical node 0
[109353] INFO: 20 CPUs on physical node 1
[109353] INFO: Loading memory-controller pci topology from /tmp/mc_pci_bus
[109353] INFO: node: 0, pci bus: 0x7f
[109353] INFO: node: 0, pci addr: 7f:14.1
[109353] INFO: node: 0, pci addr: 7f:17.0
[109353] INFO: node: 0, pci addr: 7f:17.1
[109353] INFO: node: 1, pci addr: ff:14.0
[109353] INFO: node: 1, pci bus: 0xff
[109353] INFO: node: 1, pci addr: ff:14.1
[109353] INFO: node: 1, pci addr: ff:17.0
[109353] INFO: node: 1, pci addr: ff:17.1
[109353] INFO: Number of physical nodes 2
[109353] INFO: Number of virtual nodes 1
[109353] INFO: measuring latency: nchains 1, nelems 4096000, elem_sz 64, access_sz 8, from_node_id 0, to_node_id 0
d [109346] last epoch was 7094 usec ago
[109346] DEBUG: thread id [109346] last epoch was 7103 usec ago
[109346] DEBUG: thread id [109346] last epoch was 7103 usec ago
[109346] DEBUG: thread id [109348] last epoch was 7060 usec ago

The output of the above command is shown below, which I think is the intended output. But if you look at the log, it detects the 'Haswell Xeon' processor four times and the emulation also happened on each node. Also, I can see four process ids in the log (109346, 109349, 109350, 109353).

Hello World! Process 0 of 4 on inv08
Hello World! Process 1 of 4 on inv09
Hello World! Process 2 of 4 on inv11
Hello World! Process 3 of 4 on inv14

On the other hand, for the command mpirun -np 4 -host inv08,inv09,inv11,inv14 run.sh /home/liao/hello:

scripts/../build/src/lib/libnvmemul.so
scripts/../nvmemul.ini
scripts/../build/src/lib/libnvmemul.so
scripts/../nvmemul.ini
scripts/../build/src/lib/libnvmemul.so
scripts/../nvmemul.ini
scripts/../build/src/lib/libnvmemul.so
scripts/../nvmemul.ini
[125020] DEBUG: [125020] INFO: Detected CPU model 'Haswell Xeon'
[125020] DEBUG: Possible NUMA nodes are 1024
[125020] DEBUG: NUMA nodes allowed are 1024
[125020] DEBUG: NUMA configured CPUs are 40
[125020] INFO: 20 CPUs on physical node 0
[125020] INFO: 20 CPUs on physical node 1
[125020] INFO: Loading memory-controller pci topology from /tmp/mc_pci_bus
[125020] INFO: node: 0, pci addr: 7f:14.0
[125020] INFO: node: 0, pci bus: 0x7f
[125020] INFO: node: 0, pci addr: 7f:14.1
[125020] INFO: node: 0, pci addr: 7f:17.0
[125020] INFO: node: 0, pci addr: 7f:17.1
[125020] INFO: Number of physical nodes 2
[125020] INFO: Number of virtual nodes 1
[125020] INFO: measuring latency: nchains 1, nelems 4096000, elem_sz 64, access_sz 8, from_node_id 0, to_node_id 0
[10121] DEBUG: [10121] INFO: Detected CPU model 'Haswell Xeon'
[10121] DEBUG: Possible NUMA nodes are 1024
[10121] DEBUG: NUMA nodes allowed are 1024
[10121] DEBUG: NUMA configured CPUs are 40
[26005] DEBUG: [26005] INFO: Detected CPU model 'Haswell Xeon'
[10121] INFO: 20 CPUs on physical node 0
[26005] DEBUG: Possible NUMA nodes are 1024
[26005] DEBUG: NUMA nodes allowed are 1024
[26005] DEBUG: NUMA configured CPUs are 40
[10121] INFO: 20 CPUs on physical node 1
[26005] INFO: 20 CPUs on physical node 0
[10121] INFO: Loading memory-controller pci topology from /tmp/mc_pci_bus
[26005] INFO: 20 CPUs on physical node 1
[10121] INFO: node: 0, pci addr: 7f:14.0
[10121] INFO: node: 0, pci bus: 0x7f
[10121] INFO: node: 0, pci addr: 7f:14.1
[10121] INFO: node: 0, pci addr: 7f:17.0
[10121] INFO: node: 0, pci addr: 7f:17.1
[10121] INFO: Number of physical nodes 2
[10121] INFO: Number of virtual nodes 1
[26005] INFO: Loading memory-controller pci topology from /tmp/mc_pci_bus
[26005] INFO: Number of physical nodes 2
[26005] INFO: Number of virtual nodes 1
[10121] INFO: measuring latency: nchains 1, nelems 4096000, elem_sz 64, access_sz 8, from_node_id 0, to_node_id 0
[26005] INFO: measuring latency: nchains 1, nelems 4096000, elem_sz 64, access_sz 8, from_node_id 0, to_node_id 0
[109372] DEBUG: [109372] INFO: Detected CPU model 'Haswell Xeon'
[109372] DEBUG: Possible NUMA nodes are 1024
[109372] DEBUG: NUMA nodes allowed are 1024
[109372] DEBUG: NUMA configured CPUs are 40
[109372] INFO: 19 CPUs on physical node 0
[109372] INFO: 20 CPUs on physical node 1
[109372] INFO: Loading memory-controller pci topology from /tmp/mc_pci_bus
[109372] INFO: node: 0, pci addr: 7f:14.0
[109372] INFO: node: 0, pci bus: 0x7f
[109372] INFO: node: 0, pci addr: 7f:14.1
[109372] INFO: node: 0, pci addr: 7f:17.0
[109372] INFO: node: 0, pci addr: 7f:17.1
[109372] INFO: node: 1, pci addr: ff:14.0
[109372] INFO: node: 1, pci bus: 0xff
[109372] INFO: node: 1, pci addr: ff:14.1
[109372] INFO: node: 1, pci addr: ff:17.0
[109372] INFO: node: 1, pci addr: ff:17.1
[109372] INFO: Number of physical nodes 2
[109372] INFO: Number of virtual nodes 1
[109372] INFO: measuring latency: nchains 1, nelems 4096000, elem_sz 64, access_sz 8, from_node_id 0, to_node_id 0
[125020] INFO: measuring latency: latency is 82 ns
[125020] INFO: measuring latency: nchains 1, nelems 4096000, elem_sz 64, access_sz 8, from_node_id 0, to_node_id 1
[26005] INFO: measuring latency: latency is 83 ns
[26005] INFO: measuring latency: nchains 1, nelems 4096000, elem_sz 64, access_sz 8, from_node_id 0, to_node_id 1
[10121] INFO: measuring latency: latency is 83 ns
[10121] INFO: measuring latency: nchains 1, nelems 4096000, elem_sz 64, access_sz 8, from_node_id 0, to_node_id 1
[109372] INFO: measuring latency: latency is 90 ns
[109372] INFO: measuring latency: nchains 1, nelems 4096000, elem_sz 64, access_sz 8, from_node_id 0, to_node_id 1
[125020] INFO: measuring latency: latency is 125 ns
[125020] INFO: Fusing physical nodes 0 1 into virtual node 0
[125020] INFO: Initializing latency model
[125020] DEBUG: setting process local rank for 10 local processes
[125020] DEBUG: process local rank is 0 on system inv09
[125020] DEBUG: number of cpus is 20
[125020] DEBUG: partitioning CPUS, this process has CPUs from 0 and 1
[125020] DEBUG: disabling CPU 4
[125020] DEBUG: disabling CPU 6
[125020] DEBUG: disabling CPU 8
[125020] DEBUG: disabling CPU 10
[125020] DEBUG: disabling CPU 12
[125020] DEBUG: disabling CPU 14
[125020] DEBUG: disabling CPU 16
[125020] DEBUG: disabling CPU 18
[125020] DEBUG: disabling CPU 20
[125020] DEBUG: disabling CPU 22
[125020] DEBUG: disabling CPU 24
[125020] DEBUG: disabling CPU 26
[125020] DEBUG: disabling CPU 28
[125020] DEBUG: disabling CPU 30
[125020] DEBUG: disabling CPU 32
[125020] DEBUG: disabling CPU 34
[125020] DEBUG: disabling CPU 36
[125020] DEBUG: disabling CPU 38
[125020] INFO: Registering thread tid [125020]
[125020] INFO: Binding thread tid [125020] pthread: 0x7f709f5d9840 on processor 0
[125020] DEBUG: thread id [125020] last epoch was 109 usec ago
[125020] DEBUG: thread id [125020] last epoch was 176 usec ago
[125020] DEBUG: thread id [125020] last epoch was 234 usec ago
[125020] DEBUG: thread id [125020] last epoch was 292 usec ago
[125020] DEBUG: thread id [125020] last epoch was 349 usec ago
[125020] DEBUG: thread id [125020] last epoch was 407 usec ago
[125020] DEBUG: thread id [125020] last epoch was 464 usec ago

The output of the above command is shown below and is the same as the previous output. If you look at the log, it detects the 'Haswell Xeon' processor of four different machines and the emulation ran on all four nodes; I can also see four different processes for this command, with process ids 125020, 26005, 10121, 109372.

Hello World! Process 0 of 4 on inv08
Hello World! Process 1 of 4 on inv09
Hello World! Process 2 of 4 on inv11
Hello World! Process 3 of 4 on inv14

So based on the first and second logs, the only difference I see is that libnvmemul.so and the nvmemul.ini file are loaded once in the first case and four times in the second case. Can you please share your observations based on these logs? Also, what do you think is the difference between the two cases?

Gumi-presentation-by-Dzh commented 5 years ago

@hkundnani

Overall, the results are not much different. As you can see in my run.sh, we use ./scripts/runenv.sh to run the emulator. In runenv.sh, lines 45 and 46 set Quartz's environment variables, and lines 53 and 54 print scripts/../build/src/lib/libnvmemul.so and scripts/../nvmemul.ini; that is why you see those lines four times in the second case.

In my run.sh, we first reload Quartz's lib and then run runenv.sh. In the logs you provided, the emulator initialization appears before the test output, so we guess that in the end there is only one emulator at work.

In Quartz's README, you can see how to emulate your program without ./scripts/runenv.sh: you just need to load Quartz's lib and set the environment variables we mentioned from ./scripts/runenv.sh. Then every command you run will execute inside the emulation environment.
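
For completeness, a minimal sketch of that approach (the variable names come from this thread and the Quartz README; the exact paths are assumptions based on the run.sh above and depend on where Quartz is built):

# Load the kernel module once (as run.sh does via setupdev.sh)
sudo /home/ZHduan/HME-q/scripts/setupdev.sh load

# Set the emulator environment by hand instead of going through runenv.sh
export NVMEMUL_INI=/home/ZHduan/HME-q/nvmemul.ini
export LD_PRELOAD=/home/ZHduan/HME-q/build/src/lib/libnvmemul.so
export EMUL_LOCAL_PROCESSES=4

# Anything launched from this shell is now emulated, including MPI ranks spawned locally
mpirun -np 4 /home/liao/hello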

hkundnani commented 5 years ago

@Gumi-presentation-by-Dzh

I saw the way to emulate the program without using ./scripts/runenv.sh and was able to run your sample code without any issue. I also tried a few other applications, like GraphMat and Gemini, and was able to run those on the emulator without any issues. Only PowerGraph hangs, at a piece of code that calls MPI_Barrier inside a loop. Maybe I will need to dig deeper into the code.

Thank you for replying to such an old thread.

Gumi-presentation-by-Dzh commented 5 years ago

@hkundnani If you find more information about the MPI_Barrier issue or solve this problem, please let us know.