StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu
Apache License 2.0

Unable to run legion implementation #1353

Open ClProsser opened 1 year ago

ClProsser commented 1 year ago

We implemented a 3D finite-volume simulation using Legion (v22.06.0) in C++. The solution produces correct results for arbitrary problem and partition sizes. We now want to execute our implementation on a cluster. However, we are experiencing multiple problems and would appreciate help, as we found almost no documentation regarding multi-node execution.

Our cluster consists of 8 nodes. Each node has:

Job submission is done via Slurm (Version 22.05.2).

Below are our questions:

  1. Is there any up-to-date documentation regarding execution? We found:

    However, neither link really addresses our problem.

  2. According to the profiling machine configuration page, -ll:cpu specifies the number of processors to allocate on each node. In our configuration with 8 nodes and 8 cores per node, we chose -ll:cpu 7 and -ll:util 1. Legion then reports oversubscription, even though the nodes should not actually be oversubscribed:

    • Error: [0 - 7f232437d780] {4}{threads}: reservation ('CPU proc 1d00000000000008') cannot be satisfied
    • According to our tests, this error vanishes if the sum of both parameters is smaller than the number of cores per socket (in our configuration, using -ll:cpu 3 and -ll:util 1). However, according to the documentation, typically only one Legion instance runs per node (see Debugging).
    • To rule out a misconfigured SLURM setup, we executed the program on the VSC-5 cluster with adjusted parameters but obtained the same result.
  3. Using -ll:show_rsrv, we obtain output as seen below: CPU proc 1d00000000000002: allocated <>. Note the missing CPU ids after allocated. Which ids are listed here (i.e. are these the processor ids from /proc/cpuinfo)? Any suggestions why they are missing? How does Legion derive the internal id 1d00000000000002?

    • The same occurs on the VSC-5 cluster with adjusted parameters.
  4. -lg:prof <n> and -lg:prof_logfile ./prof_%.gz write to one output file, regardless of the specified number of nodes (in our case 8)

    • The resulting file is not parsable with the legion_prof.py script (in the multi-node configuration), probably because multiple processes write to the same file in parallel.
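
For context: our expectation (based on our reading of the profiling documentation, so this may be a wrong assumption on our side) was that the % in -lg:prof_logfile is substituted per process, i.e. that an 8-node run with -lg:prof 8 would leave one file per node behind, roughly:

prof_0.gz  prof_1.gz  prof_2.gz  prof_3.gz  prof_4.gz  prof_5.gz  prof_6.gz  prof_7.gz

Instead, only a single prof_0.gz appears.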

Jobscript on our cluster (submitted via sbatch):

#!/bin/bash

#SBATCH -J cronos-amr-parallel
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --output=output_%N.out
#SBATCH --error=error_%N.out
#SBATCH --nodes=8
#SBATCH --exclusive

srun -N8 -n8 --cpus-per-task 8 --output output_%N.log ./cronos-amr ./configuration/shock-tube-integration.toml -ll:cpu 7 -ll:util 1 -ll:show_rsrv -lg:prof 8 -lg:prof_logfile ./prof_%.gz

Logs and prof output can be found here: logs.zip

(we know that for real performance numbers, we should implement an environment-specific mapper)

It would be great if you could help us; we would like to compare our Legion implementation with our OpenMP and MPI implementations.

Thanks 🙂

rohany commented 1 year ago

Try adding --cpu-bind=none (or the corresponding slurm option) so that the process is not bound to physical cores. Some of the errors you are seeing appear to be due to slurm binding you to only a single socket rather than the full processor.
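
For example (just a sketch, reusing the flags from your jobscript), the srun line would become something like:

srun -N8 -n8 --cpu-bind=none --cpus-per-task 8 --output output_%N.log ./cronos-amr ./configuration/shock-tube-integration.toml -ll:cpu 7 -ll:util 1 -ll:show_rsrv -lg:prof 8 -lg:prof_logfile ./prof_%.gz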

How are you invoking legion prof? What error are you seeing? Note that legion prof won't succeed if the application exited with an error.

ClProsser commented 1 year ago

Thanks @rohany for your help. I added --cpu-bind=none to srun but the resulting behavior seems the same. Logs:

cpubind.zip

How are you invoking legion prof? What error are you seeing? Note that legion prof won't succeed if the application exited with an error.

The application exits successfully. I invoke legion prof using:

python3 ./legion_prof.py ./prof_0.gz

legion_prof.py terminates successfully. While the legion_prof directory is created, an index.html is missing. Note that this works if executed with only one node and one task.

Reading log file ./prof_0.gz...
parsing ./prof_0.gz
Matched 8386 objects
Generating interactive visualization files in directory legion_prof
emitting utilization
elapsed: 0.3948557376861572s
rohany commented 1 year ago

This log:

core map {
  domain 0 {
    core 0 { ids=<0> alu=<1> fpu=<1> ldst=<1> }
    core 1 { ids=<1> alu=<0> fpu=<0> ldst=<0> }
    core 2 { ids=<2> alu=<3> fpu=<3> ldst=<3> }
    core 3 { ids=<3> alu=<2> fpu=<2> ldst=<2> }
    core 4 { ids=<4> alu=<5> fpu=<5> ldst=<5> }
    core 5 { ids=<5> alu=<4> fpu=<4> ldst=<4> }
    core 6 { ids=<6> alu=<7> fpu=<7> ldst=<7> }
    core 7 { ids=<7> alu=<6> fpu=<6> ldst=<6> }
  }
}
CPU proc 1d00000000000002: allocated <>
CPU proc 1d00000000000003: allocated <>
CPU proc 1d00000000000004: allocated <>
CPU proc 1d00000000000005: allocated <>
CPU proc 1d00000000000006: allocated <>
CPU proc 1d00000000000007: allocated <>
dedicated worker (generic) #2: allocated <>
dedicated worker (generic) #1: allocated <>
utility proc 1d00000000000000: allocated <>
CPU proc 1d00000000000001: allocated <>

shows that adding --cpu-bind=none helped, in that Realm at least thinks you have 8 CPU cores available. However, it looks like 2 more background workers (likely for message handling and communication in the multi-node setting) could not be allocated threads. With -ll:cpu 7 -ll:util 1 plus those 2 dedicated workers you are asking for 10 dedicated threads on 8 cores; -ll:cpu 5 -ll:util 1 brings the total down to 8. Try running with -ll:cpu 5 -ll:util 1.

rohany commented 1 year ago

I don't know about the legion prof issue. Maybe @elliottslaughter ?

ClProsser commented 1 year ago

shows that adding --cpu-bind=none helped, in that Realm at least thinks you have 8 CPU cores available.

This was already the case without --cpu-bind=none (for reference, consider the first logs I posted).

Try running with -ll:cpu 5 -ll:util 1.

Apart from now using 6 cores, the output seems the same.

cpu6.zip

elliottslaughter commented 1 year ago

Maybe try the Rust Legion Prof: https://legion.stanford.edu/profiling/#rust-legion-prof ? I don't maintain the Python profiler and Rust will be faster anyway.
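
A rough sketch of building and running it, assuming the Rust profiler lives under tools/legion_prof_rs in the same Legion checkout you built the application from (the exact path and command-line options can differ between Legion versions, so check legion_prof --help):

cd /path/to/legion/tools/legion_prof_rs
cargo install --path .
# then, in the directory containing the log files:
legion_prof prof_0.gz    # should produce a legion_prof/ output directory, like the Python tool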

rohany commented 1 year ago

This was already the case without --cpu-bind=none (for reference, consider the first logs I posted).

Ah, I'm sorry, I didn't see those original logs. I don't have any further ideas about what could be wrong with these core bindings -- @streichler, do you know?

streichler commented 1 year ago

There are only 4 CPU cores in that picture. Each has 2 hyperthreads, but Realm doesn't want to use them by default. To get dedicated cores, you can either lower -ll:cpu to 3, or you can add -ll:ht_sharing 0 which will tell Realm to pretend that hyperthreads do not share physical execution resources. (Whether or not this lie causes performance problems tends to be application specific.)
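
Concretely, only the Realm flags would change; everything else in the srun line stays as before (a sketch, not a tested command line):

# option 1: stay within the 4 physical cores
... -ll:cpu 3 -ll:util 1 ...
# option 2: let Realm hand out hyperthreads as if they were independent cores
... -ll:cpu 7 -ll:util 1 -ll:ht_sharing 0 ...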

rohany commented 1 year ago

How did you tell that the cores were hyperthreads?

streichler commented 1 year ago

These lines:

core 0 { ids=<0> alu=<1> fpu=<1> ldst=<1> }
core 1 { ids=<1> alu=<0> fpu=<0> ldst=<0> }

indicate that what Linux calls "core 0" and "core 1" share an ALU, an FPU, and a load/store unit. Same for the other pairs of "cores".

rohany commented 1 year ago

indicate that what Linux calls "core 0" and "core 1" share an ALU, an FPU, and a load/store unit.

How did you figure this out, though? I was expecting that they would have the same alu ID or something, but since they are all different here I wasn't able to be sure that these were hyperthreads.

streichler commented 1 year ago

Sorry, alu=<1> means that the core (core 0, in this case) shares an alu with core 1. On a P9 system, you'll see things like:

core 0 { ids=<0> alu=<1,2,3> ... }

because it has 4 "hyperthreads" (can't remember what IBM actually calls them) per CPU core.

ClProsser commented 1 year ago

Thanks @elliottslaughter and @streichler,

-ll:ht_sharing 0 changes a lot. Maybe this should be added to the documentation (at least I could not find it). This solved the problems mentioned in questions 2 and 3. What I can't understand: based on your explanation there are two threads per core, but according to lscpu, the Thread(s) per core value is 1. Thus, hyperthreading is disabled (or, in our case, the hardware doesn't support hyperthreading at all). Shouldn't this flag then have no effect in my case?

$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    1
Core(s) per socket:    4
Socket(s):             2
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 23
Model name:            Intel(R) Xeon(R) CPU           L5420  @ 2.50GHz
Stepping:              10
CPU MHz:               2500.000
CPU max MHz:           2500.0000
CPU min MHz:           2000.0000
BogoMIPS:              4987.44
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              6144K
NUMA node0 CPU(s):     0-7

Logfiles:

hyperthreading-2.zip

Note: After rohany's recommendation to change to -ll:cpu 5, I forgot to change it back to 7 afterwards. I re-uploaded the correct logfiles with the expected -ll:cpu 7 parameter. Sorry about that.


Regarding legion prof: I tried to run the rust version of legion prof on the provided prof_0.gz of hyperthreading.zip. However, I obtain the following error: legion_prof_rust.log

Do you see any reason why Legion prof only creates one prof output file, even though there should be one per node? Thanks

elliottslaughter commented 1 year ago

@ClProsser that is the error you get when the version of Legion Prof is out of sync with the version of Legion you ran the application on. It is important that when you use it, you use exactly the same Legion commit, because the Legion Prof logger format changes from time to time and is not backwards compatible.

You will need a flag -lg:prof N if you want to get log files on N nodes.
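
As a quick check of the first point (plain git; the paths below are placeholders for wherever the two checkouts live), the commit used to build the application and the commit used to build legion_prof should be identical:

git -C /path/to/legion-used-for-the-app rev-parse HEAD
git -C /path/to/legion-used-for-legion_prof rev-parse HEAD
# the two hashes must match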

streichler commented 1 year ago

Hmm... that shows 2 sockets, but only 1 NUMA domain, which is also odd. Do you have lstopo on that system? If so, can you paste the output of lstopo -of txt?

ClProsser commented 1 year ago

Thus, hyperthreading is disabled

Edit: I checked the hardware; the CPUs don't even support hyperthreading.

@ClProsser that is the error you get when the version of Legion Prof is out of sync with the version of Legion you ran the application on. It is important that when you use it, you use exactly the same Legion commit, because the Legion Prof logger format changes from time to time and is not backwards compatible.

Thanks for that hint; I had indeed mixed up the profiler tool versions. Using the matching version and the Rust implementation, I obtain valid HTML output (attached below). However, the profiler only shows one node, even though I specified -lg:prof 8 when running Legion:

legion_prof.zip

You will need a flag -lg:prof N if you want to get log files on N nodes.

This is defined in our jobscript, as mentioned in the original post. Below is again my current jobscript:

#!/bin/bash

#SBATCH -J cronos-amr-parallel

#SBATCH --nodes=8
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --output=output_%N.out
#SBATCH --error=error_%N.out
#SBATCH --mem=0
#SBATCH --exclusive

srun -N8 -n8 --cpu-bind=none --cpus-per-task 8 --output output_%N.log ./cronos-amr ./configuration/shock-tube-integration.toml -ll:cpu 7 -ll:util 1 -ll:ht_sharing 0 -ll:show_rsrv -lg:prof 8 -lg:prof_logfile ./prof_%.gz

which is, apart from the added -ll:ht_sharing 0, the same as in the original post.

Do you have any other ideas why the profiler does not produce multiple prof_%.gz files? To me it looks like Legion does not recognize that multiple nodes are available.

Hmm... that shows 2 sockets, but only 1 NUMA domain, which is also odd. Do you have lstopo on that system? If so, can you paste the output of lstopo -of txt?

Here are both outputs (hwloc-ls_--of_txt.txt containing the lstopo -of txt output, and hwloc-ls.txt containing the lstopo output):

hwloc-ls_--of_txt.txt hwloc-ls.txt

Note: This was executed on an actual node used for computation, with the same parameters as defined in the jobscript (i.e. 8 nodes, 8 cpus per task, 1 task per node).

elliottslaughter commented 1 year ago

If you are not getting multiple log files with -lg:prof 8, then there is definitely something wrong. The most likely causes are:

  1. Legion was not compiled with GASNet.
  2. Somehow GASNet is not connecting to SLURM properly. (This is much less likely because we use a method that is pretty robust.)

How did you install Legion?
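
If it turns out GASNet was not enabled, a minimal sketch of a Makefile-based Legion build with GASNet turned on could look like the following (variable names follow Legion's usual Makefile options; the GASNet install path and the ibv conduit are assumptions for an InfiniBand cluster):

# assumes GASNet-EX is already installed under $GASNET_ROOT with the ibv conduit built
USE_GASNET=1 CONDUIT=ibv GASNET=$GASNET_ROOT make -j 8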

ClProsser commented 1 year ago

Thanks for this hint, elliott; it was not clear to us that GASNet is a requirement for multi-node execution. I'm currently investing time into the setup and execution using GASNet and will share my results in the upcoming days.

ClProsser commented 1 year ago

Hi,

some days and many segfaults have passed; finally I have a working implementation and setup. Thanks, elliott, for your final hint.

Legion was not compiled with GASNet.

This was actually the case. Once I got GASNet running (after several segfaults while it still only ran on a single node), this worked. Do you have any idea why -ll:ht_sharing 0 is still required in our configuration?

Thanks.

elliottslaughter commented 1 year ago

@ClProsser can you remind me what your -ll:show_rsrv output looks like without -ll:ht_sharing 0, and the rest of your command line?

Either something is broken in the CPU auto-detect (unlikely, I think?), or you're still oversubscribing the machine and -ll:ht_sharing 0 is basically just papering over that fact.

ClProsser commented 1 year ago

-ll:show_rsrv produces the following output: rsrv_output.log. This test was done using the command line:

GASNET_USE_XRC=0 /usr/bin/time gasnetrun_ibv -n 8 -N 8 cronos-amr ./configuration/shock-tube-integration2.toml -ll:show_rsrv -ll:cpu 6 -ll:util 2 -lg:prof 8 -lg:prof_logfile ./prof_%.gz

The hwloc output was already posted above, if needed.

elliottslaughter commented 1 year ago

Ok, so I guess there is genuinely a bug in Legion's auto-detection of the machine here. It is detecting hyperthreads, somehow, when those do not actually exist.

Pinging @streichler to look at this.

ClProsser commented 1 year ago

Thanks for the investigation. Regarding profiling: Should we expect any overhead based on the use of -ll:ht_sharing 0 in this case?

elliottslaughter commented 1 year ago

Because your machine does not actually have hyperthreads, -ll:ht_sharing 0 should be a strict improvement. Normally we'd advise against it because sharing hyperthreads on a physical core is often not a performance win, but this is moot when you don't have hyperthreads.

ClProsser commented 1 year ago

I'm adding detailed outputs for our primary machine, VSC-5. Each node has two AMD EPYC 7713 CPUs with 64 cores per processor, i.e. 128 cores per node. We obtain similar errors there:

The jobscript:

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=2

hwloc-ls --of txt > ./hwloc-ls.txt
./gasnetrun_ibv -n $SLURM_JOB_NUM_NODES -N $SLURM_JOB_NUM_NODES cronos-amr ./configuration/shock-tube-integration.toml -ll:cpu 1 -ll:util 1 -ll:show_rsrv

Hwloc txt output: hwloc-ls.txt

-ll:show_rsrv output:

[0 - 14799f9988c0]    0.000121 {4}{threads}: reservation ('dedicated worker (generic) #1') cannot be satisfied
core map {
  domain 0 {
    core 0 { ids=<0> alu=<128> fpu=<128> ldst=<128> }
    core 128 { ids=<128> alu=<0> fpu=<0> ldst=<0> }
  }
}
dedicated worker (generic) #1: allocated <>
dedicated worker (generic) #2: allocated <>
utility proc 1d00000000000000: allocated <>
CPU proc 1d00000000000001: allocated <>

Using -ll:ht_sharing 0 is not an option here, as these processors do support hyperthreads.