ClProsser opened this issue 1 year ago
Try adding --cpu-bind=none to srun (or the corresponding Slurm option) so that the process is not bound to specific physical cores. Some of the errors you are seeing appear to be due to Slurm binding you to only a socket, not the full processor.
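For example (a sketch based on the jobscript in the original post; binary, config, and Realm flags are the reporter's own):

```bash
# Keep Slurm from pinning the Legion process to a subset of cores;
# Realm will then do its own thread placement across all 8 cores per node.
srun -N8 -n8 --cpu-bind=none ./cronos-amr ./configuration/shock-tube-integration.toml -ll:cpu 7 -ll:util 1
```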
How are you invoking Legion Prof? What error are you seeing? Note that Legion Prof won't succeed if the application exited with an error.
Thanks @rohany for your help. I added --cpu-bind=none to srun, but the resulting behavior seems the same. Logs:
> How are you invoking Legion Prof? What error are you seeing? Note that Legion Prof won't succeed if the application exited with an error.
The application exits successfully. I invoke Legion Prof using:

```
python3 ./legion_prof.py ./prof_0.gz
```

legion_prof.py terminates successfully. While the legion_prof directory is created, index.html is missing. Note that this works if executed with only one node and one task.
```
Reading log file ./prof_0.gz...
parsing ./prof_0.gz
Matched 8386 objects
Generating interactive visualization files in directory legion_prof
emitting utilization
elapsed: 0.3948557376861572s
```
This log:
```
core map {
domain 0 {
core 0 { ids=<0> alu=<1> fpu=<1> ldst=<1> }
core 1 { ids=<1> alu=<0> fpu=<0> ldst=<0> }
core 2 { ids=<2> alu=<3> fpu=<3> ldst=<3> }
core 3 { ids=<3> alu=<2> fpu=<2> ldst=<2> }
core 4 { ids=<4> alu=<5> fpu=<5> ldst=<5> }
core 5 { ids=<5> alu=<4> fpu=<4> ldst=<4> }
core 6 { ids=<6> alu=<7> fpu=<7> ldst=<7> }
core 7 { ids=<7> alu=<6> fpu=<6> ldst=<6> }
}
}
CPU proc 1d00000000000002: allocated <>
CPU proc 1d00000000000003: allocated <>
CPU proc 1d00000000000004: allocated <>
CPU proc 1d00000000000005: allocated <>
CPU proc 1d00000000000006: allocated <>
CPU proc 1d00000000000007: allocated <>
dedicated worker (generic) #2: allocated <>
dedicated worker (generic) #1: allocated <>
utility proc 1d00000000000000: allocated <>
CPU proc 1d00000000000001: allocated <>
```
shows that adding --cpu-bind=none helped, in that Realm at least thinks you have 8 CPU cores available. However, it looks like 2 more background workers could not be allocated threads (likely for message handling and communication in the multi-node setting). Try running with -ll:cpu 5 -ll:util 1.
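For example (a sketch; reserving 5 compute processors plus 1 utility processor leaves 2 of the 8 cores free for the two background workers):

```bash
srun -N8 -n8 --cpu-bind=none ./cronos-amr ./configuration/shock-tube-integration.toml -ll:cpu 5 -ll:util 1 -ll:show_rsrv
```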
I don't know about the Legion Prof issue. Maybe @elliottslaughter?
> shows that adding --cpu-bind=none helped, in that Realm at least thinks you have 8 CPU cores available.
This was already the case without --cpu-bind=none (for reference, see the first logs I posted).
> Try running with -ll:cpu 5 -ll:util 1.
Besides using 6 cores now, the output seems the same.
Maybe try the Rust Legion Prof: https://legion.stanford.edu/profiling/#rust-legion-prof? I don't maintain the Python profiler, and Rust will be faster anyway.
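For reference, a rough sketch of building and running the Rust profiler (the exact path and CLI may differ between Legion versions; see the linked page for the authoritative instructions):

```bash
# Build from the same Legion checkout the application was compiled against
cd legion/tools/legion_prof_rs
cargo install --locked --path .

# Process the logs; output goes to a legion_prof/ directory as with the Python tool
legion_prof prof_0.gz
```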
> This was already the case without --cpu-bind=none (for reference, see the first logs I posted).
Ah, I'm sorry, I didn't see those original logs. I don't have any further ideas about what could be wrong with these core bindings -- @streichler, do you know?
There are only 4 CPU cores in that picture. Each has 2 hyperthreads, but Realm doesn't want to use them by default. To get dedicated cores, you can either lower -ll:cpu to 3, or you can add -ll:ht_sharing 0, which tells Realm to pretend that hyperthreads do not share physical execution resources. (Whether or not this lie causes performance problems tends to be application specific.)
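Concretely, either of these should yield dedicated cores on the 4-core/8-hyperthread picture above (a sketch; only the Realm flags matter here):

```bash
# Option 1: fewer CPU processors, so 3 compute + 1 utility fit on 4 physical cores
./cronos-amr <config> -ll:cpu 3 -ll:util 1 -ll:show_rsrv

# Option 2: tell Realm to treat each hyperthread as an independent core
./cronos-amr <config> -ll:cpu 7 -ll:util 1 -ll:ht_sharing 0 -ll:show_rsrv
```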
How did you tell that the cores were hyperthreads?
These lines:

```
core 0 { ids=<0> alu=<1> fpu=<1> ldst=<1> }
core 1 { ids=<1> alu=<0> fpu=<0> ldst=<0> }
```

indicate that what Linux calls "core 0" and "core 1" share an ALU, an FPU, and a load/store unit. The same holds for the other pairs of "cores".
> indicate that what Linux calls "core 0" and "core 1" share an ALU, an FPU, and a load/store unit.
How did you figure this out, though? I was expecting that they would have the same ALU ID or something, but since they are all different here I wasn't able to be sure that these were hyperthreads.
Sorry, alu=<1> means that the core (core 0, in this case) shares an ALU with core 1. On a P9 system, you'll see things like:

```
core 0 { ids=<0> alu=<1,2,3> ... }
```

because it has 4 "hyperthreads" (can't remember what IBM actually calls them) per CPU core.
Thanks @elliottslaughter and @streichler, -ll:ht_sharing 0 changes a lot. Maybe this should be added to the documentation (at least I could not find it). This solved the problems mentioned in questions 2 and 3. What I can't understand: based on your explanation there are two threads per core; however, according to lscpu, the "Thread(s) per core" value is 1. Thus, hyperthreading is disabled (or in our case: the hardware doesn't support hyperthreading). Shouldn't this flag therefore have no effect in my case?
```
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 2
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 23
Model name: Intel(R) Xeon(R) CPU L5420 @ 2.50GHz
Stepping: 10
CPU MHz: 2500.000
CPU max MHz: 2500.0000
CPU min MHz: 2000.0000
BogoMIPS: 4987.44
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 6144K
NUMA node0 CPU(s): 0-7
```
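(As a cross-check of the lscpu numbers, the kernel's own topology view can be queried directly; a sketch assuming a standard Linux sysfs:)

```bash
# Logical CPUs sharing a physical core with CPU 0; a single entry ("0")
# means CPU 0 has no hyperthread sibling
cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list

# Per-CPU core/socket mapping
lscpu -e=CPU,CORE,SOCKET
```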
Logfiles: Note: based on rohany's recommendation I changed to -ll:cpu 5 and forgot to change it back to 7 afterwards. I re-uploaded correct logfiles with the expected -ll:cpu 7 parameter. Sorry for that.
Regarding Legion Prof: I tried to run the Rust version of Legion Prof on the provided prof_0.gz of hyperthreading.zip. However, I obtain the following error: legion_prof_rust.log

Do you see any reason why Legion Prof is only creating one prof output file, even though there should be one per node? Thanks
@ClProsser that is the error you get when the version of Legion Prof is out of sync with the version of Legion you ran the application on. It is important that when you use it, you use exactly the same Legion commit, because the Legion Prof logger format changes from time to time and is not backwards compatible.
You will need the flag -lg:prof N if you want to get log files on N nodes.
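For example, with all 8 nodes profiled (the % in the logfile name is expanded per node, so this should yield prof_0.gz through prof_7.gz):

```bash
srun -N8 -n8 ./cronos-amr <config> -lg:prof 8 -lg:prof_logfile ./prof_%.gz
```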
Hmm... that shows 2 sockets but only 1 NUMA domain, which is also odd. Do you have lstopo on that system? If so, can you paste the output of lstopo -of txt?
> Thus, hyperthreading is disabled
Edit: I checked the hardware; the CPUs don't even support hyperthreading.
> @ClProsser that is the error you get when the version of Legion Prof is out of sync with the version of Legion you ran the application on. It is important that when you use it, you use exactly the same Legion commit, because the Legion Prof logger format changes from time to time and is not backwards compatible.
Thanks for that hint, I had messed up the profiler tool versions. Using the same version and the Rust implementation, I can obtain valid HTML output (attached below). However, the profiler only shows one node, although I specified -lg:prof 8 when running Legion:
> You will need the flag -lg:prof N if you want to get log files on N nodes.
This is defined in our jobscript, as mentioned in the original post. Below is again my current jobscript:
```bash
#!/bin/bash
#SBATCH -J cronos-amr-parallel
#SBATCH --nodes=8
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --output=output_%N.out
#SBATCH --error=error_%N.out
#SBATCH --mem=0
#SBATCH --exclusive

srun -N8 -n8 --cpu-bind=none --cpus-per-task 8 --output output_%N.log ./cronos-amr ./configuration/shock-tube-integration.toml -ll:cpu 7 -ll:util 1 -ll:ht_sharing 0 -ll:show_rsrv -lg:prof 8 -lg:prof_logfile ./prof_%.gz
```
which is, apart from -ll:ht_sharing 0, the same as in the original post. Do you have any other idea why the profiler is not producing multiple prof_%.gz files? To me it looks like Legion does not recognize that multiple nodes are available.
> Hmm... that shows 2 sockets but only 1 NUMA domain, which is also odd. Do you have lstopo on that system? If so, can you paste the output of lstopo -of txt?
Here are both outputs (hwloc-ls_--of_txt.txt containing the lstopo -of txt output, and hwloc-ls.txt containing the lstopo output): hwloc-ls_--of_txt.txt hwloc-ls.txt

Note: this was executed on an actual node used for computation, with the same parameters as defined in the jobscript (i.e. 8 nodes, 8 CPUs per task, 1 task per node).
If you are not getting multiple log files with -lg:prof 8, then there is definitely something wrong. The most likely cause: Legion was not compiled with GASNet. (Without GASNet there is no multi-node support, so every process runs as an independent single-node instance.) How did you install Legion?
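For reference, a minimal sketch of a GASNet-enabled build, assuming Legion's Make-based build flow on an InfiniBand cluster (variable names as in Legion's build documentation; all paths are placeholders):

```bash
# Assumed Make-based build; consult Legion's build docs for your exact setup.
export LG_RT_DIR=/path/to/legion/runtime
export USE_GASNET=1               # compile in the GASNet network layer (multi-node)
export CONDUIT=ibv                # InfiniBand conduit
export GASNET=/path/to/gasnet/install
make                              # run in the application's build directory
```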
Thanks for this hint elliott, it was not clear to us that GASNet is a requirement for multi-node execution. I'm currently investing time into the setup and execution using GASNet and will share my results in the upcoming days.
Hi, some days and many segfaults have passed; I finally have a working implementation and setup. Thanks elliott for your final hint:

> Legion was not compiled with GASNet.

This was actually the case. Once I got GASNet running (after several segfaults, since it initially only ran on a single node), this worked. Do you have any idea why -ll:ht_sharing 0 is still required in our configuration? Thanks.
@ClProsser can you remind me what your -ll:show_rsrv output looks like without -ll:ht_sharing 0, and the rest of your command line? Either something is broken in the CPU auto-detect (unlikely, I think?), or you're still oversubscribing the machine and -ll:ht_sharing 0 is basically just papering over that fact.
-ll:show_rsrv produces the following output: rsrv_output.log

This test was done using the command line:

```bash
GASNET_USE_XRC=0 /usr/bin/time gasnetrun_ibv -n 8 -N 8 cronos-amr ./configuration/shock-tube-integration2.toml -ll:show_rsrv -ll:cpu 6 -ll:util 2 -lg:prof 8 -lg:prof_logfile ./prof_%.gz
```

The hwloc output was already posted, if needed.
Ok, so I guess there is genuinely a bug in Legion's auto-detection of the machine here. It is detecting hyperthreads, somehow, when those do not actually exist.
Pinging @streichler to look at this.
Thanks for the investigation. Regarding profiling: should we expect any overhead from using -ll:ht_sharing 0 in this case?
Because your machine does not actually have hyperthreads, -ll:ht_sharing 0 should be a strict improvement. Normally we'd advise against it because sharing hyperthreads on a physical core is often not a performance win, but this is moot when you don't have hyperthreads.
I'll just add detailed outputs for our primary machine, VSC-5: each node has two AMD EPYC 7713 processors with 64 cores each, thus 128 cores per node. We obtain similar errors.

The jobscript:

```bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=2

hwloc-ls --of txt > ./hwloc-ls.txt
./gasnetrun_ibv -n $SLURM_JOB_NUM_NODES -N $SLURM_JOB_NUM_NODES cronos-amr ./configuration/shock-tube-integration.toml -ll:cpu 1 -ll:util 1 -ll:show_rsrv
```
Hwloc txt output: hwloc-ls.txt

-ll:show_rsrv output:
```
[0 - 14799f9988c0] 0.000121 {4}{threads}: reservation ('dedicated worker (generic) #1') cannot be satisfied
core map {
domain 0 {
core 0 { ids=<0> alu=<128> fpu=<128> ldst=<128> }
core 128 { ids=<128> alu=<0> fpu=<0> ldst=<0> }
}
}
dedicated worker (generic) #1: allocated <>
dedicated worker (generic) #2: allocated <>
utility proc 1d00000000000000: allocated <>
CPU proc 1d00000000000001: allocated <>
```
Using -ll:ht_sharing 0 is not an option here, as these processors do support hyperthreads.
We implemented a 3D finite-volume simulation using Legion (v22.06.0) in C++. The solution produces correct results for arbitrary problem and partition sizes. We want to execute our implementation using Legion on a cluster. However, we experience multiple problems and would appreciate help, as we found nearly no documentation regarding multi-node execution.

Our cluster consists of 8 nodes; each node has two Intel Xeon L5420 processors with 4 cores each (8 cores per node, see the lscpu output above). Job submission is done via Slurm (version 22.05.2).

Below are our questions:
1. Is there any up-to-date documentation regarding execution? Both links we found don't really help with our problem.
2. According to the profiling machine configuration page, -ll:cpu is used to specify the number of processors to allocate on each node. In our configuration with 8 nodes and 8 cores per node, we chose -ll:cpu 7 and -ll:util 1. This results in oversubscription warnings, although the nodes are not oversubscribed:

```
[0 - 7f232437d780] {4}{threads}: reservation ('CPU proc 1d00000000000008') cannot be satisfied
```
The warnings disappear with fewer processors (e.g., -ll:cpu 3 and -ll:util 1). However, according to the documentation, typically only one Legion instance is running per node (see Debugging).

3. Using -ll:show_rsrv, we obtain output as seen below:

```
CPU proc 1d00000000000002: allocated <>
```
Notable is the missing CPU ID after "allocated". Which IDs are stated here (i.e., are these the processor IDs from /proc/cpuinfo)? Any suggestions why these are missing? How does Legion obtain the internal mapping ID 1d00000000000002?
4. -lg:prof <n> and -lg:prof_logfile ./prof_%.gz write to one output file, regardless of the specified number of nodes (in our case 8). We also cannot process the result with the legion_prof.py script (using the multi-node configuration), probably because multiple processes are writing to the same file in parallel.
script (using the multi-node configuration), probably as multiple processes are writing to the same file in parallel.Jobscript on our cluster (submitted via
sbatch
):Logs and prof output can be found here: logs.zip
(We know that for real performance numbers, we should implement an environment-specific mapper.)

It would be great if you could help us; we'd like to compare our Legion implementation with our OpenMP and MPI implementations. Thanks 🙂