mariodirenzo closed this issue 5 months ago
Something bad is happening in the network. Please regenerate the profiles with the Rust profiler. It is probably too late for this to make the March release.
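For reference, a rough sketch of how the raw profiler logs are typically produced (the launcher, node count, and application below are placeholders, and the exact Rust profiler invocation depends on the version you build; see its README):
# Enable profiling on all 8 nodes and write one compressed log per node
# (the % in the logfile name expands to the node index).
srun -N 8 ./my_app ... -lg:prof 8 -lg:prof_logfile prof_%.gz
# The resulting prof_*.gz files are then fed to the Rust legion_prof tool.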
@lightsighter These are the raw log files from each run. https://drive.google.com/drive/folders/1YqqTWMqo7d1JJyXqAg9W0oP1hdojhYoN?usp=share_link
These profiles were not generated for the most recent version of the profiler. You either need to regenerate them or provide the precise Legion commit hash that you used to generate them.
Legion commit hash from 30 January 2023: cba415a857c2586b2ad2f4848d6d1cd75de7df00
Legion commit hash from 28 January 2024: 91b55ce217933ec6cd546c0731a6652031dfc7e9
Those are both going to be too old to show the information we need. Please generate a slow profile from the most recent Legion master branch.
Also, please make sure all profiles are generated with the same version of GASNetEx.
I'm assuming the slow profile is generated using the python file. I don't exactly know how to make sure they're generated using the same version of GASNetEx; any hints on that? Thanks.
I'm saying you need to generate log files from a run using the most recent Legion master branch as they will have considerably more data about the timing of various operations.
I don't exactly know how to make sure they're generated using the same version of GASNetEx; any hints on that?
Build and link against this version of GASNetEX from here. I'm not sure which build system you are using so it is up to you to integrate that into your build system.
If you use setup_env.py, this version will get pulled in automatically when you build the latest Legion.
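If you want to double-check which GASNetEx release a given build actually picked up, one possible spot check (assuming GASNET_ROOT points at the GASNet install prefix used for that build; the variable name here is just a placeholder) is:
# Print the release version recorded in the installed GASNet headers.
grep -rH "GASNET_RELEASE_VERSION" "$GASNET_ROOT/include" | head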
Thanks @elliottslaughter for the tip 👍🏼 @lightsighter I've uploaded the profiles to the same directory. Both Legion versions now use the GASNetEx version you advised. The log files corresponding to the run that takes too long to get started use the most recent Legion master branch.
@elliottslaughter What happened to the dependent partitioning channel in the profiler?
@albovoci Have you marked all your partitions disjoint/aliased and complete/incomplete?
I pushed a fix for the profiler here: https://gitlab.com/StanfordLegion/legion/-/merge_requests/1169
Updated profile here: https://legion.stanford.edu/prof-viewer/?url=https://sapling2.stanford.edu/~eslaught/bug1652_7541cd950_gasnet_2023_9_0/
I know the partitions are marked as disjoint/aliased; I'm not so sure about complete/incomplete. Maybe @mariodirenzo can clarify that better.
No, we do not mark the regions as complete/incomplete because we do not know this attribute at compile time; it depends on the input of the calculation.
Run with -ll:force_kthreads and capture order ten periodic backtraces of all the threads on all the nodes during the first minute of the run and report them here.
Sorry, I'm kind of new to this :/ How do I "capture order ten periodic backtraces of all the threads on all the nodes"?
Here's an example script for you:
#!/bin/bash
set -e

# Launch the application in the background and remember its PID.
"$@" &
pid=$!

# Capture ten periodic backtraces of all threads during the first minute of the run.
for i in {1..10}; do
  sleep 6
  gdb -p $pid -ex 'set confirm off' -ex 'set height 0' -ex 'set width 0' -ex 'thread apply all backtrace' -ex 'quit'
done

# wait for backtraces to complete and then kill the process
if [[ ${KILL_PROCESS_AFTER_BACKTRACE:-1} = 1 ]]; then
  sleep 2m
  kill $pid
  sleep 1m
  kill -9 $pid
fi
Run it like this:
srun ... ./run_with_backtraces.sh ./my_app ...
You might want to redirect the output of each rank to a separate file to make sure they stay separate. Modify as necessary for your particular system.
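For the per-rank redirection, one possible wrapper under Slurm (SLURM_PROCID is set by srun for each task; adjust for other launchers) is:
#!/bin/bash
# per_rank.sh -- give every rank its own stdout/stderr files.
# Usage: srun ... ./per_rank.sh ./run_with_backtraces.sh ./my_app ...
rank="${SLURM_PROCID:-0}"
exec "$@" > "rank_${rank}.out" 2> "rank_${rank}.err"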
@albovoci - which version of HTR are you using to reproduce this?
@seemamirch It's this commit: 2198880d
Backtraces (10 per run with 1 every 10 seconds) are on sapling /scratch2/seemah/htr_issue_1652/results_O2_g
Legion commit hash from 30 January 2023: cba415a857c2586b2ad2f4848d6d1cd75de7df00 -> output_fileold
Can we do these runs without -lg:safe_ctrlrepl? That flag is known to cause huge amounts of blocking in the top-level task and is messing with the results. Unless of course the original runs were done with -lg:safe_ctrlrepl, in which case the solution is going to be to not do that for performance runs.
If you do re-run, you just need to do the newer commit and not the older one.
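In other words, the performance runs should simply drop the flag from the command line, roughly like this (launcher and application arguments are placeholders, and whether the flag takes a level argument depends on the Legion version):
# Debug/checking run (expensive, not for timing):
srun ... ./my_app ... -lg:safe_ctrlrepl 1
# Performance run: identical command line without the flag.
srun ... ./my_app ...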
@elliottslaughter Why would Regent be attaching semantic tags other than semantic tag 0 (the name tag)? For example:
#12 0x0000200000bad81c in Legion::Internal::Runtime::attach_semantic_information (this=0x194713b0, handle=..., fid=<optimized out>, tag=54321, buffer=0x2031d5824f30, size=4, is_mutable=<optimized out>) at /usr/WS1/mirchandaney1/legion_new/runtime/legion/runtime.cc:20212
#13 0x00002000006ea4fc in Legion::Runtime::attach_semantic_information (this=<optimized out>, handle=..., fid=<optimized out>, tag=<optimized out>, buffer=<optimized out>, size=<optimized out>, is_mut=<optimized out>) at /usr/WS1/mirchandaney1/legion_new/runtime/legion/legion.cc:7386
#14 0x000020000070f920 in legion_field_id_attach_semantic_information (runtime_=..., handle_=..., id=<optimized out>, tag=54321, buffer=0x2031d5824f30, size=4, is_mutable=<optimized out>) at /usr/WS1/mirchandaney1/legion_new/runtime/legion/legion_c.cc:2064
#15 0x000000001018f8b4 in $<workSingle> ()
#16 0x00000000101281b4 in $__regent_task_workSingle_primary ()
Why would there be 54321 semantic tags?
There are not 54321 semantic tags. It is a magic number:
https://gitlab.com/StanfordLegion/legion/-/blob/master/language/src/regent/std_base.t#L497
And the top of that block explains:
https://gitlab.com/StanfordLegion/legion/-/blob/master/language/src/regent/std_base.t#L480
@lightsighter Backtraces (10 per run with 1 every 10 seconds) without -lg:safe_ctrlrepl are on sapling /scratch2/seemah/htr_issue_1652/results_O2_g_no_safe_ctrlrepl. The performance runs don't have this option. 8 node runs failed, the others completed. I also see these warnings, e.g.:
output_file_8.1:[1 - 20406181f8b0] 4.605703 {4}{runtime}: [warning 1114] LEGION WARNING: Failed to find a refinement for KD tree with 3 dimensions and 21 rectangles. Please report your application to the Legion developers' mailing list. (from file /usr/WS1/mirchandaney1/legion_new/runtime/legion/region_tree.inl:6446)
The only major thing that the workSingle task seems to be waiting on is futures. This means that the workSingle task is waiting on at least hundreds if not thousands of futures every single run (as evidenced by all the different waits in the profile), especially at the start. If it was waiting on something else more frequently, that would show up in the backtraces, but I only see a few semantic attach waits, whereas the future wait calls are 10X more prominent. It's still possible that semantic attach operations are the issue and the samples of the backtraces just happen later in the run. It's hard to tell without timestamps. @elliottslaughter can we disable all the semantic attach operations? Can we also see why we might be waiting on lots of futures in Regent tasks?
8 node runs failed, the others completed
I wouldn't worry about any crashes. Even the new branch is two months old and there have been many bug fixes since then.
output_file_8.1:[1 - 20406181f8b0] 4.605703 {4}{runtime}: [warning 1114] LEGION WARNING: Failed to find a refinement for KD tree with 3 dimensions and 21 rectangles. Please report your application to the Legion developers' mailing list. (from file /usr/WS1/mirchandaney1/legion_new/runtime/legion/region_tree.inl:6446)
Mario and I already discussed those and they are the result of an unusual aliased partition that is being made. You don't need to worry about them and they are not going to be responsible for the profiling effects we're observing. They would show up very loudly as long running meta-tasks and would be prominently visible in profiles if they were a problem.
I want to remind here that we are using the same version of HTR to test the old and new versions of Legion. The application asks Regent and the runtime the same semantic-attach and future-wait operations regardless of what version of the runtime we are using.
can we disable all the semantic attach operations?
We need that semantic information in the mapper to perform our mapping decisions.
Can we also see why we might be waiting on lots of futures in Regent tasks?
We are waiting for some futures in the top-level task and the only workaround is to use predication. Unfortunately, predication is not compatible with tracing so, if we gain some performance on one side, we lose it on the other.
For what it's worth, I first added that semantic information in 8398065f88234674713aa22b0a9732dc014d04a3, June of 2022. So, in the time frame under discussion in this issue I believe the semantic information should be identical (or at least highly similar).
If you do want to run an experiment with this disabled, here are the lines to comment out:
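A hypothetical way to locate those Regent semantic-attach calls, assuming you are at the root of a Legion source checkout:
# Find the semantic-information attach calls (and the 54321 magic tag) in Regent's std_base.t.
grep -n -e "semantic" -e "54321" language/src/regent/std_base.t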
The application asks Regent and the runtime the same semantic-attach and future-wait operations regardless of what version of the runtime we are using.
I understand. I want to remove the semantic attaches to see if a change in the implementation of them causes the performance degradation or whether it was something else.
We need that semantic information in the mapper to perform our mapping decisions.
That seems pretty dicey. That's not really what the semantic information interface is for. It's more for logging and debugging, not for correctness. I suppose you can use it for that, but that's not what it is designed for and I can't promise it will be fast enough for all those cases.
We are waiting for some futures in the top-level task and the only workaround is to use predication. Unfortunately, predication is not compatible with tracing so, if we gain some performance on one side, we lose it on the other.
Let's turn off tracing and get rid of the waits and see if the profile is still "empty" (I don't care if it is slow, but I want to see the runtime busy doing all sorts of things). We don't have a good sense of what is causing all the waiting right now in the worker task so we need to start removing all the sources of blocking in the application side.
For what it's worth, I first added that semantic information in https://github.com/StanfordLegion/legion/commit/8398065f88234674713aa22b0a9732dc014d04a3, June of 2022
I don't care about that specific one. I care about all of the semantic info attached by Regent or by users. I don't think any one piece of semantic info is the cause of the problem, but I want to start removing sources of blocking because the backtrace experiment was inconclusive.
I told Mike already, but I was able to reproduce the performance issues on 8 and 16 nodes of Lassen, but was not able to reproduce the same problems on Perlmutter, indicating that something might be going on inside the network.
@mariodirenzo HTR uses the following 2 options: -ll:cpu_bgwork 100 -ll:util_bgwork 100 (time slicing util/cpu threads).
Without those options, the first iteration completes in 2.9 seconds for 8 nodes and 14 seconds for 16 nodes for the weakScaling test (as per console.txt).
With these options, the first iteration completes in 16.045 seconds for 8 nodes and 79 seconds for 16 nodes.
Both results are for a newer version of Legion (master branch 410d81239). I'm not sure why you have set these options.
Those options used to help performance as CPU and util threads are usually idle during the execution. Is the performance without the flags in line with the numbers achieved with the old version of Legion?
Yes, and overall new Legion perf is better. I compared with the same 2 Legion versions you mentioned in this issue:
new Legion, 16 nodes, 101 iterations, new options: 140 seconds
old Legion, 16 nodes, 101 iterations, new options: 168.9 seconds
old Legion, 16 nodes, 101 iterations, old options: 193 seconds
Background worker (bgwork) threads are set to 16 already - it's not helping to have those options enabled.
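For clarity, the two configurations being compared look roughly like this (launcher, node count, and application arguments are placeholders):
# "Slow" configuration: CPU and Util processors also pick up background work.
srun -N 16 ./my_app ... -ll:bgwork 16 -ll:cpu_bgwork 100 -ll:util_bgwork 100
# "Fast" configuration: background work stays on the dedicated bgwork threads.
srun -N 16 ./my_app ... -ll:bgwork 16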
Profiles for an HTR test on Lassen, 1 node, with -ll:cpu_bgwork 100 -ll:util_bgwork 100 (slow version): https://legion.stanford.edu/prof-viewer/?url=https://sapling2.stanford.edu/~seemah/accessor_gpu_128_sched/
Without these options (fast version): https://legion.stanford.edu/prof-viewer/?url=https://sapling2.stanford.edu/~seemah/accessor_gpu_128_no_sched/
FB to FB copies appear to be taking much longer (n0f2-n0f2 channel)
Was there a conclusion on this? Was the issue fully resolved by removing the -ll:cpu_bgwork and -ll:util_bgwork flags?
These flags have been removed from the HTR scripts. @mariodirenzo to confirm performance is now better. It's unclear, though, why the FB to FB copies are so much slower with these options.
The issue is solved by removing those flags but, as Seema mentioned, the reason is unclear. I am not particularly attached to those flags, so it is up to you to decide whether this is a satisfactory solution (and close the issue) or whether you want to further investigate what was going on.
@elliottslaughter Did we add some kind of detection for this in Legion Prof? I thought you might have done that but I can't remember.
I'm not sure we have a root cause yet, so it does not make sense to add any detection.
My understanding from the above discussion is that HTR was running with -ll:bgwork 16 (i.e., 16 background worker threads) as well as with -ll:cpu_bgwork 100 -ll:util_bgwork 100.
There are two possible hypotheses I think we can form based on this:
1. The additional threads doing background work introduce contention.
2. Realm micro-ops get split depending on the value of -ll:cpu_bgwork and -ll:util_bgwork, so overheads increase on a per-micro-op basis.
Both of these hypotheses seem plausible to me. I suppose we could do an experiment with -ll:cpu_bgwork 500 -ll:util_bgwork 500 to see if that changes the observed behavior. At any rate, without getting to a root diagnosis I don't think there is anything to really do in terms of trying to catch this automatically.
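Concretely, that experiment would be something like the following (everything other than the two thresholds is a placeholder):
# Raise the background-work threshold on the CPU/Util processors from 100 to 500
# and see whether the observed slowdown changes with it.
srun ... ./my_app ... -ll:bgwork 16 -ll:cpu_bgwork 500 -ll:util_bgwork 500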
One thing I think we should do is track the core binding in the profiler. I.e., we want to show a diagram similar to -ll:show_rsrv so that the user knows how Realm detected the machine and what processors were assigned to what cores. That doesn't automatically detect issues like this, but would potentially let us (especially core maintainers helping users) diagnose certain kinds of issues like these ones (along with more common cases like binding all the processors to a single CPU core).
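Until the profiler shows that, the same information is available at startup by passing the flag directly (launcher and application are placeholders):
# Ask Realm to print its core reservations, i.e., which processor threads were bound to which cores.
srun ... ./my_app ... -ll:show_rsrv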
A correction: My hypothesis (2) appears to be incorrect. Realm micro-ops do NOT get split depending on the value of -ll:cpu_bgwork or -ll:util_bgwork. Instead, these values work as a filter: CPU/Util processors simply will not run background work items estimated to be larger than the threshold.
(Having said that, I'm not sure that Realm goes to any particular effort to estimate the cost of micro-ops, so it's possible that this effectively turns into a binary flag. Either the CPU/Util runs background work items, or it does not.)
Therefore, nothing about this setting would result in overheads increasing on a per-micro-op basis as the micro-ops themselves are untouched.
So then the hypotheses that remain are either (1) that the additional threads introduce contention (as mentioned above), or (3) that having CPUs/Utils involved in background work either delays higher-priority CPU/Util tasks or else somehow interferes with the CPU/Util processors' ability to run such tasks (e.g., because more time is spent in lock contention).
I think (3) is more likely because we've turned over the responsibility of the thread scheduling to the OS (probably Linux) and it's going to do its fair-share scheduling thing, which will round-robin between the background worker threads. If it has to do that for all the background worker threads, each of which is going to check queues for stuff to do, before getting back to the one thread that happens to have the interesting bit of work needed to make forward progress, that is going to cause massive slowdowns.
Ok. To answer @mariodirenzo's question from a couple comments back: I think this is not a problem we need to solve. Legion does not perform well when you oversubscribe the machine. But the solution to this is easy: stop allocating so many threads. The only time when this really causes problems is when you have a CPU-only machine where you are trying to squeeze maximum FLOPs out of the system. But this is not a scenario we anticipate in any major upcoming system (aside from possibly some of the Japanese supercomputers).
So overall my guidance would be:
1. Tune -ll:bgwork for the machine (see the sketch after this list); you should expect a local maximum at somewhere less than the theoretical max core count.
2. Don't use a high -ll:bgwork and also set -ll:cpu_bgwork 100 -ll:util_bgwork 100 at the same time. The high bgwork count makes it unnecessary for the CPUs to do anything. In fact, I would just avoid -ll:cpu_bgwork 100 -ll:util_bgwork 100 because the situations where you need it are likely to be very niche (i.e., CPU-only systems where you're trying to maximize application task throughput).
3. It is worth experimenting with -ll:bgwork. I saw a 2× speedup with S3D running 2 ranks per node, for example. The main thing is to avoid overcrowding as you do so: you generally want to leave plenty of room for Legion to do its analysis.
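As a sketch of the sweep suggested in point 1 (the launcher, node count, and application command line are placeholders; the right value is machine-dependent):
#!/bin/bash
# Try a range of background-worker thread counts and record each run's output;
# expect peak performance somewhere below the physical core count.
LAUNCH="srun -N 8"
APP="./my_app -i config.json"   # placeholder application command line
for bg in 2 4 8 12 16; do
  $LAUNCH $APP -ll:bgwork "$bg" | tee "bgwork_${bg}.log"
done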
As mentioned in a previous Legion meeting, we are seeing a significant degradation of performance when running HTR. We are comparing the performance of an old commit (cba415a85) to the performance of a new commit (91b55ce21) when weak scaling HTR. Note that we are using the same version of HTR to perform this test.
For instance, we are reporting here the profiles obtained on 8 nodes with the old commit and with the new commit.
In these profiles we see a minor degradation of the GPU usage (which translates into about a 10% increase of the time per step) but, more importantly, a very large section at the beginning of the run where it seems that nothing is happening. This "idle" time at the beginning seems to grow with some power of the number of nodes used in the calculations.
Adding @lightsighter and @seemamirch for visibility
@elliottslaughter, can you add this issue to #1032? Is it too late to insert this issue in the list tracked for the March release?