StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu
Apache License 2.0

Performance degradation when running HTR #1652

Closed. mariodirenzo closed this issue 5 months ago

mariodirenzo commented 8 months ago

As mentioned in a previous Legion meeting, we are seeing a significant degradation of performance when running HTR. We are comparing the performance of an old commit (cba415a85) to the performance of a new commit (91b55ce21) when weak scaling HTR. Note that we are using the same version of HTR to perform this test.

For instance, we are reporting here the profiles obtained on 8 nodes with the old commit and with the new commit.

In these profiles we see a minor degradation of GPU usage (which translates into roughly a 10% increase in the time per step) but, more importantly, a very large section at the beginning of the run during which nothing seems to be happening. This "idle" time at the beginning seems to grow with some power of the number of nodes used in the calculation.

Adding @lightsighter and @seemamirch for visibility

@elliottslaughter, can you add this issue to #1032? Is it too late to add it to the list tracked for the March release?

lightsighter commented 8 months ago

Something bad is happening in the network. Please regenerate the profiles with the Rust profiler. It is probably too late for this to make the March release.

albovoci commented 8 months ago

@lightsighter These are the raw log files from each run. https://drive.google.com/drive/folders/1YqqTWMqo7d1JJyXqAg9W0oP1hdojhYoN?usp=share_link

lightsighter commented 8 months ago

These profiles were not generated for the most recent version of the profiler. You either need to regenerate them or provide the precise Legion commit hash that you used to generate them.

albovoci commented 8 months ago

Legion commit hash from 30 January 2023: cba415a857c2586b2ad2f4848d6d1cd75de7df00
Legion commit hash from 28 January 2024: 91b55ce217933ec6cd546c0731a6652031dfc7e9

lightsighter commented 8 months ago

Those are both going to be too old to show the information we need. Please generate a slow profile from the most recent Legion master branch.
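
Concretely, that means a fresh run with profiling enabled on a build of current master, something along these lines (the launcher, executable name, and node count below are placeholders):

# Hypothetical launch line: enable Legion's profiling logger on all 8 nodes
# and emit one gzipped log per rank ('%' expands to the rank id).
srun -N 8 ... ./my_app ... -lg:prof 8 -lg:prof_logfile prof_%.gz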

lightsighter commented 8 months ago

Also, please make sure all profiles are generated with the same version of GASNetEx.

albovoci commented 8 months ago

I'm assuming the slow profile is generated using the python file. I don't exactly know how to make sure they're generated using the same version of GASNetEx; any hints on that? Thanks.

lightsighter commented 8 months ago

I'm saying you need to generate log files from a run using the most recent Legion master branch as they will have considerably more data about the timing of various operations.

I don't exactly know how to make sure they're generated using the same version of GASNetEx; any hints on that?

Build and link against this version of GASNetEX from here. I'm not sure which build system you are using so it is up to you to integrate that into your build system.
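
As a rough sketch, assuming a Make-based Legion/Regent build (the install path and conduit below are placeholders and must match your system):

# Hypothetical environment for a Make-based build: point Legion's network
# module at an existing GASNetEX install and rebuild the application.
export USE_GASNET=1                      # enable the GASNet/GASNetEX network module
export GASNET=/path/to/gasnetex/install  # install prefix of the advised GASNetEX version
export CONDUIT=ibv                       # interconnect conduit, system-specific
make clean && make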

elliottslaughter commented 8 months ago

If you use setup_env.py this version will get pulled in automatically when you build the latest Legion.
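
Roughly, assuming a Regent build driven from the language/ directory of a fresh Legion checkout:

# Hypothetical invocation: the setup script fetches and builds the pinned
# dependencies (including GASNetEX) and then builds Legion/Regent.
cd legion/language
./setup_env.py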

albovoci commented 8 months ago

Thanks @elliottslaughter for the tip 👍🏼 @lightsighter I've uploaded the profiles to the same directory. Both Legion versions now use the GASNetEX you advised. The log files corresponding to the run that takes too long to get started use the most recent Legion master branch.

lightsighter commented 8 months ago

@elliottslaughter What happened to the dependent partitioning channel in the profiler?

lightsighter commented 8 months ago

@albovoci Have you marked all your partitions disjoint/aliased and complete/incomplete?

elliottslaughter commented 8 months ago

I pushed a fix for the profiler here: https://gitlab.com/StanfordLegion/legion/-/merge_requests/1169

Updated profile here: https://legion.stanford.edu/prof-viewer/?url=https://sapling2.stanford.edu/~eslaught/bug1652_7541cd950_gasnet_2023_9_0/

albovoci commented 8 months ago

I know the partitions are marked as disjoint/aliased; I'm not so sure about complete/incomplete. Maybe @mariodirenzo can clarify that better.

mariodirenzo commented 8 months ago

No, we do not mark the partitions as complete/incomplete, since we do not know this attribute at compile time; it depends on the input of the calculation.

lightsighter commented 8 months ago

Run with -ll:force_kthreads, capture on the order of ten periodic backtraces of all the threads on all the nodes during the first minute of the run, and report them here.

albovoci commented 8 months ago

Sorry, I'm kind of new to this :/ How do I "capture order ten periodic backtraces of all the threads on all the nodes"?

elliottslaughter commented 8 months ago

Here's an example script for you:

#!/bin/bash

set -e

"$@" &
pid=$!

for i in {1..10}; do
  sleep 6
  # attach gdb non-interactively, dump backtraces of all threads, then detach
  gdb -p $pid -ex 'set confirm off' -ex 'set height 0' -ex 'set width 0' -ex 'thread apply all backtrace' -ex 'quit'
done

# wait for backtraces to complete and then kill the process
if [[ ${KILL_PROCESS_AFTER_BACKTRACE:-1} = 1 ]]; then
    sleep 2m
    kill $pid
    sleep 1m
    kill -9 $pid
fi

Run it like this:

srun ... ./run_with_backtraces.sh ./my_app ...

You might want to redirect the output of each rank to its own file so the backtraces don't get interleaved. Modify as necessary for your particular system.
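
For example, a minimal wrapper along these lines (assuming a SLURM launcher, where SLURM_PROCID identifies the rank; the file name is arbitrary):

#!/bin/bash
# Hypothetical per-rank wrapper: run the backtrace script and send each
# rank's output to its own file (SLURM_PROCID is set by srun).
exec ./run_with_backtraces.sh "$@" > "backtraces_rank_${SLURM_PROCID:-0}.log" 2>&1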

seemamirch commented 8 months ago

@albovoci - which version of HTR are you using to reproduce this?

albovoci commented 8 months ago

@seemamirch It's this commit: 2198880d

seemamirch commented 7 months ago

Backtraces (10 per run, 1 every 10 seconds) are on sapling under /scratch2/seemah/htr_issue_1652/results_O2_g:
Legion commit hash from 30 January 2023: cba415a857c2586b2ad2f4848d6d1cd75de7df00 -> output_fileold.<node#>
Legion commit hash from 28 January 2024: 91b55ce217933ec6cd546c0731a6652031dfc7e9 -> outputfile.<node#>

lightsighter commented 7 months ago

Can we do these runs without -lg:safe_ctrlrepl? That flag is known to cause huge amounts of blocking in the top-level task and is messing with the results. Unless, of course, the original runs were done with -lg:safe_ctrlrepl, in which case the solution is to not use it for performance runs.

lightsighter commented 7 months ago

If you do re-run, you just need to do the newer commit and not the older one.

lightsighter commented 7 months ago

@elliottslaughter Why would Regent be attaching semantic information for tags other than semantic tag 0 (the name tag)? For example,

#12 0x0000200000bad81c in Legion::Internal::Runtime::attach_semantic_information (this=0x194713b0, handle=..., fid=<optimized out>, tag=54321, buffer=0x2031d5824f30, size=4, is_mutable=<optimized out>) at /usr/WS1/mirchandaney1/legion_new/runtime/legion/runtime.cc:20212
#13 0x00002000006ea4fc in Legion::Runtime::attach_semantic_information (this=<optimized out>, handle=..., fid=<optimized out>, tag=<optimized out>, buffer=<optimized out>, size=<optimized out>, is_mut=<optimized out>) at /usr/WS1/mirchandaney1/legion_new/runtime/legion/legion.cc:7386
#14 0x000020000070f920 in legion_field_id_attach_semantic_information (runtime_=..., handle_=..., id=<optimized out>, tag=54321, buffer=0x2031d5824f30, size=4, is_mutable=<optimized out>) at /usr/WS1/mirchandaney1/legion_new/runtime/legion/legion_c.cc:2064
#15 0x000000001018f8b4 in $<workSingle> ()
#16 0x00000000101281b4 in $__regent_task_workSingle_primary ()

Why would there be 54321 semantic tags?

elliottslaughter commented 7 months ago

There are not 54321 semantic tags. It is a magic number:

https://gitlab.com/StanfordLegion/legion/-/blob/master/language/src/regent/std_base.t#L497

And the top of that block explains:

https://gitlab.com/StanfordLegion/legion/-/blob/master/language/src/regent/std_base.t#L480

seemamirch commented 7 months ago

@lightsighter Backtraces (10 per run, 1 every 10 seconds) without -lg:safe_ctrlrepl are on sapling under /scratch2/seemah/htr_issue_1652/results_O2_g_no_safe_ctrlrepl. The performance runs don't have this option. The 8-node runs failed; the others completed. I also see warnings like this one:

output_file_8.1:[1 - 20406181f8b0] 4.605703 {4}{runtime}: [warning 1114] LEGION WARNING: Failed to find a refinement for KD tree with 3 dimensions and 21 rectangles. Please report your application to the Legion developers' mailing list. (from file /usr/WS1/mirchandaney1/legion_new/runtime/legion/region_tree.inl:6446)

lightsighter commented 7 months ago

The only major thing that the workSingle task seems to be waiting on is futures. This means that the workSingle task is waiting on at least hundreds if not thousands of futures every single run (as evidenced by all the different waits in the profile), especially at the start. If it were waiting on something else more frequently, that would show up in the backtraces, but I only see a few semantic attach waits, whereas the future wait calls are 10X more prominent. It's still possible that semantic attach operations are the issue and the backtrace samples just happen later in the run; it's hard to tell without timestamps. @elliottslaughter can we disable all the semantic attach operations? Can we also see why we might be waiting on lots of futures in Regent tasks?

The 8-node runs failed; the others completed

I wouldn't worry about any crashes. Even the new branch is two months old and there have been many bug fixes since then.

output_file_8.1:[1 - 20406181f8b0] 4.605703 {4}{runtime}: [warning 1114] LEGION WARNING: Failed to find a refinement for KD tree with 3 dimensions and 21 rectangles. Please report your application to the Legion developers' mailing list. (from file /usr/WS1/mirchandaney1/legion_new/runtime/legion/region_tree.inl:6446)

Mario and I already discussed those; they are the result of an unusual aliased partition that is being made. You don't need to worry about them, and they are not going to be responsible for the profiling effects we're observing. If they were a problem, they would show up very loudly as long-running meta-tasks and would be prominently visible in the profiles.

mariodirenzo commented 7 months ago

I want to point out that we are using the same version of HTR to test the old and new versions of Legion. The application issues the same semantic-attach and future-wait operations to Regent and the runtime regardless of which version of the runtime we are using.

can we disable all the semantic attach operations?

We need that semantic information in the mapper to perform our mapping decisions.

Can we also see why we might be waiting on lots of futures in Regent tasks?

We are waiting for some futures in the top-level task and the only workaround is to use predication. Unfortunately, predication is not compatible with tracing so, if we gain some performance on one side, we lose it on the other.

elliottslaughter commented 7 months ago

For what it's worth, I first added that semantic information in 8398065f88234674713aa22b0a9732dc014d04a3, June of 2022. So, in the time frame under discussion in this issue I believe the semantic information should be identical (or at least highly similar).

If you do want to run an experiment with this disabled, here are the lines to comment out:

https://github.com/StanfordLegion/legion/blob/c61071541218747e35767317f6f89b83f374f264/language/src/regent/codegen.t#L4715-L4722

lightsighter commented 7 months ago

The application issues the same semantic-attach and future-wait operations to Regent and the runtime regardless of which version of the runtime we are using.

I understand. I want to remove the semantic attaches to see if a change in the implementation of them causes the performance degradation or whether it was something else.

We need that semantic information in the mapper to perform our mapping decisions.

That seems pretty dicey. That's not really what the semantic information interface is for; it's more for logging and debugging, not for correctness. I suppose you can use it for that, but that's not what it is designed for and I can't promise it will be fast enough for all those cases.

We are waiting for some futures in the top-level task and the only workaround is to use predication. Unfortunately, predication is not compatible with tracing so, if we gain some performance on one side, we lose it on the other.

Let's turn off tracing and get rid of the waits and see if the profile is still "empty" (I don't care if it is slow, but I want to see the runtime busy doing all sorts of things). We don't have a good sense of what is causing all the waiting right now in the worker task, so we need to start removing all the sources of blocking on the application side.

For what it's worth, I first added that semantic information in https://github.com/StanfordLegion/legion/commit/8398065f88234674713aa22b0a9732dc014d04a3, June of 2022

I don't care about that specific one. I care about all of the semantic info attached by Regent or by users. I don't think any one piece of semantic info is the cause of the problem, but I want to start removing sources of blocking because the backtrace experiment was inconclusive.

rohany commented 7 months ago

I told Mike already, but I was able to reproduce the performance issues on 8 and 16 nodes of Lassen, but not on Perlmutter, indicating that something might be going on inside the network.

seemamirch commented 7 months ago

@mariodirenzo HTR uses the following two options: -ll:cpu_bgwork 100 -ll:util_bgwork 100 (time slicing util/CPU threads).
Without those options, the first iteration of the weakScaling test completes in 2.9 seconds on 8 nodes and 14 seconds on 16 nodes (as per console.txt).
With these options, the first iteration completes in 16.045 seconds on 8 nodes and 79 seconds on 16 nodes.
Both results are for a newer version of Legion (master branch 410d81239). I'm not sure why you have set these options.

mariodirenzo commented 7 months ago

Those options used to help performance, as the CPU and util threads are usually idle during the execution. Is the performance without the flags in line with the numbers achieved with the old version of Legion?

seemamirch commented 7 months ago

Yes, and overall the new Legion performance is better. I compared with the same two Legion versions you mentioned in this issue:

new Legion, 16 nodes, 101 iterations, new options -> 140 seconds
old Legion, 16 nodes, 101 iterations, new options -> 168.9 seconds
old Legion, 16 nodes, 101 iterations, old options -> 193 seconds

Background worker (bgwork) threads are already set to 16 - having those options enabled on top of that is not helping.

seemamirch commented 6 months ago

Profiles for an HTR test on Lassen, 1 node, with -ll:cpu_bgwork 100 -ll:util_bgwork 100 (slow version): https://legion.stanford.edu/prof-viewer/?url=https://sapling2.stanford.edu/~seemah/accessor_gpu_128_sched/

Without these options (fast version): https://legion.stanford.edu/prof-viewer/?url=https://sapling2.stanford.edu/~seemah/accessor_gpu_128_no_sched/

FB to FB copies appear to be taking much longer in the slow version (n0f2-n0f2 channel).

elliottslaughter commented 5 months ago

Was there a conclusion on this? Was the issue fully resolved by removing the -ll:cpu_bgwork and -ll:util_bgwork flags?

seemamirch commented 5 months ago

These flags have been removed from the HTR scripts; @mariodirenzo will confirm that performance is now better. It's unclear, though, why the FB to FB copies are so much slower with these options.

mariodirenzo commented 5 months ago

The issue is solved by removing those flags but, as Seema mentioned, the reason is unclear. I am not particularly attached to those flags, so it is up to you to decide whether this is a satisfactory solution (and close the issue) or whether you want to further investigate what was going on.

lightsighter commented 5 months ago

@elliottslaughter Did we add some kind of detection for this in Legion Prof? I thought you might have done that but I can't remember.

elliottslaughter commented 5 months ago

I'm not sure we have a root cause yet, so it does not make sense to add any detection.

My understanding from the above discussion is that HTR was running with -ll:bgwork 16 (i.e., 16 background worker threads) as well as with -ll:cpu_bgwork 100 -ll:util_bgwork 100.

There are two possible hypotheses I think we can form based on this:

  1. The overhead was the result of lock contention that was exacerbated by dramatically increasing the number of threads involved in background work. (Frankly, 16 is already too many, and then if you add all the CPUs and all the Utils on top, it's just way too much.)
  2. The timeslice value of 100 for the CPU and Util processors was too small and was causing Realm overheads to be evident. If I understand correctly, the units here are microseconds. Based on Task Bench we know Realm can't really run smaller than 50 microsecond tasks. So at 100 microseconds it's possible that we were simply introducing dramatically higher overheads by forcing the micro-ops to be broken into much smaller pieces than usual.

Both of these hypotheses seem plausible to me. I suppose we could do an experiment with -ll:cpu_bgwork 500 -ll:util_bgwork 500 to see if that changes the observed behavior. At any rate, without getting to a root diagnosis I don't think there is anything to really do in terms of trying to catch this automatically.
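
Concretely, that experiment would just be a launch line along these lines (launcher, application name, and other flags are placeholders):

# Hypothetical experiment: raise the background-work threshold on the CPU and
# Util processors from 100us to 500us and compare against the no-flag baseline.
srun ... ./my_app ... -ll:bgwork 16 -ll:cpu_bgwork 500 -ll:util_bgwork 500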

One thing I think we should do is track the core binding in the profiler. I.e., we want to show a diagram similar to -ll:show_rsrv so that the user knows how Realm detected the machine and which processors were assigned to which cores. That doesn't automatically detect issues like this one, but it would let us (especially core maintainers helping users) diagnose certain kinds of problems, along with more common cases like binding all the processors to a single CPU core.
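
In the meantime, that information can at least be printed at startup with the existing flag (same placeholder launch line as above):

# Ask Realm to print its core reservations at startup so the processor-to-core
# binding is visible in the job output.
srun ... ./my_app ... -ll:show_rsrv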

elliottslaughter commented 5 months ago

A correction: My hypothesis (2) appears to be incorrect. Realm micro-ops do NOT get split depending on the value of -ll:cpu_bgwork or -ll:util_bgwork. Instead, these values work as a filter: CPU/Util processors simply will not run background work items estimated to be larger than the threshold.

(Having said that, I'm not sure that Realm goes to any particular effort to estimate the cost of micro-ops, so it's possible that this effectively turns into a binary flag. Either the CPU/Util runs background work items, or it does not.)

Therefore, nothing about this setting would result in overheads increasing on a per-micro-op basis as the micro-ops themselves are untouched.

So then the hypotheses that remain are either (1) that the additional threads introduce contention (as mentioned above), or (3) that having CPUs/Utils involved in background work either delays higher-priority CPU/Util tasks or else somehow interferes with the CPU/Util processors' ability to run such tasks (e.g., because more time is spent in lock contention).

lightsighter commented 5 months ago

I think (3) is more likely because we've turned over the responsibility for thread scheduling to the OS (probably Linux), and it's going to do its fair-share scheduling thing, which will round-robin between the background worker threads. If it has to do that for all the background worker threads, each of which is going to check queues for stuff to do, before getting back to the one thread that happens to have the interesting bit of work needed to make forward progress, that is going to cause massive slowdowns.

elliottslaughter commented 5 months ago

Ok. To answer @mariodirenzo's question from a couple comments back: I think this is not a problem we need to solve. Legion does not perform well when you oversubscribe the machine. But the solution to this is easy: stop allocating so many threads. The only time when this really causes problems is when you have a CPU-only machine where you are trying to squeeze maximum FLOPs out of the system. But this is not a scenario we anticipate in any major upcoming system (aside from possibly some of the Japanese supercomputers).

So overall my guidance would be:

  1. In general you don't want to blindly maximize -ll:bgwork. This value should be tuned for the machine and you should expect a local maximum at somewhere less than the theoretical max core count.
  2. It doesn't make sense to both increase -ll:bgwork and also set -ll:cpu_bgwork 100 -ll:util_bgwork 100 at the same time. The high bgwork count makes it unnecessary for the CPUs to do anything. In fact, I would just avoid -ll:cpu_bgwork 100 -ll:util_bgwork 100 because the situations where you need it are likely to be very niche (i.e., CPU-only systems where you're trying to maximize application task throughput).
  3. You might consider increasing ranks per node. On a system with this many cores, various overheads are likely to increase with the number of cores per rank. So running 2 or 4 ranks per node may reduce those overheads more effectively than increasing -ll:bgwork; I saw a 2× speedup with S3D running 2 ranks per node, for example (see the sketch after this list). The main thing is to avoid overcrowding as you do so: you generally want to leave plenty of room for Legion to do its analysis.
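
As a concrete (purely illustrative) starting point for point 3, something along these lines, with all values to be tuned for the machine:

# Hypothetical launch: 2 ranks per node with a moderate per-rank thread count,
# rather than one rank per node with -ll:bgwork pushed to the core count.
srun --ntasks-per-node=2 ... ./my_app ... -ll:cpu 1 -ll:util 2 -ll:bgwork 8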