HPCToolkit / hpctoolkit

HPCToolkit performance tools: measurement and analysis components

Design & Implement per-Thread Metric Blaming #492

Closed: blue42u closed this issue 2 years ago

blue42u commented 2 years ago

Background

Performance metrics are by nature "symptomatic": they are attributed to where a performance characteristic arises; for instance, REALTIME is attributed to the instructions that are taking the time. Usually, once a performance issue or hotspot is identified, the next step is to manually determine the root cause of the issue and then remove it.

Sometimes we can do better than that: we can instead identify instruction-level interactions with binary analysis and "blame" the cause rather than the symptom. Sometimes this is required to get reasonable performance data for an architecture; for instance, pipeline-wait instructions (common in GPU architectures) are never the root cause of a performance problem. In theory, this kind of "blaming" should significantly improve the quality of the performance analysis we can provide.

Motivating Initial Questions

Ping @aarontcopal2 for comments since this is in part motivated by the Intel GPU front.

mxz297 commented 2 years ago

I agree with the basic idea and want to add my thoughts here:

  1. I don't know whether it is always reasonable or possible to do blame shifting on every metric. An instruction that incurred a REALTIME metric could mean it is indeed executed quite often and efficiently (in this case, we cannot really blame other instructions), or that it is stalled quite often, which leads to a high REALTIME value (in this case, we may or may not be able to figure out why the instruction is stalled purely from the instruction itself). Some metrics, by definition, are easier to blame than others, such as memory stalls. Time measurements such as REALTIME or CYCLES are not easy to blame-shift, and I think that's why we generally complement time-based measurements with stall-based metrics.

  2. When we talk about "blame", we often mean instruction-level blame, like the kind we discover through backward slicing. Such blame is good for memory stalls and instruction-dependency stalls. It may not necessarily be useful for other instruction latencies, such as instruction-cache misses: if an instruction caused an i-cache miss, I don't feel we can blame any particular other instruction for it. Maybe this is simply another case where i-cache misses are a metric we cannot blame.

  3. Regarding when the blame should be done, I think we have two general cases: (1) the blame has to be done at hpcrun time, while measuring the program (an example is the GPU IDLENESS metric, where we maintain a runtime count of in-flight GPU operations and blame threads whenever that count is zero; a minimal sketch of this bookkeeping follows this list), and (2) the blame is done post-mortem, where we just record the metric and its PC at hpcrun time and then apply whatever binary analysis is needed post-mortem. For type (1), we have a much tighter budget in terms of software dependencies and how much we can slow down the application. For type (2), we can leverage whatever parallelism we can think of to speed things up.

  4. Regarding whether blame is optional or required, or whether unblamed data is needed, I am inclined to believe that no one really knows the answer, and my current answer is "maybe".... The actual performance-diagnosis process is still, and will continue to be, complex and ad hoc. We do not know when we will need what, nor how to fix a given problem. I think we will need several years of work to figure out a more principled way of diagnosing performance problems, especially for GPUs.
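To make case (1) above concrete, here is a minimal sketch of the kind of runtime bookkeeping the GPU IDLENESS example implies. This is not hpcrun's actual implementation; the hooks and the metric name are invented for illustration.

```cpp
// Hypothetical sketch of type (1), runtime idleness blaming. The hooks and the
// metric name are invented for illustration only.
#include <atomic>

enum Metric { GPU_IDLE_BLAME };

// Stand-in for the measurement tool's metric-attribution routine.
void attribute_metric(void* calling_context, Metric m, double value) { /* record a sample */ }

static std::atomic<long> g_active_gpu_ops{0};  // in-flight GPU operations

void on_gpu_op_launch()   { g_active_gpu_ops.fetch_add(1, std::memory_order_relaxed); }
void on_gpu_op_complete() { g_active_gpu_ops.fetch_sub(1, std::memory_order_relaxed); }

// Called from a thread's sampling handler.
void on_cpu_sample(void* calling_context, double sample_period) {
  if (g_active_gpu_ops.load(std::memory_order_relaxed) == 0) {
    // No GPU work in flight: blame the currently executing code for the
    // idleness, in addition to whatever symptomatic metric is recorded.
    attribute_metric(calling_context, GPU_IDLE_BLAME, sample_period);
  }
}
```

The point is the type (1) budget: everything above has to be cheap enough to run inside the measurement tool while the application executes.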

blue42u commented 2 years ago

Thanks for the comments! I agree with @mxz297's points; my own comments are below:

  1. I'm assuming not all metrics will be blamed; so far I think whether a metric can be blamed depends on the metric, the binary architecture, and possibly the microarchitecture of the processor. There might be some surprising cases: consider an add instruction with no memory access; it should take a single cycle (on most processors), so its CYCLES samples can maybe be blamed back to CYCLES on its instruction dependencies. On the other hand, an L1 I-cache miss has no one to blame other than the instruction itself, and REALTIME could be caused by any number of things, so it's not known what to blame.
  2. The only kind of "blaming" I'm considering in this issue is based on instruction or memory dependencies (sorry for not being clear up front); this only occurs within single threads (per-Thread) and relies on binary analysis, so it is type (2). A small self-contained sketch of this kind of def-use blame shifting follows this list. There are some cases where we can generate type (1) blaming post-mortem (e.g. the GPU blame analysis in the trace view); I'm not considering that here either.
  3. I agree this is a big "maybe"; the only reason I bring it up now is because of @aarontcopal2's work, and because it's a lot easier to implement later when the pieces are put in place early. It may very well be that programmatic blaming won't be possible in enough cases to make it useful for human inspection. It would be nice to have a pilot study, preferably from someone who actually knows what they're doing in this direction.
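As a reference point for points 1-2 above, here is a small self-contained sketch of the def-use flavor of blame shifting. The data structures are invented for this example (they are not HPCToolkit's): given a def-use graph recovered by binary analysis, a stall metric at an instruction is shifted back onto the instructions that defined the operands it waits on.

```cpp
// Illustrative sketch of post-mortem (type 2) def-use blaming; the graph and
// metric representations are invented for this example.
#include <cstdint>
#include <map>
#include <vector>

using Addr = std::uint64_t;

// def_use[pc] = instructions that produced the operands the instruction at pc waits on.
using DefUseGraph = std::map<Addr, std::vector<Addr>>;

// Shift a stall metric from each stalled instruction to its definers,
// splitting the value evenly when there are several candidates.
std::map<Addr, double> blame_stalls(const std::map<Addr, double>& stall_metric,
                                    const DefUseGraph& def_use) {
  std::map<Addr, double> blamed;
  for (const auto& [pc, value] : stall_metric) {
    auto it = def_use.find(pc);
    if (it == def_use.end() || it->second.empty()) {
      blamed[pc] += value;  // nothing to blame: the metric stays where it was
      continue;
    }
    const double share = value / it->second.size();
    for (Addr def_pc : it->second) blamed[def_pc] += share;
  }
  return blamed;
}
```
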
aarontcopal2 commented 2 years ago

@blue42u @mxz297

I have some comments to add to this

Yes, the blaming approach should be post-mortem. In the current prototype version of latency blaming, we perform backward slicing on the gpubins inside hpcstruct. For large applications such as AMR-Wind, backward slicing on large gpubins takes hours. The resulting def-use graph is saved in a file within the measurements directory. I have a design suggestion: should these def-use files be output to a separate directory (say, a def_use directory) outside measurements? If we do this, the user need only run the blaming mode whenever the GPU kernel code is updated; subsequent runs of hpcrun can then be paired with the def_use directory.

blue42u commented 2 years ago

> For large applications such as AMR-Wind, backward slicing on large gpubins takes hours.

Hours seems like a lot; do you know where the time is going? Can it be easily accelerated with some parallelism?

> Should these def-use files be output to a separate directory (say, a def_use directory) outside measurements? If we do this, the user need only run the blaming mode whenever the GPU kernel code is updated; subsequent runs of hpcrun can then be paired with the def_use directory.

I think this method is a bit overcomplicated; we've discussed having an hpcstruct cache before, which would be much simpler. Then the user could just always run hpcstruct and not worry about whether things have changed or not, and gain the same time savings.

aarontcopal2 commented 2 years ago

The aggregate def-use computation takes more than an hour. IIRC, the largest single gpubin took 10-15 minutes (after leveraging parallelism).

Also, an hpcstruct cache does seem like a better solution. Are there any pointers to past discussions on this?

blue42u commented 2 years ago

I see, so it's more that there are a lot of gpubins, and the def-use analysis takes a while for any one of them.
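Assuming the per-gpubin analyses are independent, fanning the binaries out across threads is one cheap lever; a rough sketch (analyze_gpubin is a hypothetical placeholder for the real backward-slicing/def-use work, not an existing function):

```cpp
// Sketch: process independent gpubins concurrently.
#include <future>
#include <string>
#include <vector>

struct DefUseResult { std::string gpubin; /* ... def-use graph data ... */ };

// Placeholder for the expensive, per-binary backward slicing.
DefUseResult analyze_gpubin(const std::string& path) { return {path}; }

std::vector<DefUseResult> analyze_all(const std::vector<std::string>& gpubins) {
  std::vector<std::future<DefUseResult>> tasks;
  tasks.reserve(gpubins.size());
  for (const auto& path : gpubins)
    tasks.push_back(std::async(std::launch::async, analyze_gpubin, path));

  std::vector<DefUseResult> results;
  results.reserve(tasks.size());
  for (auto& t : tasks) results.push_back(t.get());
  return results;
}
```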

I think the Structfile cache was mentioned at one of the group meetings but never made it to a proper issue. IIRC there wasn't any previous discussion other than "that sounds useful, we should do that."

aarontcopal2 commented 2 years ago

@blue42u Although the discussion is generic, I can add a few points pertaining to latency-blaming:

> Should we present the original "unblamed" metric values as a separate metric column, or does it make sense to only present the "blamed" values?

I think both blamed and unblamed metrics should be visible. If the unblamed metrics need to be hidden, the show flag for those metrics should be set to false in hpcrun.

Is "blaming" optional or required? Should we error if the suitable binary analysis isn't available (i.e. missing a Structfile)?

Blaming is an expensive operation, so it should only be turned on by request. Passing flags as inputs could be one way of determining which modes of blaming need to be turned on inside prof2.

The blaming calculations involve a (complex) formula which may need multiple metrics as input, plus the def-use graph relation (for instruction-level blaming). I am not sure how to capture such formulae in a generic fashion. To give an example of the kind of calculation involved, I have attached the formula used for latency blaming. [attached image: latency-blaming formula]
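(The attached formula is not reproduced here. As a generic, invented stand-in only, not the actual latency-blaming formula, the shape of such a calculation is roughly: the latency observed at a use site is distributed over that instruction's def-use predecessors, weighted by a second metric.)

```cpp
// Generic stand-in ONLY: NOT the attached latency-blaming formula. It just
// illustrates a calculation that combines several metrics with the def-use
// relation: latency at a use is split across its definers, weighted by a
// second metric sampled at each definer.
#include <cstdint>
#include <map>
#include <vector>

using Addr = std::uint64_t;

double blame_share(Addr def_pc, const std::vector<Addr>& defs_of_use,
                   const std::map<Addr, double>& weight_metric,
                   double latency_at_use) {
  double total = 0.0, mine = 0.0;
  for (Addr d : defs_of_use) {
    auto it = weight_metric.find(d);
    const double w = (it != weight_metric.end()) ? it->second : 1.0;
    total += w;
    if (d == def_pc) mine = w;
  }
  return total > 0.0 ? latency_at_use * (mine / total) : 0.0;
}
```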

jmellorcrummey commented 2 years ago

The structure cache is largely implemented (#507). The one thing it doesn't do yet is handle GPU binaries well, because load modules for GPU binaries are identified by a full path rather than a relative path in the measurements directory. We can identify structure files by a relative path and a hash and have that work; that piece of the work remains to be done, and I haven't had time to do it.

I've asked Marty to finish up the work on an initial version of the structure cache that relies on both the full pathname (which is a bit of an issue for GPU binaries) and an MD5 hash of the binary.

This can be extended to deal with relative paths for GPU binaries, and perhaps path-agnostic lookups for CPU binaries, at a future time, which could be soon if it turns out to be important.
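For reference, a minimal sketch of the kind of cache key this implies: a (relative) load-module path combined with a content hash, so a rebuilt binary invalidates its cached structure file. FNV-1a stands in here for the MD5 hash the real cache uses, and the function names are invented for illustration.

```cpp
// Sketch only: combine a relative path with a content hash to key cached
// structure files. FNV-1a is used here purely for brevity.
#include <cstdint>
#include <fstream>
#include <string>

std::uint64_t fnv1a(std::istream& in) {
  std::uint64_t h = 1469598103934665603ull;
  char c;
  while (in.get(c)) { h ^= static_cast<unsigned char>(c); h *= 1099511628211ull; }
  return h;
}

// e.g. "gpubins/foo.gpubin" -> "gpubins/foo.gpubin+<hash>"
std::string struct_cache_key(const std::string& relative_path) {
  std::ifstream in(relative_path, std::ios::binary);
  return relative_path + "+" + std::to_string(fnv1a(in));
}
```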

blue42u commented 2 years ago

> The blaming calculations involve a (complex) formula which may need multiple metrics as input, plus the def-use graph relation (for instruction-level blaming). I am not sure how to capture such formulae in a generic fashion.

Having general formulae isn't necessary (the algorithms can be hard-coded), but what I do need is a sense of the limits on the variety of blaming algorithms we intend to support, as per my first question in the OP.

I don't think the design for this will be significant; at the moment I think adding 2-3 new methods to ProfileFinalizer and calling them at the appropriate places in ProfilePipeline will suffice. The question is what those methods are, and that's where I'm stuck. The design limitation in Prof2 is that the set of Contexts that blame could be attributed to must be generated early. The same limitation applies to the GPU context reconstruction bits.
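To make that concrete, here is a purely hypothetical interface sketch; the method names and signatures below are invented for discussion and are not part of the real ProfileFinalizer/ProfilePipeline API:

```cpp
// Hypothetical extension sketch ONLY -- not HPCToolkit's actual classes.
#include <vector>

class Context;  // stands in for Prof2's calling-context node

class BlameFinalizerExt {
public:
  virtual ~BlameFinalizerExt() = default;

  // Called early, while Contexts are being created: enumerate every Context
  // that could ever receive blame from `victim`, so the pipeline can
  // materialize them up front (the "generated early" constraint).
  virtual std::vector<Context*> blameTargets(Context& victim) = 0;

  // Called late, once metric values are known: move (or copy) metric values
  // from the victim Context onto the targets enumerated earlier.
  virtual void applyBlame(Context& victim, std::vector<Context*>& targets) = 0;
};
```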

Below are several more detailed questions about the general blaming algorithm that would help inform a Prof2 design. I know the answers for def-use blaming; what I would like to decide is the answers for all blaming algorithms we intend to experiment with in the near-term future.

Jokeren commented 2 years ago

> Can blame propagate across function calls? If not (e.g. def-use blaming), we can map every instruction to a set of sibling instructions (in the same function) which may receive blame. If so, can the blame targets be generated from a single Context, or do siblings/uncles also need to be considered?

Not possible on NVIDIA GPUs unless a function is inlined.

blue42u commented 2 years ago

> Not possible on NVIDIA GPUs unless a function is inlined.

@Jokeren Why not? If a pipeline stall occurs in a function, couldn't we blame it on global load instructions that were just before the function call? Or does the NVIDIA architecture or ISA make that impossible?

Jokeren commented 2 years ago

Before calling a device function, dependency barriers will be cleared to ensure they can be reused in the device function being called. GPU ISAs rely on software annotations to mark dependencies between instructions with variable latency (e.g., memory loads).

Jokeren commented 2 years ago

There's one edge case where cross-function dependencies might be possible: if a device function A is nested inside a global function B, then B does not have to follow the normal calling convention to invoke A; its call instruction behaves just like a jump instruction.

Anyway, these are all specifics of NVIDIA GPUs. Besides, I think Dyninst does not support cross-function slicing for now.

blue42u commented 2 years ago

This repository has moved to GitLab; please continue this issue there.