I agree with the basic idea and want to add my thoughts here:
I don't know whether it is always reasonable or possible to do blame shifting on every metric. An instruction that incurred a `REALTIME` metric could mean it is indeed executed quite often and efficiently (in this case, we cannot really blame other instructions), or it is stalled quite often, which leads to a high `REALTIME` metric value (in this case, we may or may not be able to figure out why this instruction is stalled purely based on the instruction itself). Some metrics, by definition, are easier to blame than others, such as memory stalls. Time measurements such as `REALTIME` or `CYCLES` are not easy to blame-shift, and I think that's why we generally complement time-based measurements with stall-based metrics.
When we talk about "blame", we often mean instruction-level blame, like the kind we discover through backward slicing. Such blame is good for memory stalls and instruction-dependency stalls. It may not necessarily be useful for other instruction latencies, such as instruction-cache misses. If an instruction caused an i-cache miss, I don't feel we can blame any particular other instruction for it. Maybe this is simply another case where i-cache misses are a metric we cannot blame.
Regarding when the blame should be done, I think we have two general cases: (1) the blame has to be done at hpcrun time while measuring the program (an example is the GPU IDLENESS metric, where at runtime we maintain the number of current GPU operations and blame threads when the count is zero), and (2) the blame is done post-mortem, where we just record the metric and its PC at hpcrun time and then use whatever binary analysis is needed post-mortem. For type (1), we have a much tighter budget in terms of software dependencies and how much we can slow down the application. For type (2), we can leverage whatever parallelism we can think of to speed things up.
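To make case (1) a bit more concrete, here is a minimal sketch of the kind of runtime bookkeeping involved; the hook names and the attribution callback are assumptions for illustration, not the actual hpcrun code:

```cpp
#include <atomic>

// Global count of GPU operations currently in flight, maintained by the
// launch/completion hooks below. (Names and structure are illustrative.)
static std::atomic<int> outstanding_gpu_ops{0};

// Hypothetical hooks called when a GPU operation is launched / completes.
void on_gpu_op_launch()   { outstanding_gpu_ops.fetch_add(1, std::memory_order_relaxed); }
void on_gpu_op_complete() { outstanding_gpu_ops.fetch_sub(1, std::memory_order_relaxed); }

// Hypothetical per-sample handler: if no GPU work is in flight when a CPU
// thread takes a sample, charge one sample period of GPU_IDLENESS blame to
// that thread's current calling context.
void on_cpu_sample(void* calling_context, double sample_period,
                   void (*attribute)(void* ctx, const char* metric, double value)) {
  if (outstanding_gpu_ops.load(std::memory_order_relaxed) == 0)
    attribute(calling_context, "GPU_IDLENESS", sample_period);
}
```

The point is that the blame decision has to happen inside the sample handler, which is exactly why the budget for dependencies and overhead is so tight in this case.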
Regarding whether blame is optional or required, or whether unblamed data is needed, I am inclined to believe that no one really knows the answer, and my current answer is "maybe".... The actual performance diagnosis process is still, and will continue to be, complex and ad hoc. We do not know when we will need what, and then how to fix a problem. I think we will need several years of work to figure out a more principled way of diagnosing performance problems, especially for GPUs.
Thanks for the comments! I agree with @mxz297's points, my own comments below:
For an `add` instruction with no memory access, it should take a single cycle (on most processors), so `cycles - samples` can maybe be blamed back to `cycles` on its instruction dependencies. On the other hand, an L1I-cache miss doesn't have anyone else to blame other than the instruction itself, and `REALTIME` could be caused by any number of things, so it's not known what to blame.
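To sketch what blaming the `cycles - samples` excess back along instruction dependencies could look like postmortem (the even split across def-use predecessors and all names below are illustrative assumptions, not a settled design):

```cpp
#include <cstdint>
#include <map>
#include <vector>

using InstAddr = std::uint64_t;

// Per-instruction sampled metrics: raw cycles and sample counts.
struct InstMetrics { double cycles = 0, samples = 0; };

// Shift the "excess" cycles of each instruction (cycles beyond one per
// sample) back onto its def-use predecessors.
void shift_latency_blame(
    const std::map<InstAddr, InstMetrics>& metrics,
    const std::map<InstAddr, std::vector<InstAddr>>& def_use_preds,
    std::map<InstAddr, double>& blamed_cycles) {
  for (const auto& [pc, m] : metrics) {
    double excess = m.cycles - m.samples;  // assumed "stall" portion
    if (excess <= 0) continue;
    auto it = def_use_preds.find(pc);
    if (it == def_use_preds.end() || it->second.empty()) {
      blamed_cycles[pc] += excess;  // e.g. an i-cache miss: keep blame local
      continue;
    }
    // Even split as a placeholder; a real formula would weight by the
    // expected latency of each defining instruction.
    const double share = excess / it->second.size();
    for (InstAddr def : it->second) blamed_cycles[def] += share;
  }
}
```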
@blue42u @mxz297 I have some comments to add to this:
Yes, the blaming approach should be postmortem. In the current prototype version of latency blaming, we perform backward-slicing on the gpubins inside hpcstruct. For large applications such as AMR-Wind, backward-slicing on large gpubins takes hours. The resulting def-use graph is saved in a file within the measurements directory. I had a design suggestion: should these def-use files be output to a separate directory (a `def_use` directory) outside measurements? If we do this, the user need only run the blaming mode whenever the GPU kernel code is updated. Subsequent runs of hpcrun can then be paired with the `def_use` directory.
For large applications such as AMR-Wind, backward-slicing on large gpubins takes hours.
Hours seems like a lot, do you know where the time is going? Can it be easily accelerated with some parallelism?
Should these def-use files be output to a separate directory (a `def_use` directory) outside measurements? If we do this, the user need only run the blaming mode whenever the GPU kernel code is updated. Subsequent runs of hpcrun can then be paired with the `def_use` directory.
I'm thinking this method is a bit overcomplicated; we've discussed having an `hpcstruct` cache before, which would be much simpler. Then the user could just always run `hpcstruct` and not worry about whether things have changed or not, and gain the same time savings.
The aggregate def-use computation takes >1hr. IIRC, the largest gpubin took 10-15 min (after leveraging parallelism).
Also, an `hpcstruct` cache does seem like a better solution. Is there a pointer to past discussions on this?
I see, so it's more that there are a lot of gpubins and def-use takes a while for any one of them.
I think the Structfile cache was mentioned at one of the group meetings but never made it to a proper issue. IIRC there wasn't any previous discussion other than "that sounds useful, we should do that."
@blue42u Although the discussion is generic, I can add a few points pertaining to latency-blaming:
Should we present the original "unblamed" metric values as a separate metric column, or does it make sense to only present the "blamed" values?
I think both blamed and unblamed metrics should be visible. If the unblamed metrics should be hidden, the show flag for those metrics should be set to false in hpcrun.
Is "blaming" optional or required? Should we error if the suitable binary analysis isn't available (i.e. missing a Structfile)?
Blaming is an expensive operation, so it should only be enabled on request. Passing flags as inputs can be one way of determining which modes of blaming need to be turned on inside prof2.
The blaming calculations involve a (complex) formula which may need multiple metrics as input and the def-use graph relation (for instruction-level blaming). I am not sure how to capture such formulae in a generic fashion. I have attached below the formula used for latency blaming just to give an example of the kind of calculation involved.
The structure cache is largely implemented in #507. The one thing it doesn't do yet is handle GPU binaries well, because load modules for GPU binaries are identified by a full path rather than a relative path in the measurements directory. We can identify structure files by a relative path and a hash and have that work. This is a piece of work that remains to be done that I haven't had time to do.
I've asked Marty to finish up the work on an initial version of the structure cache that relies on both the full pathname (which is a bit of an issue for GPU binaries) and an MD5 hash of the binary.
This can be extended to deal with relative paths for GPU binaries, and perhaps path-agnostic handling for CPU binaries, at a future time, which could be soon if important.
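For concreteness, the cache-key computation could look roughly like the sketch below; the directory layout and the FNV-1a stand-in for MD5 are assumptions for illustration, not what #507 actually does:

```cpp
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <iomanip>
#include <sstream>
#include <string>

namespace fs = std::filesystem;

// Stand-in content hash (FNV-1a) used only to keep this sketch
// self-contained; the real cache would use an MD5 digest of the binary.
std::string content_hash_hex(const fs::path& binary) {
  std::ifstream in(binary, std::ios::binary);
  std::uint64_t h = 0xcbf29ce484222325ull;
  for (char c; in.get(c);)
    h = (h ^ static_cast<unsigned char>(c)) * 0x100000001b3ull;
  std::ostringstream os;
  os << std::hex << std::setw(16) << std::setfill('0') << h;
  return os.str();
}

// Key the cached Structfile on both the binary's name and its content hash,
// so a recompiled binary at the same path never hits a stale entry.
fs::path cached_structfile(const fs::path& cache_dir, const fs::path& binary) {
  return cache_dir / binary.filename() / (content_hash_hex(binary) + ".hpcstruct");
}
```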
The blaming calculations involve a (complex) formula which may need multiple metrics as input and the def-use graph relation (for instruction-level blaming). I am not sure how to capture such formulae in a generic fashion.
Having general formulae isn't necessary (the algorithms can be hard-coded), but what I do need is a sense for the limits of the variety of blaming algorithms we intend to support, as per my first question in the OP.
I don't think the design for this will be significant; ATM I think adding 2-3 new methods to `ProfileFinalizer` and calling them in `ProfilePipeline` in appropriate places will suffice. The question is what those methods are, and that's where I'm stuck. The design limitation in Prof2 is that the set of Contexts that blame could be shifted to must be generated early. The same limitation is in the GPU context reconstruction bits.
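For the record, the rough shape I'm imagining is something like the sketch below; the hook names, signatures, and stand-in types are invented for illustration and are not existing Prof2 API:

```cpp
#include <vector>

// Stand-ins for Prof2 types, only so the sketch is self-contained.
struct Context;
struct MetricValues;

// Hypothetical hooks a blaming-aware finalizer might add.
struct BlamingHooks {
  virtual ~BlamingHooks() = default;

  // Called early by the pipeline, so every Context that might later receive
  // blame exists before metric attribution starts (the limitation above).
  virtual std::vector<Context*> blameTargets(Context& source) = 0;

  // Called once metric values are settled, to shift values from the
  // symptomatic Context onto the targets returned earlier.
  virtual void shiftBlame(Context& source, MetricValues& values) = 0;
};
```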
Below are several more detailed questions about the general blaming algorithm that would help inform a Prof2 design. I know the answers for def-use blaming, what I would like to decide is the answers for all blaming algorithms we intend to experiment with in the near-term future.
1. Is the blame assigned from a Context `A` onto `B1,B2,B3,...` always purely a function of the metric values on `A,B1,B2,B3,...`, or can it involve the metric values of some unobviously related Context `C`?
2. Can blame propagate across function calls? If not (e.g. def-use blaming), we can map every instruction to a set of sibling instructions (in the same function) which may receive blame. If so, can the blame targets be generated from a single Context, or do siblings/uncles also need to be considered?
Not possible on NVIDIA GPUs unless a function is inlined.
Not possible on NVIDIA GPUs unless a function is inlined.
@Jokeren Why not? If a pipeline stall occurs in a function, couldn't we blame it on global load instructions that were just before the function call? Or does the NVIDIA architecture or ISA make that impossible?
Before calling a device function, dependency barriers will be cleared to ensure they can be reused in the device function being called. GPU ISAs rely on software annotation to mark dependencies between instructions with variable latency (e.g., memory load).
There's one edge case where cross-function dependencies might be possible. If a device function A is nested inside a global function B, then B does not have to follow the normal calling convention to invoke A; its call instruction behaves just like a jump instruction.
Anyway, these observations are all for NVIDIA GPUs. Besides, I think Dyninst does not support cross-function slicing for now.
Background
Performance metrics by nature are "symptomatic": they get attributed to where a performance characteristic arises; for instance, `REALTIME` is attributed to the instructions that are taking time. Usually, once a performance issue/hotspot is identified, the next step is to manually determine the root cause of the issue and then remove it.
Sometimes we can do better than that: we can instead identify instruction-level interactions with binary analysis and "blame" the cause rather than the symptom. Sometimes this is required to get reasonable performance data for an architecture; for instance, pipeline-wait instructions (common in GPU architectures) are never the root cause of a performance problem. In theory, doing "blaming" should significantly improve the quality of the performance analysis we can provide.
Motivating Initial Questions
Ping @aarontcopal2 for comments since this is in part motivated by the Intel GPU front.