Self timings vs. hierarchical timings

alecazam commented 9 months ago

Noticing that the timings don't add up, or are spread across multiple cores. One issue appears to be that top offenders are just the hierchical timing of the calls, without subtracting the children.

Would be nice to have self timings sorted out, so can see which headers are specifically at fault for all the build slowdowns. Maybe there's a mode for this, and I'm just too much of a newbie using the tool. But I do love it. So thanks for making this open-source.

Here's an example from our traces. So hash_table is a part of unordered_map. I doubt those both take up 137s and 123s. But maybe hash_table is 123/137 of unordered_map. Self timings would more accurately portray the offender. But it is a huge amount of instantiations of both. It's just when I look at -ftime-trace flame graphs other functions we have are not self times, and have a lot of overlap.

**** Template sets that took longest to instantiate:
137531 ms: std::unordered_map<$> (16285 times, avg 8 ms)
123470 ms: std::__hash_table<$> (18299 times, avg 6 ms)
 65691 ms: std::function<$>::function<$> (1805 times, avg 36 ms)
 65352 ms: std::__function::__value_func<$>::__value_func<$> (1805 times, avg 36 ms)
 54030 ms: std::__function::__func<$>::__func (1805 times, avg 29 ms)
 50907 ms: std::forward_as_tuple<$> (8135 times, avg 6 ms)
 48883 ms: std::allocator_traits<$> (43333 times, avg 1 ms)
 47541 ms: std::__function::__alloc_func<$>::__alloc_func (5415 times, avg 8 ms)

aras-p commented 9 months ago

Yeah I have not implemented self-timings report (yet?). Not sure when/if I'll get around to that.

alecazam commented 9 months ago

What I'm trying to figure out is who is instantiating that hash_map 16k times. I assume the template set is collecting the instantiation of any function out of that template. But it's pretty terrible to always see these system STL templates at the tops of builds. Some description of the categories would be helpful in the readme. But I'l see if the source provides more detail.

ben-craig commented 9 months ago

Hi alecazam. I authored the template set reporting. The std::unordered_map entries are for instantiating the class, and not any of the methods. You can see how std::function::function also shows up, so that specific function / ctor is tracked separately from the class.

The template set functionality shines most when finding expensive templates that are instantiated with a moderate number of unique template parameters. That's usually a good hint to optimize the template itself, often by extracting parts that don't depend on the template parameter.

If you are still concerned about unordered_map uses, then I would suggest grepping your source code for unordered_map usages in headers, and then keep an eye out for places where you would get a large quantity of different types as key / value pairs. In the past, I've seen that when using enums as keys or values. You can then try to decide what to do about these cases. I've done things where I've type erased the enums into their underlying types, and that let me cut down on unique types and instantiations by quite a bit.

alecazam commented 9 months ago

Seems like the Variant union is still a good idea in this day and age. I just wish I had some idea of what file was generating all the templates. I have a good idea in this case, but wondering if that could be automated. With STL there ‘an always a level above that is the culprit.

alecazam commented 9 months ago

I'm building a profiler shell to help with build, memory, and perf optimization. It's a flamegraph tool using Perfetto right now. And I'd like the CBA iteration to be a part of it. At first probably just calling the CBA cli, but would like to eventually tie into the C++ calls themselves. Not officially released yet, but still pretty usable. Still a few little issues to fix.

https://github.com/alecazam/kram/tree/main/kram-preview

I just added support in this tool to swap out the clang name "Source" fields with the file details, and I found all the unordered_map instantiations. Thanks for pointing me to the problem!

alecazam commented 9 months ago

Actually I cut 30s worth of unordered_map, but still have to find the rest. This involves searching through the clang json output on a per file basis. So anything to accelerate this would help.

Something like top headers contributing to that count. So just one level up that isn't the same STL class. Problem is we use wrappers like TGCMap.h and TGCVector.h to wrap unordered_map and vector usage, but we really need one level up from even that.

13steinj commented 8 months ago

Coincidentally I'm running into the same/similar issue/request-- I'm finding that in really long templated-function call stacks (that I don't have any choice in them being so long), it matters more to me that the self-time of a given template took a given amount of time. For example, this output:

**** Templates that took longest to instantiate:
126389 ms: org::apps::appname::visitorPatternFunction<(lambda at header.hpp:16:9)> (6 times, avg 21064 ms)

Is incredibly misleading. Because looking at the callstack using -ftime-trace (thanks btw!), there's a ~ 1 minute 3 second 200 ms call that calls a ~ 1 minute 3 second 100 ms call which calls .... everything else, followed by 4 cheap calls elsewhere in the trace. Initial inspection without checking would make one thing "holy shit, this function takes an average of 21s to instantiate!?" but the reality is it just calls other functions of which a giant tree is the vast majority of 2 calls. Adding stddev and range might be nice too.

If the self-time of a template is a certain amount of time, and fairly consistent, that template can be reduced look into removal.

But if it's taking into account child time, not that helpful.

alecazam commented 8 months ago

Seems like CBA has already built up the parent/child lists for the build events. Self time is just adding all of the children durations to another field in the parent (I call this durSub), then subtracting that from the dur to get the self time. This better pinpoints slow headers, since the child timing wouldn't be a part of the report.

I added the self timings for the parser to the Build Report of kram-profile. Don't have the other stats yet, but the same logic applies to all the stats.

aras-p / ClangBuildAnalyzer

Self timings vs. hierarchical timings #92