adrianstone55 / SymbolSort

A Utility for Measuring C++ Code Bloat
http://gameangst.com/?p=320
Apache License 2.0

See how much bloat is generated by template class #17

Open stgatilov opened 6 years ago

stgatilov commented 6 years ago

A lot of code bloat comes from generic template classes. For instance, take a class MyVector defined in MyVector.h. It would be great if SymbolSort made it possible to see how much code was generated by such a class.

Right now it is possible to analyze object files (COMDAT), but there is no way to group symbols by class or by header file in that case. It is also possible to analyze a PDB, but then duplication of symbols across object files is not taken into account (and that duplication matters when analyzing build times).

I see two approaches to implement this feature:

  1. Extract classes from symbol names. Ideally, they can be extracted together with their namespaces, e.g. std::_XTree, and then grouped the way SymbolSort already groups paths. This is perhaps the best approach, but given how many special kinds of symbols exist, it is very hard to do right. In fact, doing it right requires a full-fledged parser of symbol names (and decorated symbols are perhaps even easier to parse than undecorated ones).

  2. Attribute each symbol to the source file where its code is located. This information is absent from object files, but it is present in PDB files. So it is possible to read object file dumps for the main data, then read PDB files solely to assign the proper code location to each symbol. This approach has some disadvantages: mainly, not all symbols are present in the PDB, and not all symbols have a location in source code.
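
A minimal sketch of the join that the second approach describes, assuming the object file dumps and the PDB have already been parsed into plain Python structures. The names `obj_symbols`, `pdb_locations`, and the `<unknown>` bucket are hypothetical and are not SymbolSort's actual types:

```python
from collections import defaultdict

def attribute_to_sources(obj_symbols, pdb_locations):
    """Sum symbol sizes per source file.

    obj_symbols:   iterable of (object_file, symbol_name, size) tuples,
                   one entry per occurrence in an object file.
    pdb_locations: dict mapping symbol_name -> source file path.
    Returns a dict mapping source file -> (count, total_size).
    Symbols that the PDB does not locate land under '<unknown>'.
    """
    stats = defaultdict(lambda: [0, 0])
    for _object_file, name, size in obj_symbols:
        source = pdb_locations.get(name, "<unknown>")
        stats[source][0] += 1      # every duplicate across object files is counted
        stats[source][1] += size
    return {source: tuple(pair) for source, pair in stats.items()}
```

Because the sizes come from the object files, a template method instantiated in two translation units is counted twice, which is the duplication a PDB-only analysis would miss.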

stgatilov commented 6 years ago

I have implemented the second approach in my fork. You can see the full set of changes here.

Please let me know if a pull request is welcome.

P.S. Approach 2 has some additional advantages. For instance, in theory it is possible to produce an annotated version of the source files, where count/total stats are added as a comment before each function.
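
A rough sketch of what such annotation could look like, assuming the stats have already been keyed by the line on which each function starts. The function name, the `function_stats` layout, and the comment format are all hypothetical:

```python
def annotate_source(lines, function_stats):
    """Insert a stats comment before each function that has stats.

    lines:          list of source lines without trailing newlines.
    function_stats: dict mapping the 1-based line number where a function
                    starts -> (count, total_bytes).
    Returns a new list of lines; the input list is not modified.
    """
    annotated = []
    for number, line in enumerate(lines, start=1):
        if number in function_stats:
            count, total = function_stats[number]
            # Reuse the function's own indentation for the comment line.
            indent = line[: len(line) - len(line.lstrip())]
            annotated.append(f"{indent}// count={count} total={total} bytes")
        annotated.append(line)
    return annotated
```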

stgatilov commented 6 years ago

I have also implemented the first approach, i.e. extracting a classpath from the symbol name. It works like this:

  1. Take the raw symbol name (i.e. the mangled/decorated one).
  2. Undecorate it partially, omitting the return value and function parameters (and probably something else).
  3. Parse the undecorated name using several templates, regexes, and other dirty stuff like that.
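
To illustrate step 3, here is a minimal sketch that splits an already undecorated name (return value and parameters omitted) into classpath components. It uses a simple bracket-depth scan instead of the regexes mentioned above, so it is not the fork's implementation; `split_classpath` is a hypothetical name, and names containing `operator<` or `operator->` would need special handling that is left out here:

```python
def split_classpath(name):
    """Split 'ns::Class<Args>::member' on top-level '::' and drop template arguments."""
    parts, current, depth = [], [], 0
    i = 0
    while i < len(name):
        ch = name[i]
        if ch == "<":
            depth += 1
        elif ch == ">":
            depth -= 1
        elif depth == 0 and name.startswith("::", i):
            # A '::' outside any template argument list ends a component.
            parts.append("".join(current))
            current = []
            i += 2
            continue
        # Keep only characters outside template brackets.
        if depth == 0 and ch not in "<>":
            current.append(ch)
        i += 1
    parts.append("".join(current))
    return parts
```

The resulting list can be treated like the components of a file path, so all instantiations of one template fall under the same classpath.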

First I tried to use UnDecorateSymbolName for step 2, but it is located in dbghelp.dll, which has not been updated for quite a long time. It cannot handle C++11 features like rvalue references. This implementation is currently in the classpath branch.

Then I switched to calling the undname.exe utility from the MSVC distribution. It works perfectly (it is perhaps the only official way to demangle MSVC symbols today). The code is in the classpath2 branch. All the differences can be seen here.

adrianstone55 commented 6 years ago

Hi, sorry for the slow response, but I've been away on vacation. I think your analysis of the problem is spot on. PDBs are interesting, but to analyze code bloat from weak instantiations you need to look at the OBJ files. I would probably lean towards the second approach, because trying to correlate input from two different sources could get messy, but there are advantages and disadvantages both ways.

If you want to put together a pull request, I'll happily consider it, but I might be a bit slow because I'm not actively maintaining the code anymore and I haven't even used it more than a couple times in the past five years.

stgatilov commented 6 years ago

Both approaches already work for me. Of course, both have pluses and minuses.

In the classpath approach, the analysis relies on hacky regexes for parsing symbol names. Despite that, almost all symbols are taken into account. In the PDB filepath approach, not all symbols actually have a location in the PDB. About 20-30% of symbols are usually implicitly generated stuff or some data. On the plus side, it gives per-directory stats, so it is very simple to see the code bloat from the whole STL.
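
The per-directory stats can be pictured with a small sketch like the one below, assuming per-file totals are already known. `rollup_by_directory` and the sample paths in the usage note are hypothetical:

```python
from collections import defaultdict

def rollup_by_directory(file_sizes):
    """Add each file's size to every directory that contains it.

    file_sizes: dict mapping source file path -> total size in bytes.
    Returns a dict mapping directory path (with '/' separators) -> total size.
    """
    totals = defaultdict(int)
    for path, size in file_sizes.items():
        parts = path.replace("\\", "/").split("/")
        # Every proper prefix of the path is an ancestor directory.
        for depth in range(1, len(parts)):
            totals["/".join(parts[:depth])] += size
    return dict(totals)
```

With this rollup, the total for a directory such as a compiler's `include` folder covers every header beneath it, which is how the cost of the whole STL would show up as a single number.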

My plan is to write a blog article about these two options. Then it will be easier to make a decision. P.S. For now, I will continue to post small pull requests...

stgatilov commented 6 years ago

Ok, finished with article.

Here is the full article. To save time, I suggest starting from the Improvements section.

Now I'll prepare pull requests for both features.