GoogleCodeExporter closed 9 years ago
Original comment by chri...@google.com
on 22 Mar 2011 at 7:05
Our current disassembler makes many assumptions about the code it is parsing.
Notably, we assume certain behaviour regarding the placement and use of lookup
tables. Hand-coded assembly does many things that violate these assumptions
(the entire CRT library, for example; memcpy is a particularly bad offender). It
would be useful to be able to distinguish hand-written assembly from
compiler-generated code, and only enforce our stronger assumptions on the
latter. The DIA API exposes this information via IDiaSymbol::get_language, and
it would be useful to annotate blocks with this information by extending
BlockAttributeEnum.
Original comment by chri...@google.com
on 23 Mar 2011 at 6:55
Unfortunately, after exhaustively exploring the DIA symbols there is no
reliable way to determine whether a function is built from assembly or from a
higher level language.
The main motivation for finding this information was to handle data
sections. We know that the compiler (seems to?) put any static data at the end
of a function, including jump tables, etc. Assembly functions can place data
wherever they want, including in the middle of the function body. Our data
detection routines could be smarter if we knew that the code was
generated by the compiler.
Further investigations into the available DIA symbols revealed that information
regarding all static data *is* included in the PDB. Pushing this information
to the disassembler (along with alignment information, also present in the PDB)
should allow us to get a full disassembly of functions, including all data and
padding bytes. It also allows us to move away from heuristics for finding data
locations, which often fail in hand-coded assembly. (For example, we presently
assume that lookup tables are zero-indexed, but in 'memcpy' they are not. This
causes us to identify certain bytes as data, when they are in fact part of an
instruction.)
With this new information we will be able to skip the heuristics and reliably
label data. This will also allow us to stop the disassembler from running into
data.
Presently, the Decomposer provides information to the Disassembler in two
ways: through the OnInstruction callback, and through the Disassembler API
prior to calling 'Walk'. Using the OnInstruction callback is not sufficiently
elegant because we can only provide information regarding an already
disassembled instruction; we would be able to tell the disassembler to back up
if it started running into known data, but without greatly changing the API we
could not tell it about data extents.
In my mind, the simplest approach would be to extend Disassembler to accept
data extents much like it currently accepts labels using 'Unvisited'.
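A minimal sketch of that idea follows, written in Python for brevity rather than in the project's own language. ToyDisassembler, add_unvisited, add_data_extent and the instruction_length callback are all hypothetical stand-ins, not the real Disassembler API. The toy ignores control flow and only walks straight-line code; its one purpose is to show a walk that stops at the first address inside a known data extent.

```python
from bisect import bisect_right


class ToyDisassembler:
    """Toy straight-line walker (hypothetical, not the real Disassembler)."""

    def __init__(self, code_size, instruction_length):
        self.code_size = code_size
        # Callable mapping an address to the (positive) length of the
        # instruction that starts there. Stands in for a real decoder.
        self.instruction_length = instruction_length
        self.unvisited = set()   # entry points still to be walked
        self.data_extents = []   # sorted list of half-open (start, end)
        self.visited = set()     # addresses decoded as instruction starts

    def add_unvisited(self, addr):
        self.unvisited.add(addr)

    def add_data_extent(self, start, length):
        self.data_extents.append((start, start + length))
        self.data_extents.sort()

    def in_data(self, addr):
        # Find the last extent whose start is <= addr, then test its end.
        i = bisect_right(self.data_extents, (addr, float("inf"))) - 1
        return i >= 0 and addr < self.data_extents[i][1]

    def walk(self):
        while self.unvisited:
            addr = self.unvisited.pop()
            while (addr < self.code_size
                   and addr not in self.visited
                   and not self.in_data(addr)):
                self.visited.add(addr)
                addr += self.instruction_length(addr)
        return sorted(self.visited)
```

With two-byte instructions, a 16-byte function and a data extent covering bytes 6 to 9, a walk from address 0 visits 0, 2 and 4 and then stops. The code after the data is only reached if it is separately registered as unvisited.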
Original comment by chri...@google.com
on 24 Mar 2011 at 8:21
It has been observed that our data finding/hitting heuristics are in fact
incorrect. We had previously been using the base address of table lookups (as
an operand to jmp instructions) as an indication that data lives at that
address. We would then stop disassembly when it would overrun what had been
assumed to be data. Unfortunately, for hand-written assembly these lookup
tables are not always meant to be zero-indexed, in which case our assumed data
location is wrong (see, for example, memcpy).
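To make the failure concrete, here is a small sketch with made-up addresses. The function names, the 4-byte entry size and the layout are illustrative assumptions, not values taken from memcpy or from our code. It compares where the heuristic assumes the data starts with where a table actually starts when its smallest index is not zero.

```python
ENTRY_SIZE = 4  # assumed bytes per jump-table entry (32-bit pointers)


def naive_data_start(operand_base):
    """The heuristic: treat the base address in the jmp operand as data."""
    return operand_base


def actual_table_extent(operand_base, first_index, entry_count):
    """Half-open [start, end) range the table really occupies when the
    smallest index ever used is first_index rather than 0."""
    start = operand_base + first_index * ENTRY_SIZE
    return start, start + entry_count * ENTRY_SIZE


def mislabelled_bytes(operand_base, first_index, entry_count):
    """Addresses the heuristic marks as data that precede the real table."""
    start, _ = actual_table_extent(operand_base, first_index, entry_count)
    return list(range(naive_data_start(operand_base), start))
```

For a one-indexed table with operand base 0x1000, the real table starts at 0x1004, so the heuristic labels the four bytes at 0x1000 through 0x1003 as data even though they may belong to an instruction. For a zero-indexed table, no bytes are mislabelled.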
All of these heuristics become unnecessary once we extract reliable data
information via DIA.
Original comment by chri...@chromium.org
on 28 Mar 2011 at 1:40
More accumulated knowledge that I feel the need to write down somewhere: the
public symbols provided by DIA do not have meaningful lengths. In fact, the
lengths are simply the distance between successive public symbols. However, we
need to use them because they are the only place we get information about the
location of virtual tables.
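As an illustration of why those lengths carry no real size information, the sketch below derives a "length" for each public symbol purely from the address of the next one. The function name is hypothetical, and treating the last symbol as running to the end of the section is an assumption made only so the example is complete.

```python
def implied_lengths(addresses, section_end):
    """Map each symbol address to the distance to the next symbol.

    The final symbol is assumed to extend to section_end.
    """
    ordered = sorted(addresses)
    ends = ordered[1:] + [section_end]
    return {start: end - start for start, end in zip(ordered, ends)}
```

Given symbols at 0x10, 0x18 and 0x40 in a section ending at 0x50, the implied lengths are 0x08, 0x28 and 0x10 respectively, regardless of how large each object actually is.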
Original comment by chri...@chromium.org
on 8 Apr 2011 at 8:18
Fixed in http://code.google.com/p/sawbuck/source/detail?r=253.
Original comment by chri...@chromium.org
on 19 Apr 2011 at 6:09
Original issue reported on code.google.com by
siggi@chromium.org
on 9 Mar 2011 at 3:13