dannymitchell / sawbuck

Automatically exported from code.google.com/p/sawbuck

Disassembly should keep coverage metrics #32

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
If the disassembly kept metrics on how many bytes were disassembled, and how 
many bytes were attributed to data (jump tables), we could dump an overall 
disassembly coverage metric after decomposition.
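
A minimal sketch of the kind of bookkeeping this would need. The struct and its member names are hypothetical and do not come from the Sawbuck sources; it only assumes that the disassembler can report the size of each instruction and each data region it identifies.

```cpp
#include <cstddef>

// Hypothetical counters; none of these names exist in Sawbuck.
struct DisassemblyCoverage {
  std::size_t code_bytes = 0;   // Bytes decoded as instructions.
  std::size_t data_bytes = 0;   // Bytes attributed to data, e.g. jump tables.
  std::size_t total_bytes = 0;  // Total size of the ranges being decomposed.

  void AddCode(std::size_t size) { code_bytes += size; }
  void AddData(std::size_t size) { data_bytes += size; }

  // Fraction of the total accounted for as either code or data.
  // Returns 0 when no total has been set, to avoid dividing by zero.
  double Coverage() const {
    if (total_bytes == 0)
      return 0.0;
    return static_cast<double>(code_bytes + data_bytes) / total_bytes;
  }
};
```

The counters could be dumped once after decomposition; anything left uncovered would be bytes that were neither disassembled nor identified as data.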

Original issue reported on code.google.com by siggi@chromium.org on 9 Mar 2011 at 3:13

GoogleCodeExporter commented 9 years ago

Original comment by chri...@google.com on 22 Mar 2011 at 7:05

GoogleCodeExporter commented 9 years ago
Our current disassembler makes many assumptions about the code it is parsing.  
Notably, we assume certain behaviour regarding the placement and use of lookup 
tables.  Hand-coded assembly does many things that violate these assumptions 
(notably the entire CRT library; a particularly bad offender is memcpy).  It 
would be useful to be able to distinguish hand-written assembly from 
compiler-generated code, and only enforce our stronger assumptions on the latter.  The 
DIA API exposes this information via IDiaSymbol::get_language, and it would be 
useful to annotate blocks with this information, extending BlockAttributeEnum.

Original comment by chri...@google.com on 23 Mar 2011 at 6:55

GoogleCodeExporter commented 9 years ago
Unfortunately, after exhaustively exploring the DIA symbols, I found no 
reliable way to determine whether a function is built from assembly or from a 
higher-level language.  

The main motivation for finding this information was to handle data 
sections. The compiler appears to put any static data, including jump tables, 
at the end of the function.  Assembly functions can place data 
wherever they want, including in the middle of the function body.  Our data 
detection routines could be smarter if we knew that the code was 
generated by the compiler.

Further investigations into the available DIA symbols revealed that information 
regarding all static data *is* included in the PDB.  Pushing this information 
to the disassembler (along with alignment information, also present in the PDB) 
should allow us to get a full disassembly of functions, including all data and 
padding bytes.  It also allows us to move away from heuristics for finding data 
locations, which often fail in hand-coded assembly.  (For example, we presently 
assume that lookup tables are zero-indexed, but in 'memcpy' they are not.  This 
causes us to identify certain bytes as data, when they are in fact part of an 
instruction.)

With this new information we will be able to skip the heuristics and reliably 
label data.  This will also allow us to stop the disassembler from running into 
data.

Presently, the Decomposer provides information to the Disassembler in two 
ways: through the OnInstruction callback, and through the Disassembler API 
prior to calling 'Walk'.  Using the OnInstruction callback is not sufficiently 
elegant because we can only provide information regarding an already disassembled 
instruction; we would be able to tell the disassembler to back up if it started 
running into known data, but without greatly changing the API we could not tell 
it about data extents.

In my mind, the simplest approach would be to extend Disassembler to accept 
data extents much like it currently accepts labels using 'Unvisited'.
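
A rough sketch of what such data extents could look like, purely for illustration. The class name, its methods, and the use of 32-bit addresses are assumptions and are not part of the existing Disassembler API; the idea is only that the walk could query known data ranges before decoding into them.

```cpp
#include <cstdint>
#include <map>

// Hypothetical helper; the names and interface are illustrative only.
class DataExtents {
 public:
  // Records that [start, start + size) is known to be data.
  // Assumes extents do not overlap one another.
  void Add(std::uint32_t start, std::uint32_t size) {
    if (size != 0)
      extents_[start] = size;
  }

  // Returns true if [addr, addr + size) intersects any recorded extent.
  // Assumes addr + size does not wrap around.
  bool Overlaps(std::uint32_t addr, std::uint32_t size) const {
    if (size == 0)
      return false;
    // First extent that starts at or after the end of the queried range.
    auto it = extents_.lower_bound(addr + size);
    if (it == extents_.begin())
      return false;
    --it;  // Last extent that starts before the end of the queried range.
    return it->first + it->second > addr;
  }

 private:
  std::map<std::uint32_t, std::uint32_t> extents_;  // start -> size
};
```

With something like this populated before 'Walk', the disassembler could stop as soon as the next instruction would overlap a known extent, rather than being told to back up afterwards.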

Original comment by chri...@google.com on 24 Mar 2011 at 8:21

GoogleCodeExporter commented 9 years ago
It has been observed that our data finding/hitting heuristics are in fact 
incorrect.  We had previously been using the base address of table lookups (as 
an operand to jmp instructions) as an indication that data lives at that address. 
We would then stop disassembly when it would overrun what had been assumed to 
be data.  Unfortunately, for hand-written assembly these lookup tables are not 
always meant to be zero-indexed, in which case our assumed data location was 
wrong (see for example, memcpy).
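
To make the failure concrete, here is a small sketch of the arithmetic. The struct and function are hypothetical, and it assumes the smallest index the code ever uses is known, which is exactly what the heuristic cannot see. If the index starts at 1 with 4-byte entries, the base address encoded in the jmp is 4 bytes before the first real table entry, so treating the base as data mislabels whatever precedes the table.

```cpp
#include <cstdint>

// Hypothetical description of an indexed jump of the form
//   jmp dword ptr [displacement + index * entry_size]
struct IndexedJump {
  std::uint32_t displacement;  // Base address encoded in the instruction.
  std::uint32_t first_index;   // Smallest index value the code ever uses.
  std::uint32_t entry_size;    // Bytes per table entry.
};

// Address at which table data actually begins. This equals the
// displacement only when the table is zero-indexed.
inline std::uint32_t TableStart(const IndexedJump& jump) {
  return jump.displacement + jump.first_index * jump.entry_size;
}
```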

All of these heuristics become unnecessary once we extract reliable data 
information via DIA.

Original comment by chri...@chromium.org on 28 Mar 2011 at 1:40

GoogleCodeExporter commented 9 years ago
More accumulated knowledge that I feel the need to write down somewhere: the 
public symbols provided by DIA do not have meaningful lengths. In fact, the 
lengths are simply the distance between successive public symbols. However, we 
need to use them because they are the only place we get information about the 
location of virtual tables.
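
A quick way to check this observation on a list of symbols, as a sketch only. `PublicSymbol` is a simplified stand-in and not a DIA type, and the function assumes the symbols have already been read out and sorted by address.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified view of a public symbol; not a DIA type.
struct PublicSymbol {
  std::uint32_t rva;     // Relative virtual address of the symbol.
  std::uint32_t length;  // Length as reported for the symbol.
};

// Returns true if every reported length equals the gap to the next symbol.
// Assumes `symbols` is sorted by ascending rva. The last symbol is skipped
// because it has no successor to compare against.
inline bool LengthsAreJustGaps(const std::vector<PublicSymbol>& symbols) {
  for (std::size_t i = 0; i + 1 < symbols.size(); ++i) {
    if (symbols[i].length != symbols[i + 1].rva - symbols[i].rva)
      return false;
  }
  return true;
}
```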

Original comment by chri...@chromium.org on 8 Apr 2011 at 8:18

GoogleCodeExporter commented 9 years ago
Fixed in http://code.google.com/p/sawbuck/source/detail?r=253.

Original comment by chri...@chromium.org on 19 Apr 2011 at 6:09