binsync / libbs

A library for writing plugins in any decompiler: includes API lifting, common data formatting, and GUI abstraction!
BSD 2-Clause "Simplified" License

Feat: Make `Artifact`s support in-structure commenting #102

Open mahaloz opened 2 months ago

mahaloz commented 2 months ago

Background

In most decompilers, such as IDA Pro, types can carry comments inside them, like:

struct Elf64_Vernaux // sizeof=0x10
{                                       // XREF: LOAD:0000000000400410/r
     unsigned __int32 vna_hash;         // this is some comment on this first member
     unsigned __int16 vna_flags;
     unsigned __int16 vna_other;
     unsigned __int32 vna_name __offset(OFF64,0x400390);
     unsigned __int32 vna_next;
};

libbs does not currently support this. An ideal solution would look like this:

my_struct = deci.structs["Elf64_Vernaux"]
print(my_struct.comments[0])         # this is some comment on this first member
print(my_struct.members[0].comment)  # this is some comment on this first member

Implementation

To support this type of commenting, we'll need to do a few things:
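As a starting point, one possible shape for per-member comments on a struct artifact, sketched here with plain dataclasses (all names are hypothetical and not the current libbs API):

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class StructMember:
    name: str
    offset: int
    type_: str
    comment: Optional[str] = None  # the new piece: a per-member comment

@dataclass
class Struct:
    name: str
    members: List[StructMember] = field(default_factory=list)

    @property
    def comments(self) -> Dict[int, str]:
        """Index member comments by member position, so struct.comments[0] works."""
        return {i: m.comment for i, m in enumerate(self.members) if m.comment is not None}

# Mirrors the Elf64_Vernaux example above.
my_struct = Struct("Elf64_Vernaux", [
    StructMember("vna_hash", 0x0, "unsigned __int32",
                 comment="this is some comment on this first member"),
    StructMember("vna_flags", 0x4, "unsigned __int16"),
])
```

Exposing both `struct.comments[i]` and `struct.members[i].comment` as two views of the same data keeps the two access patterns from the example consistent.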

arizvisa commented 2 months ago

if you don't want to use the edm_t.cmt and udm_t.cmt attributes to enumerate or serialize complex field comments, you can also unpack/save them from the result of tinfo_t.serialize(), which was the pre-8.4 method anyway ("fields" work similarly).

decoding the bytes returned by tinfo_t.serialize() into a list of comments is basically: consume a byte, determine whether it encodes an 8-bit or 16-bit length, decode that length, use it to extract the comment bytes, utf-8 decode those bytes, and repeat until done.

    def decode_bytes(data):
        '''Decode the given `data` into a list of (length, bytes) pairs, one per encoded string.'''
        results, iterable = [], iter(bytearray(data))

        integer = next(iterable, None)
        while integer is not None:
            length_plus_one = integer
            # lengths below 0x7f fit in a single byte; larger ones are followed by a continuation byte
            one = 1 if length_plus_one < 0x7f else next(iterable, None)
            assert (one == 1) and length_plus_one > 0
            encoded = bytearray(byte for index, byte in zip(range(length_plus_one - 1), iterable))   # using zip to clamp bytes consumed
            results.append((length_plus_one - 1, encoded))

            integer = next(iterable, None)
        return results

encoding the string passed to tinfo_t.deserialize(til, type, fields, cmts=None) requires encoding the length of each utf-8 encoded comment and concatenating everything back into a stream of bytes.

apologies for the unreadability of the following; "encode_length" is all that's relevant.

    import itertools

    def encode_bytes(strings):
        '''Encode the list of `strings` with their lengths and return them as bytes.'''
        # each chunk is prefixed by its length + 1; lengths at or above 0x7f take the multi-byte form
        encode_length = lambda integer: bytearray([integer + 1] if integer + 1 < 0x80 else [integer + 1, 1])
        iterable = (bytes(string) if isinstance(string, (bytes, bytearray)) else string.encode('utf-8') for string in strings)
        pairs = ((len(chunk), chunk) for chunk in iterable)
        return bytes(bytearray().join(itertools.chain(*((encode_length(length), bytearray(chunk)) for length, chunk in pairs))))
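To make the length-plus-one scheme concrete, here is a minimal round-trip sketch of the format as described above. Only the short case (length + 1 below 0x80, a single length byte) is handled; the multi-byte length form used for long comments in real serialized type strings is deliberately left out. The helper names are hypothetical and not part of libbs or the IDA SDK:

```python
def encode_comments(comments):
    """Encode each comment as a length-plus-one byte followed by its utf-8 bytes."""
    out = bytearray()
    for comment in comments:
        data = comment.encode('utf-8')
        assert len(data) + 1 < 0x80, "long comments would need the multi-byte length form"
        out.append(len(data) + 1)   # stored length carries a +1 bias
        out += data
    return bytes(out)

def decode_comments(blob):
    """Walk the byte stream, undoing the +1 bias to recover each comment."""
    comments, pos = [], 0
    while pos < len(blob):
        length = blob[pos] - 1      # undo the plus-one bias
        pos += 1
        comments.append(blob[pos:pos + length].decode('utf-8'))
        pos += length
    return comments

round_trip = decode_comments(encode_comments(["first member", "", "next"]))
# round_trip == ["first member", "", "next"]
```

Note that an empty comment still occupies one byte (the value 1), which is presumably how fields without comments keep their position in the stream.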

however, it's worth confirming that the performance of serializing/deserializing these at scale actually matters in binsync. minsc creates an index of all commentable "things" so that they can be tagged for searching and (mis-)used to store nearly-arbitrary data; being able to check whether a tinfo_t even has comments, or to distinguish exactly what was updated (name/comment/other) in response to events without iterating through every field one-by-one, made a difference there.

...i'm literally praying that they don't try to retrofit repeatable/non-repeatable comments into this btw.