CMUSTRUDEL / DIRTY

DIRTY: Augmenting Decompiler Output with Learned Variable Names and Types

encode_memory() dirtiness #15

Open kotee4ko opened 9 months ago

kotee4ko commented 9 months ago

Hello. Thank you for your kindness in sharing such a good project.

Could you please explain why we need encode_memory

    @staticmethod
    def encode_memory(mems):
        """Encode memory to ids

        <pad>: 0
        <SEP>: 1
        <unk>: 2
        mem_id: mem_offset + 3
        """
        ret = []
        for mem in mems[: VocabEntry.MAX_MEM_LENGTH]:
            if mem == "<SEP>":
                ret.append(1)
            elif mem > VocabEntry.MAX_STACK_SIZE:
                ret.append(2)
            else:
                ret.append(3 + mem)
        return ret

this function, and why does it attempt to compare integers with "<SEP>"? I can't make sense of appending an int token to an array of int tokens instead of an int token.

And my second question is about this one:


            def var_loc_in_func(loc):
                print(" TODO: fix the magic number for computing vocabulary idx")
                if isinstance(loc, Register):
                    return 1030 + self.vocab.regs[loc.name]
                else:
                    from utils.vocab import VocabEntry

                    return (
                        3 + stack_start_pos - loc.offset
                        if stack_start_pos - loc.offset < VocabEntry.MAX_STACK_SIZE
                        else 2
                    )

What does the 1030 constant do, and why?

And in general, why do we define the tokens like this:

            self.word2id["<pad>"] = PAD_ID
            self.word2id["<s>"] = 1
            self.word2id["</s>"] = 2
            self.word2id["<unk>"] = 3
            self.word2id[SAME_VARIABLE_TOKEN] = 4

but use them like this:

        <pad>: 0
        <SEP>: 1
        <unk>: 2
        mem_id: mem_offset + 3

Sorry if my questions are too many; I specialize in systems programming, and math with ML is a hobby for me. Thanks in advance =)

@pcyin @qibinc @jlacomis

qibinc commented 9 months ago

Hi @kotee4ko, thanks for your interest in this project! I apologize for the hacky implementation of the modeling part. I hope the following answers help:

Could you please explain why we need encode_memory

The encode_memory function is responsible for mapping a variable's size (red) and offsets (yellow) into vocab ids. Together with var_loc_in_func (green), it implements the encoding part of Figure 4 in the paper.

[image: Figure 4 from the paper, with the size (red), offsets (yellow), and location (green) parts highlighted]

This function is required because Transformers have a fixed vocab size. Instead of passing mems to the model directly, we need an upper limit MAX_STACK_SIZE and map the values exceeding it to the special token <unk> and its special id (2 in this case).
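For example, here is a worked illustration using values that show up in the debug log later in this thread (with MAX_STACK_SIZE = 1024):

# A variable of size 8 with a single start offset 0:
mems = (8, 0)
# 8 -> 3 + 8 = 11 and 0 -> 3 + 0 = 3, so encode_memory(mems) == [11, 3].
# An offset larger than MAX_STACK_SIZE (say 2000) would instead map to 2, i.e. <unk>.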

this function, and why does it attempt to compare integers with "<SEP>"?

Since mems is an int array, the condition if mem == "<SEP>" is never met and can be safely deleted.
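For reference, a sketch of the function with that dead branch removed (behavior otherwise unchanged):

    @staticmethod
    def encode_memory(mems):
        """Encode memory to ids: <pad> = 0, <SEP> = 1, <unk> = 2, mem_id = mem_offset + 3."""
        ret = []
        for mem in mems[: VocabEntry.MAX_MEM_LENGTH]:
            if mem > VocabEntry.MAX_STACK_SIZE:
                ret.append(2)        # out-of-range offsets collapse to <unk>
            else:
                ret.append(3 + mem)  # shift past the reserved special ids
        return ret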

I can't make sense of appending an int token to an array of int tokens instead of an int token.

Could you elaborate more on this question?

What does the 1030 constant do, and why?

A variable can live in a register or on the stack. In order to distinguish register x from stack position x, we assign them different ids. In this case, the vocab id range [0, 1027] (1027 = 3 + MAX_STACK_SIZE) represents stack positions, and [1030, 1030 + <vocab size of registers>] represents register positions.
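Concretely, a minimal sketch of that id layout (the constants mirror the code above; vocab.regs['56'] == 5 is taken from the dump later in this thread, and the helper names here are mine, not from the repo):

MAX_STACK_SIZE = 1024
REG_BASE = 1030          # the magic constant used in var_loc_in_func (= 3 + MAX_STACK_SIZE + 3)

def stack_loc_id(rel_offset: int) -> int:
    """Stack position -> id in [3, 3 + MAX_STACK_SIZE]; out-of-range -> 2 (<unk>)."""
    return 3 + rel_offset if rel_offset < MAX_STACK_SIZE else 2

def register_loc_id(reg_vocab_id: int) -> int:
    """Register -> id in [1030, 1030 + len(vocab.regs))."""
    return REG_BASE + reg_vocab_id

print(stack_loc_id(56))      # 59   -> stack offset 56 (cf. vloc:59 in the log below)
print(register_loc_id(5))    # 1035 -> register '56', whose reg vocab id is 5 (cf. vloc:1035)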

why do we define tokens as x but use them as y

We have two Transformer encoders: XfmrSequentialEncoder, responsible for code tokens, and XfmrMemEncoder, responsible for mem tokens (location, size, offsets). They have separate embeddings and vocabs. The first snippet, with self.word2id, is for the code vocab, while the second one, with mem_id, is for the mem vocab.
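Put differently, the two encoders read the same integers against different tables (the ids below are the ones quoted in this thread, shown side by side only for illustration):

# Code-token vocab (XfmrSequentialEncoder):
code_word2id = {"<pad>": 0, "<s>": 1, "</s>": 2, "<unk>": 3}   # plus SAME_VARIABLE_TOKEN = 4, ...
# Mem vocab (XfmrMemEncoder), produced by encode_memory / var_loc_in_func:
mem_special = {"<pad>": 0, "<SEP>": 1, "<unk>": 2}             # plus stack offset v -> v + 3
# So the integer 3 means "<unk>" to the code encoder but "stack offset 0" to the mem encoder.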

Feel free to follow up if you have more questions!

kotee4ko commented 9 months ago

@qibinc, Sir, there are a few things I can't get:

1) Do we always need to adjust each token by 3 (the count of special tokens) in the token list?
2) For the second "block" of tokens (registers), do we need to adjust it TWICE?
3) Why can't we just let the constant-sized register block come first, [3, 3 + len(vocab.regs)], and the memory (stack) block come second, [len(vocab.regs) + (3 or 3+3?), len(vocab.regs) + (3 or 3+3?) + MAX_STACK_SIZE]?
4) Given reg_name:56, reg_id:5, which one are we going to adjust by 3? Does it depend on the encoder (56 for the code vocab and 5+3 for the mem vocab)?

5) Is this what is expected?

============
New loc:Reg 56; src_var.typ.size:8 src_var.typ.start_offsets():(0,)
calculating variable location, loc__:Reg 56
variable type is register , will adjust position by MAX_STACK_SIZE+3=1027
reg_name:56, reg_id:5
mems:(8, 0)
ret:[11, 3]
vloc:1035 
tmem:[11, 3] 
var_sequence:[1035, 11, 3]

New loc:Stk 0xa0; src_var.typ.size:144 src_var.typ.start_offsets():(0, 8, 16, 24, 28, 32, 36, 40, 48, 56, 64, 72, 88, 104, 120)
calculating variable location, loc__:Stk 0xa0
variable type is stack, will adjust position by 0+3=3
VocabEntry.MAX_STACK_SIZE:1024, stack_start_pos:216, offset:160
mems:(144, 0, 8, 16, 24, 28, 32, 36, 40, 48, 56, 64, 72, 88, 104, 120)
ret:[147, 3, 11, 19, 27, 31, 35, 39, 43, 51, 59, 67, 75, 91, 107, 123]
vloc:59 
tmem:[147, 3, 11, 19, 27, 31, 35, 39, 43, 51, 59, 67, 75, 91, 107, 123] 
var_sequence:[59, 147, 3, 11, 19, 27, 31, 35, 39, 43, 51, 59, 67, 75, 91, 107, 123]

6) Why can't we just redefine the special tokens as negative values, to avoid incrementing every value in the list?
7) Here in the comments it says relative position for mem/stack and absolute location for regs. Generally, this value is just a position/offset adjusted by a container-type constant, right? And why don't we check the bounds of the stack? I mean, here it can hold, but the variable's total size (in the case of an array or structure) could overflow this limit, and in that case we would have offsets that would be interpreted as registers?

Where is the code that processes these encoded tensors?

Wow, this is real hardcore, Sir.

kotee4ko commented 9 months ago

@qibinc

Sir, I need to refactor the code so I can run it on a very specific AMD GPU. Can you tell me whether this code would be logically correct? The difference is in the accuracy() method, which behaves a bit differently than the old one.

Thanks.


import torch
# Assumption: `accuracy` is torchmetrics.functional.accuracy (the API that takes
# `task` and `num_classes`); adjust the import if your metrics come from elsewhere.
from torchmetrics.functional import accuracy


def tmaxu(t1, t2):
    """Upper bound on the number of classes present in either tensor (at least 2)."""
    tm1, tm2 = t1.unique().numel(), t2.unique().numel()
    # print(f"T1 (len = {t1.numel()}, uniq: {tm1} \n{t1}\n"
    #       f"T2 (len = {t2.numel()}, uniq: {tm2} \n{t2}\n")
    return max(tm1, tm2, 2)


    # (method of the LightningModule; indentation kept from the class body)
    def _shared_epoch_end(self, outputs, prefix):
        final_ret = {}
        if self.retype:
            ret = self._shared_epoch_end_task(outputs, prefix, "retype")
            final_ret = {**final_ret, **ret}
        if self.rename:
            ret = self._shared_epoch_end_task(outputs, prefix, "rename")
            final_ret = {**final_ret, **ret}
        if self.retype and self.rename:
            # Evaluate rename accuracy on correctly retyped samples
            retype_preds = torch.cat([x["retype_preds"] for x in outputs])
            retype_targets = torch.cat([x["retype_targets"] for x in outputs])
            rename_preds = torch.cat([x["rename_preds"] for x in outputs])
            rename_targets = torch.cat([x["rename_targets"] for x in outputs])
            binary_mask = retype_preds == retype_targets
            if binary_mask.any():
                p_t = rename_preds[binary_mask]
                t_t = rename_targets[binary_mask]
                self.log(
                    f"{prefix}_rename_on_correct_retype_acc",
                    accuracy(
                        p_t,
                        t_t,
                        task="multiclass",
                        num_classes=tmaxu(p_t, t_t),
                    ),
                )

        return final_ret

kotee4ko commented 9 months ago

[screenshot] Wohoooo! Seems it is working?

@qibinc Thanks

One more question: how should I average the name and type predictions? Just (name_loss + type_loss) / 2, or something else?
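In other words, something like this minimal sketch (an equally weighted average; the weighting is just my guess, not taken from the repo):

# name_loss and type_loss are the scalar losses of the two prediction heads
total_loss = 0.5 * (name_loss + type_loss)
# or, more generally, a weighted combination:
# total_loss = alpha * name_loss + (1.0 - alpha) * type_loss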

kotee4ko commented 9 months ago

Wow, what a dirty trick!

(Pdb) model.vocab.regs.id2word {0: '<pad>', 1: '<s>', 2: '</s>', 3: '<unk>', 4: '<IDENTITY>', 5: '56', 6: '64', 7: '16', 8: '8', 9: '24', 10: '72', 11: '80', 12: '32', 13: '496', 14: '104', 15: '48', 16: '120', 17: '112', 18: '512', 19: '128', 20: '528', 21: '544', 22: '560', 23: '88', 24: '576', 25: '592', 26: '608', 27: '40', 28: '440', 29: '432', 30: '96', 31: '1', 32: '1280', 33: '424', 34: '472', 35: '448', 36: '464', 37: '480', 38: '456', 39: '5', 40: '0', 41: '1288', 42: '624', 43: '1296', 44: '640', 45: '656', 46: '672', 47: '688', 48: '2', 49: '1284', 50: '144', 51: '704', 52: '192', 53: '720', 54: '736', 55: '1312', 56: '400', 57: '1344', 58: '1328', 59: '1360', 60: '1376', 61: '1408', 62: '1392', 63: '1424', 64: '3', 65: '1440', 66: '1456', 67: '1472', 68: '1488', 69: '1504', 70: '1520', 71: '1536', 72: '1552', 73: '1568', 74: '1584', 75: '1600', 76: '1616', 77: '1632', 78: '1648', 79: '1664', 80: '1696', 81: '176', 82: '368', 83: '384', 84: '1680', 85: '1712', 86: '1728', 87: '1760', 88: '1792', 89: '1824', 90: '500'}


{"name":"openWrite",
"code_tokens":["__int64","__fastcall","openWrite","(","const","char","*","@@a1@@",",","int","@@a2@@",")","{","int","@@v3@@",";","int","@@oflag@@",";","if","(","@@a2@@",")","@@oflag@@","=","Number",";","else","@@oflag@@","=","Number",";","@@v3@@","=","open","(","@@a1@@",",","@@oflag@@",",","Number","L",")",";","if","(","@@v3@@","<","Number",")","errnoAbort","(","String",",","@@a1@@",")",";","return","(","unsigned","int",")","@@v3@@",";","}"],
"source":
    {
        "s8":{"t":{"T":1,"n":"int","s":4},"n":"v3","u":false},
        "s4":{"t":{"T":1,"n":"int","s":4},"n":"oflag","u":true},
        "r56":{"t":{"T":3,"t":"const char"},"n":"a1","u":false},
        "r64":{"t":{"T":1,"n":"int","s":4},"n":"a2","u":false}},
        "target":{"s8":{"t":{"T":1,"n":"int","s":4},"n":"fd","u":true},
        "s4":{"t":{"T":1,"n":"int","s":4},"n":"flags","u":true},
        "r56":{"t":{"T":3,"t":"char"},"n":"fname","u":false},
        "r64":{"t":{"T":1,"n":"int","s":4},"n":"append","u":false}},
        "test_meta":{"function_name_in_train":false,"function_body_in_train":false}
}

dict_keys([
    'index', 
    'src_code_tokens', 
    'variable_mention_to_variable_id', 
    'variable_mention_mask', 
    'variable_mention_num', 
    'variable_encoding_mask',
    'target_type_src_mems',
    'src_type_id', 
    'target_mask', 
    'target_submask', 
    'target_type_sizes'
])

(Pdb) model.vocab.names.id2word[5] = ''        # first non-special elem in the names vocab starts at offset +5
(Pdb) model.vocab.types.id2word[7] = '__int64' # first non-special elem in the types vocab starts at offset +7

(Pdb) input_dict['index'] = [
['74a2c0823cd15148ca542a6e3350cd617789ffb2e3c1f9a424e72190e3975875', 'openWrite', '@@a1@@'], 
['74a2c0823cd15148ca542a6e3350cd617789ffb2e3c1f9a424e72190e3975875', 'openWrite', '@@a2@@'], 
['74a2c0823cd15148ca542a6e3350cd617789ffb2e3c1f9a424e72190e3975875', 'openWrite', '@@v3@@'], 
['74a2c0823cd15148ca542a6e3350cd617789ffb2e3c1f9a424e72190e3975875', 'openWrite', '@@oflag@@']]

(Pdb) input_dict['src_code_tokens'][0].numel() = 87
input_dict['src_code_tokens'][0] = 
tensor(
    [
        1, 2069, 2008, 2012, 2010, 2063, 3251, 3877, 1995, 2088, 2046, 2001,
        9917, 1226, 2007, 2021, 9917,  402, 1996, 2019, 2021, 9917, 1263, 1997,
        2021, 9917, 1316, 1997, 2029, 1995, 9917,  402, 1996, 9917, 1316, 2009,
        2004, 1997, 2082, 9917, 1316, 2009, 2004, 1997, 9917, 1263, 2009, 3251,
        1995, 9917, 1226, 2007, 9917, 1316, 2007, 2004, 2027, 1996, 1997, 2029,
        1995, 9917, 1263, 2038, 2004, 1996, 9917, 2506, 9981, 3733, 1995, 2065,
        2007, 9917, 1226, 1996, 1997, 2049, 1995, 2036, 2021, 1996, 9917, 1263,
        1997, 2020,    2
    ]
)

(Pdb) input_dict['variable_mention_to_variable_id'][0].numel() = 87
(Pdb) input_dict['variable_mention_to_variable_id'][0] =
tensor(
    [
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 2, 0,
        0, 0, 3, 0, 0, 0, 0, 1, 0, 0, 3, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 2, 0, 0,
        0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0
    ]
)

(Pdb) input_dict['variable_mention_mask'][0].numel() = 87
(Pdb) input_dict['variable_mention_mask'][0] = 
tensor(
    [
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 1.,
        0., 0., 0., 0., 1., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 1., 0.,
        0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 1.,
        0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.
    ]
)

(Pdb) input_dict['variable_mention_num'] = tensor([   [3., 2., 4., 4.]   ])

(Pdb) input_dict['variable_encoding_mask'] = tensor([    [1., 1., 1., 1.]    ])

(Pdb) input_dict['target_type_src_mems'] = 
tensor(
    [ #3d (func)
        [ #2d (var)
            [1035,   11,    3], #1d (var_mem_repr): loc id (offset+3 if < 1024+3, else 1024+3+3+reg_id), size+3, field_offset+3
            [1036,    7,    3], 
            [   3,    7,    3],
            [   7,    7,    3]
        ]
    ]
)
(Pdb) model.vocab.regs.word2id['56'] = 5
(Pdb) model.vocab.regs.word2id['64'] = 6

# so, 1035-1024 = 11; 11 - 3 - 3 = 5; 5 == 5;
# and 1036-1024 = 12; 12 - 3 - 3 = 6; 6 == 6;

(Pdb) input_dict['src_type_id'] = tensor(
    [
        [9, 5, 5, 5] # vars type ids
    ]
)
(Pdb) model.vocab.types.word2id['const char *'] = 9 # "r56":{"t":{"T":3,"t":"const char"},"n":"a1","u":false},
(Pdb) model.vocab.types.word2id['int'] = 5          # "r64":{"t":{"T":1,"n":"int","s":4},"n":"a2","u":false}},

# src_type_id --> model.vocab.types