Open kotee4ko opened 9 months ago
Hi @kotee4ko , thanks for your interest in this project! I apologize for the hacky implementation of the modeling part. Hope the following answers can help:
Could you please explain why we need encode_memory
The encode_memory
function is responsible for mapping a variable's size (red) and offsets (yellow) into vocab ids. Together with var_loc_in_func
(green), it implements the encoding part of Figure 4 in the paper.
This function is required because Transformers have a fixed vocab size. Instead of directly passing mems
to the model, we need an upper limit MAX_STACK_SIZE
and convert the ones exceeding it to a special token <unk>
and special id (which is 3 in this case).
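If it helps, here is a minimal sketch of that mapping as I understand it from this answer (constant names are mine; the real implementation lives in the repo):

```python
MAX_STACK_SIZE = 1024  # upper limit on encodable sizes/offsets
UNK_ID = 3             # id of the <unk> special token
SPECIAL_OFFSET = 3     # number of reserved special-token ids

def encode_memory(mems):
    # shift every raw size/offset past the special ids;
    # anything at or beyond MAX_STACK_SIZE collapses to <unk>
    return [m + SPECIAL_OFFSET if m < MAX_STACK_SIZE else UNK_ID for m in mems]
```

With this sketch, mems (8, 0) maps to [11, 3], which matches the debug log later in this thread.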
this function, and why does it attempt to compare integers with "&lt;SEP&gt;"?
Since mems
is an int array, the condition if mem == "<SEP>"
is never met and can be safely deleted.
I can't make sense of appending an int token to an array of int tokens instead of an int token.
Could you elaborate more on this question?
What does the 1030 constant do, and why?
A variable can be in a register or on the stack. In order to distinguish register x
from stack position x
, we assign them different ids. In this case, the vocab id range [0, 1027] (1027 = 3 + MAX_STACK_SIZE
) represents stack positions, and [1030, 1030 + &lt;vocab size of registers&gt;] represents register positions.
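Put differently, a hedged sketch of the id layout (constant names are mine; the values come from the answer above):

```python
MAX_STACK_SIZE = 1024
SPECIAL_OFFSET = 3
REG_BASE = MAX_STACK_SIZE + 2 * SPECIAL_OFFSET  # 1030

def encode_position(pos, is_register):
    # stack slots land in [3, 1026]; register ids start at 1030
    return REG_BASE + pos if is_register else SPECIAL_OFFSET + pos
```

For example, register '56' (register vocab id 5) encodes to 1030 + 5 = 1035, and stack slot 56 encodes to 59, matching the Pdb session later in this thread.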
why do we define tokens as X but use them as Y?
We have two Transformer encoders XfmrSequentialEncoder
, responsible for code tokens, and XfmrMemEncoder
, responsible for mem tokens (location, size, offsets). They have separate embeddings and vocabs. The first part on self.word2id
is for the code vocab, while the second part with mem_id
is for the mem vocab.
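In other words, each token type resolves through its own vocab. A toy illustration (the vocab contents here are made up; only the separation is the point):

```python
# hypothetical miniature vocabs; code tokens and mem tokens never share ids
code_word2id = {"<pad>": 0, "<s>": 1, "</s>": 2, "<unk>": 3, "if": 4, "return": 5}
mem_word2id = {"<pad>": 0, "<s>": 1, "</s>": 2, "<unk>": 3, "56": 5, "64": 6}

def lookup(vocab, tok):
    # unknown tokens fall back to <unk>, as in both real vocabs
    return vocab.get(tok, vocab["<unk>"])
```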
Feel free to follow up if you have more questions!
@qibinc , Sir, I can't get a few things.
1) Do we always need to adjust each token by 3 (the count of special tokens) in the token list?
2) Does the second "block" of tokens (registers) need to be adjusted TWICE?
3) Why can't we just let the fixed-size register range come first, [3, 3 + len(vocab.regs)],
and the memory (stack) range come second, [len(vocab.regs) + (3 or 3+3?), len(vocab.regs) + (3 or 3+3?) + MAX_STACK_SIZE]?
4) For reg_name:56, reg_id:5, which one do we adjust by 3? Does it depend on the encoder (56 for the code vocab and 5+3 for the mem vocab)?
5) Is this what is expected?
============
New loc:Reg 56; src_var.typ.size:8 src_var.typ.start_offsets():(0,)
calculating variable location, loc__:Reg 56
variable type is register , will adjust position by MAX_STACK_SIZE+3=1027
reg_name:56, reg_id:5
mems:(8, 0)
ret:[11, 3]
vloc:1035
tmem:[11, 3]
var_sequence:[1035, 11, 3]
New loc:Stk 0xa0; src_var.typ.size:144 src_var.typ.start_offsets():(0, 8, 16, 24, 28, 32, 36, 40, 48, 56, 64, 72, 88, 104, 120)
calculating variable location, loc__:Stk 0xa0
variable type is stack, will adjust position by 0+3=3
VocabEntry.MAX_STACK_SIZE:1024, stack_start_pos:216, offset:160
mems:(144, 0, 8, 16, 24, 28, 32, 36, 40, 48, 56, 64, 72, 88, 104, 120)
ret:[147, 3, 11, 19, 27, 31, 35, 39, 43, 51, 59, 67, 75, 91, 107, 123]
vloc:59
tmem:[147, 3, 11, 19, 27, 31, 35, 39, 43, 51, 59, 67, 75, 91, 107, 123]
var_sequence:[59, 147, 3, 11, 19, 27, 31, 35, 39, 43, 51, 59, 67, 75, 91, 107, 123]
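Reading the stack case in this log, the vloc arithmetic appears to be the following (my reconstruction from the printed numbers, not the repo code):

```python
MAX_STACK_SIZE = 1024
SPECIAL_OFFSET = 3
UNK_ID = 3

def stack_vloc(stack_start_pos, offset):
    # relative slot from the frame start, shifted past the special ids;
    # slots beyond MAX_STACK_SIZE would collapse to <unk>
    pos = stack_start_pos - offset
    return pos + SPECIAL_OFFSET if pos < MAX_STACK_SIZE else UNK_ID
```

stack_start_pos 216 with offset 160 gives slot 56 and vloc 59, as printed above.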
6) Why can't we just redefine the special tokens as negative values, to avoid incrementing every value in the list?
7) The comments here talk about a relative position for mem/stack and an absolute location for regs. Generally, this value is just a position/offset adjusted by a container-type constant, right? And why don't we check the bounds of the stack? The check may hold here, but a variable's total size (in the case of an array or structure) could overflow this limit, and then we would have offsets that would be interpreted as registers?
Where is the code that processes these encoded tensors?
Wow, this is real hardcore, Sir.
@qibinc
Sir, I need to refactor the code to be able to launch it on a very specific AMD GPU. Can you tell me if this code would be logically correct? The difference is in the accuracy() method, which behaves a bit differently than the old one.
Thanks.
def tmaxu(t1, t2):
    # num_classes heuristic: the larger unique-value count of the two tensors, at least 2
    tm1, tm2 = t1.unique().numel(), t2.unique().numel()
    return max(tm1, tm2, 2)
def _shared_epoch_end(self, outputs, prefix):
    final_ret = {}
    if self.retype:
        ret = self._shared_epoch_end_task(outputs, prefix, "retype")
        final_ret = {**final_ret, **ret}
    if self.rename:
        ret = self._shared_epoch_end_task(outputs, prefix, "rename")
        final_ret = {**final_ret, **ret}
    if self.retype and self.rename:
        # Evaluate rename accuracy on correctly retyped samples
        retype_preds = torch.cat([x["retype_preds"] for x in outputs])
        retype_targets = torch.cat([x["retype_targets"] for x in outputs])
        rename_preds = torch.cat([x["rename_preds"] for x in outputs])
        rename_targets = torch.cat([x["rename_targets"] for x in outputs])
        binary_mask = retype_preds == retype_targets
        if binary_mask.sum() > 0:
            p_t = rename_preds[binary_mask]
            t_t = rename_targets[binary_mask]
            self.log(
                f"{prefix}_rename_on_correct_retype_acc",
                accuracy(
                    p_t,
                    t_t,
                    task="multiclass",
                    num_classes=tmaxu(p_t, t_t),
                ),
            )
    return final_ret
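As a sanity check on the masking step, the same logic can be exercised with plain lists (hypothetical values, no torch needed):

```python
# rename accuracy restricted to samples whose type prediction was correct
retype_preds   = [1, 2, 3, 2]
retype_targets = [1, 0, 3, 2]
rename_preds   = [7, 8, 9, 9]
rename_targets = [7, 5, 4, 9]

mask = [p == t for p, t in zip(retype_preds, retype_targets)]  # keep correctly retyped
p_t = [p for p, keep in zip(rename_preds, mask) if keep]
t_t = [t for t, keep in zip(rename_targets, mask) if keep]
acc = sum(p == t for p, t in zip(p_t, t_t)) / len(p_t)
```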
Wohoooo! Seems it is working?
@qibinc Thanks
One more question: how should I average the name and type predictions? Just (name_loss + type_loss) / 2, or something else?
Wow, what a dirty trick!
(Pdb) model.vocab.regs.id2word {0: '<pad>', 1: '<s>', 2: '</s>', 3: '<unk>', 4: '<IDENTITY>', 5: '56', 6: '64', 7: '16', 8: '8', 9: '24', 10: '72', 11: '80', 12: '32', 13: '496', 14: '104', 15: '48', 16: '120', 17: '112', 18: '512', 19: '128', 20: '528', 21: '544', 22: '560', 23: '88', 24: '576', 25: '592', 26: '608', 27: '40', 28: '440', 29: '432', 30: '96', 31: '1', 32: '1280', 33: '424', 34: '472', 35: '448', 36: '464', 37: '480', 38: '456', 39: '5', 40: '0', 41: '1288', 42: '624', 43: '1296', 44: '640', 45: '656', 46: '672', 47: '688', 48: '2', 49: '1284', 50: '144', 51: '704', 52: '192', 53: '720', 54: '736', 55: '1312', 56: '400', 57: '1344', 58: '1328', 59: '1360', 60: '1376', 61: '1408', 62: '1392', 63: '1424', 64: '3', 65: '1440', 66: '1456', 67: '1472', 68: '1488', 69: '1504', 70: '1520', 71: '1536', 72: '1552', 73: '1568', 74: '1584', 75: '1600', 76: '1616', 77: '1632', 78: '1648', 79: '1664', 80: '1696', 81: '176', 82: '368', 83: '384', 84: '1680', 85: '1712', 86: '1728', 87: '1760', 88: '1792', 89: '1824', 90: '500'}
{"name":"openWrite",
"code_tokens":["__int64","__fastcall","openWrite","(","const","char","*","@@a1@@",",","int","@@a2@@",")","{","int","@@v3@@",";","int","@@oflag@@",";","if","(","@@a2@@",")","@@oflag@@","=","Number",";","else","@@oflag@@","=","Number",";","@@v3@@","=","open","(","@@a1@@",",","@@oflag@@",",","Number","L",")",";","if","(","@@v3@@","<","Number",")","errnoAbort","(","String",",","@@a1@@",")",";","return","(","unsigned","int",")","@@v3@@",";","}"],
"source":
{
"s8":{"t":{"T":1,"n":"int","s":4},"n":"v3","u":false},
"s4":{"t":{"T":1,"n":"int","s":4},"n":"oflag","u":true},
"r56":{"t":{"T":3,"t":"const char"},"n":"a1","u":false},
"r64":{"t":{"T":1,"n":"int","s":4},"n":"a2","u":false}},
"target":{"s8":{"t":{"T":1,"n":"int","s":4},"n":"fd","u":true},
"s4":{"t":{"T":1,"n":"int","s":4},"n":"flags","u":true},
"r56":{"t":{"T":3,"t":"char"},"n":"fname","u":false},
"r64":{"t":{"T":1,"n":"int","s":4},"n":"append","u":false}},
"test_meta":{"function_name_in_train":false,"function_body_in_train":false}
}
dict_keys([
'index',
'src_code_tokens',
'variable_mention_to_variable_id',
'variable_mention_mask',
'variable_mention_num',
'variable_encoding_mask',
'target_type_src_mems',
'src_type_id',
'target_mask',
'target_submask',
'target_type_sizes'
])
(Pdb) model.vocab.names.id2word[5] = '' # first non-spec elem in names vocab starts on offset +5
(Pdb) model.vocab.types.id2word[7] = '__int64' # first non-spec elem in types vocab starts on offset +7
(Pdb) input_dict['index'] = [
['74a2c0823cd15148ca542a6e3350cd617789ffb2e3c1f9a424e72190e3975875', 'openWrite', '@@a1@@'],
['74a2c0823cd15148ca542a6e3350cd617789ffb2e3c1f9a424e72190e3975875', 'openWrite', '@@a2@@'],
['74a2c0823cd15148ca542a6e3350cd617789ffb2e3c1f9a424e72190e3975875', 'openWrite', '@@v3@@'],
['74a2c0823cd15148ca542a6e3350cd617789ffb2e3c1f9a424e72190e3975875', 'openWrite', '@@oflag@@']]
(Pdb) input_dict['src_code_tokens'][0].numel() = 87
input_dict['src_code_tokens'][0] =
tensor(
[
1, 2069, 2008, 2012, 2010, 2063, 3251, 3877, 1995, 2088, 2046, 2001,
9917, 1226, 2007, 2021, 9917, 402, 1996, 2019, 2021, 9917, 1263, 1997,
2021, 9917, 1316, 1997, 2029, 1995, 9917, 402, 1996, 9917, 1316, 2009,
2004, 1997, 2082, 9917, 1316, 2009, 2004, 1997, 9917, 1263, 2009, 3251,
1995, 9917, 1226, 2007, 9917, 1316, 2007, 2004, 2027, 1996, 1997, 2029,
1995, 9917, 1263, 2038, 2004, 1996, 9917, 2506, 9981, 3733, 1995, 2065,
2007, 9917, 1226, 1996, 1997, 2049, 1995, 2036, 2021, 1996, 9917, 1263,
1997, 2020, 2
]
)
(Pdb) input_dict['variable_mention_to_variable_id'][0].numel() = 87
(Pdb) input_dict['variable_mention_to_variable_id'][0] =
tensor(
[
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 2, 0,
0, 0, 3, 0, 0, 0, 0, 1, 0, 0, 3, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 2, 0, 0,
0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0
]
)
(Pdb) input_dict['variable_mention_mask'][0].numel() = 87
(Pdb) input_dict['variable_mention_mask'][0] =
tensor(
[
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 1.,
0., 0., 0., 0., 1., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 1., 0.,
0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 1.,
0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.
]
)
(Pdb) input_dict['variable_mention_num'] = tensor([ [3., 2., 4., 4.] ])
(Pdb) input_dict['variable_encoding_mask'] = tensor([ [1., 1., 1., 1.] ])
(Pdb) input_dict['target_type_src_mems'] =
tensor(
[ #3d (func)
[ #2d (var)
[1035, 11, 3], #1d (var_mem_repr) offset if < 1024+3 else reg_num+3+3, size+3, field_offset+3
[1036, 7, 3],
[ 3, 7, 3],
[ 7, 7, 3]
]
]
)
(Pdb) model.vocab.regs.word2id['56'] = 5
(Pdb) model.vocab.regs.word2id['64'] = 6
# so, 1035-1024 = 11; 11 - 3 - 3 = 5; 5 == 5;
# and 1036-1024 = 12; 12 - 3 - 3 = 6; 6 == 6;
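The two comment lines above can be checked as a round trip (REG_BASE = 1030, per the earlier answer about the 1030 constant):

```python
REG_BASE = 1030  # MAX_STACK_SIZE + 3 special ids + 3 shift

def reg_id_from_vloc(vloc):
    # invert vloc = REG_BASE + reg_id for register locations
    return vloc - REG_BASE
```

So reg '56' (id 5) corresponds to vloc 1035, and reg '64' (id 6) to vloc 1036.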
(Pdb) input_dict['src_type_id'] = tensor(
[
[9, 5, 5, 5] # vars type ids
]
)
(Pdb) model.vocab.types.word2id['const char *'] = 9 # "r56":{"t":{"T":3,"t":"const char"},"n":"a1","u":false},
(Pdb) model.vocab.types.word2id['int'] = 5 # "r64":{"t":{"T":1,"n":"int","s":4},"n":"a2","u":false}},
# src_type_id --> model.vocab.types
Hello. Thanks for your kindness in sharing such a good project.
Could you please explain why we need encode_memory
this function, and why does it attempt to compare integers with "&lt;SEP&gt;"?
I can't make sense of appending an int token to an array of int tokens instead of an int token.
And my second question is about this one:
what and why is 1030 constant do?
And in general, why do we define tokens like this:
but use them like this:
Sorry if my questions are too many; I specialize in system programming, and math with ML is a hobby. Thanks in advance =)
@pcyin @qibinc @jlacomis