SilyRab closed 1 year ago
I think the code token refers to the token of an image patch. I also wonder where the number of tokens (i.e., 8192) comes from, and how an image patch (16*16 pixels) is embedded into a single code.
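One common way a patch becomes a single code is vector quantization, as in VQ-VAE-style image tokenizers: an encoder maps each patch to a feature vector, and the index of the nearest entry in a learned codebook is the code. The sketch below is a minimal illustration under that assumption, not this repository's implementation. The toy codebook, its size, and the function name `nearest_code` are hypothetical. With a codebook of 8192 entries, the index would range over 0..8191.

```python
def nearest_code(feature, codebook):
    """Return the index of the codebook vector closest to `feature` (squared L2 distance)."""
    best_index, best_dist = 0, float("inf")
    for index, entry in enumerate(codebook):
        dist = sum((f - e) ** 2 for f, e in zip(feature, entry))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


# Toy codebook: 4 entries of dimension 2 (assumption: a real codebook is learned and much larger).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Stand-in for the encoder output of one image patch (hypothetical values).
patch_feature = [0.9, 0.2]

code = nearest_code(patch_feature, codebook)
token = "<code_{}>".format(code)  # same naming pattern as the tokens in this thread
```

Under this reading, the number of code tokens would simply equal the codebook size, which would explain a fixed count such as 8192.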
It is more likely about code generation. I found some clues in the code.
Why are these tokens added to the tokenizer?
tokenizer.add_tokens(["<code_{}>".format(i) for i in range(8192)])
tokenizer.add_tokens(["<bin_{}>".format(i) for i in range(1000)])
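To see what these two calls contribute, the sketch below builds the same token strings without any tokenizer library and counts them. The interpretation in the comments (discrete image codes and quantized coordinate bins) is an assumption based on the token names, and `coord_to_bin_token` is a hypothetical helper, not a function from the repository.

```python
# The same token strings that the two add_tokens calls pass to the tokenizer.
code_tokens = ["<code_{}>".format(i) for i in range(8192)]
bin_tokens = ["<bin_{}>".format(i) for i in range(1000)]


def coord_to_bin_token(value, num_bins=1000):
    """Map a normalized coordinate in [0, 1] to a bin token.

    Hypothetical helper: assumes <bin_i> tokens stand for quantized positions.
    """
    if not 0.0 <= value <= 1.0:
        raise ValueError("value must be in [0, 1]")
    # Clamp so that value == 1.0 falls into the last bin instead of overflowing.
    index = min(int(value * num_bins), num_bins - 1)
    return "<bin_{}>".format(index)


# Total number of new vocabulary entries introduced by the two calls.
new_vocab_entries = len(code_tokens) + len(bin_tokens)
```

Whatever their meaning, the two calls together add 9192 new entries to the vocabulary, so the model's embedding table has to be resized to match.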