OFA-Sys / OFA

Official repository of OFA (ICML 2022). Paper: OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework
Apache License 2.0

add_tokens for tokenizer #382

Closed SilyRab closed 1 year ago

SilyRab commented 1 year ago

Why add these tokens to the tokenizer?

```python
tokenizer.add_tokens(["<code_{}>".format(i) for i in range(8192)])
tokenizer.add_tokens(["<bin_{}>".format(i) for i in range(1000)])
```
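For context, the two calls above just extend the vocabulary with two families of special tokens. A minimal sketch of what gets added (the token strings come from the snippet; what each family is used for is an assumption based on the discussion below):

```python
# Build the two token families exactly as in the snippet above.
# <code_i> tokens: assumed to represent discrete image codes.
# <bin_i> tokens: assumed to represent quantized coordinate bins.
code_tokens = ["<code_{}>".format(i) for i in range(8192)]
bin_tokens = ["<bin_{}>".format(i) for i in range(1000)]

# These lists would then be passed to tokenizer.add_tokens(...) so the
# model can emit them like ordinary vocabulary items.
print(code_tokens[0], code_tokens[-1])   # <code_0> <code_8191>
print(bin_tokens[0], bin_tokens[-1])     # <bin_0> <bin_999>
```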

huizyuan commented 1 year ago

I think the code tokens represent image patches. But where does the number of tokens (i.e., 8192) come from? And how is an image patch (16×16 pixels) embedded into a single code?

SilyRab commented 1 year ago

It seems more likely to be about code generation. I have found some clues in the code.

logicwong commented 1 year ago

@huizyuan We use VQGAN to transform the raw images into codes. 8192 is the number of codebook entries.
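To illustrate the idea: a VQGAN maps each image patch to a continuous embedding, then snaps it to the nearest entry in a learned codebook of 8192 vectors; the index of that entry is the patch's discrete code (which becomes a `<code_i>` token). A minimal NumPy sketch of the nearest-neighbor quantization step (illustrative only, not OFA's actual VQGAN implementation):

```python
import numpy as np

def quantize(patch_embeddings, codebook):
    """Map each patch embedding to the index of its nearest codebook entry.

    patch_embeddings: (num_patches, dim) continuous encoder outputs.
    codebook:         (num_codes, dim), e.g. num_codes = 8192 in OFA.
    Returns one integer code per patch, usable as a <code_i> token id.
    """
    # Squared Euclidean distance from every patch to every codebook entry.
    dists = ((patch_embeddings[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)

# Toy example: a 3-entry codebook in 2-D.
codebook = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
patches = np.array([[0.9, 0.1], [0.05, 0.95]])
print(quantize(patches, codebook))  # [1 2]
```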
