facebookresearch / metaseq

Repo for external large-scale work

Why is the "vocab_size" in the config file 50272 while len(tokenizer) is 50265? #469

Open Zcchill opened 1 year ago

Zcchill commented 1 year ago

🐛 Bug

The "vocab_size" in config file is 50272 but the len(tokenizer) is 50265, they not match eacch other. ### To Reproduce Steps to reproduce the behavior (**always include the command you ran**): 1. Run cmd '....' 2. See error None #### Code sample model.resize_token_embeddings(len(tokenizer)) ### Expected behavior The results seem good when I use the code abbove to align to tokenizer, but I just wonder why the vocab size for training is 50272, did I miss some important parameter? ### Environment - metaseq Version (e.g., 1.0 or master): - PyTorch Version (e.g., 1.0) - OS (e.g., Linux, Windows, MacOS): - How you installed metaseq (`pip`, source): - Build command you used (if compiling from source): - Python version: - CUDA/cuDNN version: - GPU models and configuration: - Any other relevant information: ### Additional context
suchenzang commented 1 year ago

The saved tokenizer has length 50265, but we then add 4 special tokens: https://github.com/facebookresearch/metaseq/blob/e2df6a021cc5ee024533427ae476ce29cdb65b66/metaseq/tasks/streaming_language_modeling.py#L158, which gives a dictionary vocab size of 50269 at that point. This is followed by a pad_to_multiple(8): https://github.com/facebookresearch/metaseq/blob/e2df6a021cc5ee024533427ae476ce29cdb65b66/metaseq/tasks/streaming_language_modeling.py#L169, which is why the vocab size ends up being 50272.
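A minimal sketch of that arithmetic; pad_to_multiple below is an illustrative helper, not metaseq's actual implementation:

def pad_to_multiple(size, multiple):
    # Round size up to the next multiple of `multiple` (illustrative helper).
    return ((size + multiple - 1) // multiple) * multiple

base = 50265               # len(tokenizer): the saved BPE vocabulary
with_specials = base + 4   # 4 special tokens added in streaming_language_modeling.py
padded = pad_to_multiple(with_specials, 8)

print(with_specials)  # 50269
print(padded)         # 50272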

Zcchill commented 1 year ago

@suchenzang - Thank you for your answer! It seems the 4 special tokens are already among the 50265 tokens, so it looks like only the pad_to_multiple(8) step takes the vocab size from 50265 to 50272. What I mean is: are the extra ids (50265 through 50271) all "madeupword" tokens?

# Assumes: transformers is installed and a CUDA-capable GPU is available.
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained('facebook/opt-125m', use_fast=False)
model = AutoModelForCausalLM.from_pretrained('facebook/opt-125m', cache_dir='/ssdwork/cache/').cuda()

# Generation with the original 50272-row embedding matrix.
all_text = 'Which poem is the best one, and please write it to me.'
input_ids = tokenizer(all_text, return_tensors="pt").input_ids.cuda()
outputs = model.generate(input_ids, do_sample=False, max_length=256, num_beams=1)
output_decode = tokenizer.batch_decode(outputs, skip_special_tokens=True)
output_decode = output_decode[0]
print(output_decode)  # result: Which poem is the best one, and please write it to me.\nI'm not sure, but I think it's the one by the author of the poem.

# Shrink the embedding matrix to len(tokenizer) == 50265 and generate again.
model.resize_token_embeddings(len(tokenizer))
all_text = 'Which poem is the best one, and please write it to me.'
input_ids = tokenizer(all_text, return_tensors="pt").input_ids.cuda()
outputs = model.generate(input_ids, do_sample=False, max_length=256, num_beams=1)
output_decode = tokenizer.batch_decode(outputs, skip_special_tokens=True)
output_decode = output_decode[0]
print(output_decode)  # result: Which poem is the best one, and please write it to me.\nI'm not sure, but I think it's the one by the author of the poem.
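A quick way to check (a sketch, assuming the Hugging Face facebook/opt-125m checkpoint; the extra rows cannot be decoded by the HF tokenizer, so the "madeupword" interpretation comes from the metaseq dictionary explanation above, not from this code):

# Sketch: compare the tokenizer size with the model's embedding rows.
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m", use_fast=False)
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

print(len(tokenizer))                                # 50265
print(model.get_input_embeddings().weight.shape[0])  # 50272
# The 7 extra rows (ids 50265-50271) have no entry in the HF tokenizer; per the
# explanation above they come from the special-token and pad_to_multiple(8) steps
# in metaseq's dictionary, so the tokenizer never emits those ids.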
baiyuting commented 1 year ago

I have the same question. Also, would it be OK to use a RoBERTa tokenizer instead?

Gusicun commented 1 year ago

Same question. Will it cause an index error?