HuangLK / transpeeder

Train LLaMA on a single A100 80G node using 🤗 transformers and 🚀 DeepSpeed pipeline parallelism.
Apache License 2.0

Why do we need to add 1 to the vocab_size when constructing the model? #30


forceshorty commented 1 year ago

https://github.com/HuangLK/llama-deepspeed/blob/faedea514b11c18c695e1b2a6adb63b102ef001c/models/llama_pipeline_model.py#LL159C33-L159C33


HuangLK commented 1 year ago

https://github.com/HuangLK/llama-deepspeed/blob/faedea514b11c18c695e1b2a6adb63b102ef001c/scripts/convert2ckpt.py#L65 — the pad_token is hard-coded here, which is why vocab_size is increased by 1 when the model is constructed.
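
For anyone else landing here, this is only a minimal sketch of the usual pattern (not the repo's actual convert2ckpt.py code): LLaMA ships without a pad token, so one is added to the tokenizer and the embedding matrix has to grow by one row, which is the `+1` the pipeline model must account for. `MODEL_PATH` is a placeholder.

```python
# Illustrative sketch only: why the vocabulary grows by one
# when a pad token is hard-coded into the tokenizer.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "path/to/llama"  # placeholder, not a real path from this repo

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(MODEL_PATH)

orig_vocab_size = model.config.vocab_size              # e.g. 32000 for LLaMA
tokenizer.add_special_tokens({"pad_token": "[PAD]"})   # hard-coded pad token

# The new token needs its own embedding row, so the model is resized to
# orig_vocab_size + 1 — the same "+1" the pipeline model uses at construction.
model.resize_token_embeddings(len(tokenizer))
assert model.config.vocab_size == orig_vocab_size + 1
```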

forceshorty commented 1 year ago

Thank you for your answer. I have another question: why is vocab_size not increased by 1 in the convert2hf.py script? The original vocab_size is used there: https://github.com/HuangLK/llama-deepspeed/blob/faedea514b11c18c695e1b2a6adb63b102ef001c/scripts/convert2hf.py#L43
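
To make the question concrete, here is a hedged sketch (not the repo's convert2hf.py logic) of one way a checkpoint trained with the extra pad row could be exported back at the original vocab_size, by dropping the trailing row from the vocabulary-sized tensors. The tensor keys follow the Hugging Face LLaMA naming, and the paths and vocab size are assumptions.

```python
# Hypothetical sketch: shrink a (vocab_size + 1)-row checkpoint back to the
# original LLaMA vocab_size when exporting to Hugging Face format.
import torch

ORIG_VOCAB_SIZE = 32000  # assumed original LLaMA vocab size

state_dict = torch.load("pipeline_ckpt.pt", map_location="cpu")  # placeholder path

for key in ("model.embed_tokens.weight", "lm_head.weight"):
    if key in state_dict and state_dict[key].shape[0] == ORIG_VOCAB_SIZE + 1:
        # Drop the trailing [PAD] row so the exported weights match the
        # original config.vocab_size used when converting back to HF.
        state_dict[key] = state_dict[key][:ORIG_VOCAB_SIZE].clone()

torch.save(state_dict, "hf_ready_ckpt.pt")  # placeholder output path
```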