Closed andreaskoepf closed 1 year ago
I don't think there was any specific reason for that; it was probably left over from early weight-loading tests. Apparently it now actually breaks something else. I will run local tests to make sure setting it to 128 doesn't break anything, in which case I would be happy to stick with 128.
In weights2megatron.py for llama models,
`args.make_vocab_size_divisible_by = 1`
is set, and this setting seems to remain in effect all the way through conversion to hf (here are args of a 70b). This differs from falcon models, which actually use a padded vocabulary with dummy tokens; that padding also carries over to the huggingface export, e.g. see OpenAssistant/falcon-40b-megacode2-oasst/blob/main/config.json#L24. Normally the size is padded to a value divisible by 128 to improve model efficiency. Does this padding not have a beneficial effect for llama2? Was there a good reason not to use the megatron default value of 128?