As suggested by @andreaskoepf, we should pad the vocab size to be divisible by 128 by default, unless there is a good reason not to. This commit does that. I also verified it locally by running tests/test_llama_weights.py, so the weight conversion path meta -> megatron -> shard -> unshard -> huggingface should still work without issues.
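For reference, the padding boils down to something like the sketch below (illustrative only; `pad_vocab_size` is a hypothetical name, not necessarily what the code uses):

```python
def pad_vocab_size(vocab_size: int, multiple: int = 128) -> int:
    """Round vocab_size up to the nearest multiple of `multiple` (default 128)."""
    return ((vocab_size + multiple - 1) // multiple) * multiple

assert pad_vocab_size(32000) == 32000   # already divisible by 128
assert pad_vocab_size(32001) == 32128   # padded up to the next multiple
```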
Any other change I should consider before merging, @andreaskoepf? Maybe something in the megatron -> huggingface conversion for larger models?