Additional findings:
It seems that tools/checkpoint/convert.py calls a function from tools/checkpoint/loader_llama_mistral.py to load the HF-format checkpoint, and in the llama3 case it uses Llama3Tokenizer to get the true vocab size, as shown in lines 557-563 of tools/checkpoint/loader_llama_mistral.py. I checked the Llama3Tokenizer definition, which comes from https://github.com/meta-llama/llama3/blob/main/llama/tokenizer.py, and did not find a vocab_size attribute; the attribute that appears to carry the same meaning is n_words, defined at line 86 of that file.
So I changed md.true_vocab_size = tokenizer.vocab_size to md.true_vocab_size = tokenizer.n_words, and the checkpoint then converted to mcore format successfully.
But I'm still not sure whether this is a bug in the loader or whether something I did wrong caused the llama3-8B conversion to fail.
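For reference, here is a standalone sketch of the logic the patched lines end up performing. This is not the actual loader code; the import alias and the path are assumptions taken from my own setup:

```python
# Minimal standalone sketch (assumed paths), mirroring what the patched loader
# lines do: build the Llama 3 tokenizer and read its vocabulary size.
from llama.tokenizer import Tokenizer as Llama3Tokenizer  # from the llama3-0.0.1 wheel

TOKENIZER_MODEL = "/workspace/model_weights/llama3-8b/original/tokenizer.model"

tokenizer = Llama3Tokenizer(model_path=TOKENIZER_MODEL)

# The llama3 Tokenizer defines no `vocab_size` attribute (hence the AttributeError);
# `n_words` holds the vocabulary size instead.
true_vocab_size = tokenizer.n_words
print(true_vocab_size)  # 128256 is expected for Llama 3
```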
I have recently run into some problems similar to yours, and my conclusion is that the tokenizer class shipped with Llama and the attribute Megatron's loader expects are not consistent.
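To make that mismatch concrete: the loader reads tokenizer.vocab_size, while the llama3 Tokenizer only provides n_words. The adapter below is purely hypothetical (it is not part of Megatron-LM or the llama3 package) and only illustrates the inconsistency:

```python
# Hypothetical adapter (illustration only, not Megatron or llama3 code): expose
# the `vocab_size` attribute the loader reads as an alias for `n_words`.
from llama.tokenizer import Tokenizer as Llama3Tokenizer


class Llama3TokenizerAdapter(Llama3Tokenizer):
    """Llama 3 tokenizer that also answers to `vocab_size`."""

    @property
    def vocab_size(self) -> int:
        # Alias for the attribute the llama3 Tokenizer actually defines.
        return self.n_words


if __name__ == "__main__":
    tok = Llama3TokenizerAdapter(
        model_path="/workspace/model_weights/llama3-8b/original/tokenizer.model"
    )
    print(tok.vocab_size == tok.n_words)  # True
```

Since the loader constructs the tokenizer itself, this does not fix the conversion on its own; changing the loader line as described above is the practical route.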
Describe the bug
I get an AttributeError when trying to convert the llama3-8B model from HF format to mcore format; the error is below:
AttributeError: 'Tokenizer' object has no attribute 'vocab_size'
To Reproduce
pip install -e .  # install the llama3-0.0.1 wheel, following llama_mistral.md
PP=2
MODEL_SIZE=llama3-8B
HF_FORMAT_DIR=/workspace/model_weights/llama3-8b
MEGATRON_FORMAT_DIR=${HF_FORMAT_DIR}-tp${TP}-pp${PP}
TOKENIZER_MODEL=${HF_FORMAT_DIR}/original/tokenizer.model
python tools/checkpoint/convert.py \
    --model-type GPT \
    --loader llama_mistral \
    --saver mcore \
    --checkpoint-type hf \
    --model-size ${MODEL_SIZE} \
    --load-dir ${HF_FORMAT_DIR} \
    --save-dir ${MEGATRON_FORMAT_DIR} \
    --tokenizer-model ${TOKENIZER_MODEL} \
    --target-tensor-parallel-size ${TP} \
    --target-pipeline-parallel-size ${PP} \
    --bf16