Open nakroy opened 1 month ago
I switched to using 'HuggingFaceTokenizer' as the 'tokenizer-type' arg, but there are some other bugs.
The problem is that unique_identifiers is not implemented in Llama3Tokenizer, which does not inherit from MegatronTokenizer. Changing the lines at https://github.com/NVIDIA/Megatron-LM/blob/9bcd4175becc515331537f0c78eb70079de0eaa8/megatron/training/tokenizer/tokenizer.py#L567-L569 to the following should solve the problem:
# json and OrderedDict may already be imported at the top of tokenizer.py
import json
from collections import OrderedDict

class _Llama3Tokenizer(Llama3Tokenizer):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Record constructor arguments, as MegatronTokenizer does
        self.unique_identifiers = OrderedDict()
        self.unique_identifiers["class"] = type(self).__name__
        self.unique_identifiers["tokenizer_path"] = args if len(args) > 0 else ["n/a"]
        for option in kwargs:
            self.unique_identifiers[option] = str(kwargs[option])
        self.unique_description = json.dumps(self.unique_identifiers, indent=4)
Thanks, it works for me. It seems Llama3Tokenizer still needs a few small fixes before it can really be used for finetuning properly...
> I switched to using 'HuggingFaceTokenizer' as the 'tokenizer-type' arg, but there are some other bugs.
I think Llama3Tokenizer is the suitable one for Llama 3 model training, but it has been unstable in my use so far. Or maybe the arguments I set are not proper, because I just changed some arguments from the scripts I used to finetune Llama 2.
Describe the bug
I try to finetune the llama3-8B model with multiple nodes but get an AttributeError after loading the mcore-format checkpoint, when it starts to build datasets. The error is:
AttributeError: '_Llama3Tokenizer' object has no attribute 'unique_identifiers'
To Reproduce
The finetune dataset I use is downloaded from https://huggingface.co/datasets/tatsu-lab/alpaca/blob/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet. I converted it into json format after downloading it. The preprocessing script I used is as follows:
MODEL_PATH=/workspace/model_weights/llama3-8b
TOKENIZER_MODEL=${MODEL_PATH}/original/tokenizer.model
OUTPUT_DIR=/workspace/dataset/finetune_dataset/llama3-8b
OUTPUT_PREFIX=${OUTPUT_DIR}/alpaca
TOKENIZER_TYPE=Llama3Tokenizer

mkdir -p ${OUTPUT_DIR}

python ./tools/preprocess_data.py \
    --input ${INPUT_FILE} \
    --output-prefix ${OUTPUT_PREFIX} \
    --tokenizer-model ${TOKENIZER_MODEL} \
    --workers 4 \
    --log-interval 1000 \
    --tokenizer-type ${TOKENIZER_TYPE} \
    --append-eod
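For reference, tools/preprocess_data.py reads its --input as JSON Lines: one object per line, with the text to tokenize under a "text" key by default (controlled by its --json-keys option). A minimal sketch of emitting one Alpaca-style record in that format (the field names follow the tatsu-lab/alpaca dataset; the issue author's actual parquet-to-json conversion script was not shared):

```python
import json

# One Alpaca-style record (instruction/input/output fields, as in
# tatsu-lab/alpaca); in the real conversion these come from the parquet file.
record = {
    "instruction": "Give three tips for staying healthy.",
    "input": "",
    "output": "1. Eat a balanced diet. 2. Exercise regularly. 3. Sleep enough.",
}

# preprocess_data.py tokenizes the "text" field of each JSON line by
# default, so the prompt fields are joined into a single string here.
line = json.dumps({"text": record["instruction"] + "\n" + record["output"]})
with open("alpaca.json", "w") as f:
    f.write(line + "\n")
```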
Environment (please complete the following information):