SkanderBS2024 opened this issue 3 months ago
Hello,
After pre-processing the dataset with a BPE tokenizer, I get an error when I launch the `train.sh` script for Mamba.
The script mentions that I have to specify the tokenizer path. I passed the path to `tokenizer.json`, but when inspecting the code it expects `vocab.json` and `merges.txt` files. Does anyone have an idea about the path I should give as an argument?
This is the command that launches the training script:
/workspace/megatron/examples/mamba# ./train.sh /workspace/megatron/examples/mamba/dataset/CleanData2/preprocessed2_text_document /workspace/megatron/examples/mamba/dataset/mamba_tokenizer/tokenizer.json
Assuming that CleanData2 contains the processed .bin and .idx data, and that mamba_tokenizer contains:
Have you solved it? I also have this problem.
@tbsxxxH
I've temporarily hard-coded the paths here:
How do I do that specifically? I changed it to the following but still got an error:
assert args.vocab_file is not None
assert args.merge_file is not None
tokenizer = _GPT2BPETokenizer('/home/eva.liu/Megatron-LM/vocab.json', '/home/eva.liu/Megatron-LM/merges.txt')
Remove the assertions, declare a variable for each path, and pass them as parameters to `_GPT2BPETokenizer`.
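In other words, the attempt above still trips the assertions because `args.vocab_file` / `args.merge_file` stay `None` when `train.sh` only passes `--tokenizer-model`. A minimal sketch of the temporary hack inside Megatron's `build_tokenizer()` (GPT2BPETokenizer branch); the two paths are placeholders for your own files:

```python
# Sketch of the temporary hack in build_tokenizer() -- GPT2BPETokenizer branch.
# The original assertions are dropped because train.sh only passes
# --tokenizer-model, so args.vocab_file and args.merge_file are None.
vocab_file = "/path/to/vocab.json"    # placeholder path
merge_file = "/path/to/merges.txt"    # placeholder path
tokenizer = _GPT2BPETokenizer(vocab_file, merge_file)
```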
Thank you, the problem is solved.
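For reference, instead of hard-coding paths, the separate `vocab.json` and `merges.txt` can usually be exported from the `tokenizer.json` itself, assuming it was produced by the Hugging Face `tokenizers` library with a BPE model (a minimal sketch; the paths are just examples):

```python
# Sketch: recover vocab.json / merges.txt from a Hugging Face "tokenizer.json".
# Assumes the underlying model is BPE; paths are examples only.
from tokenizers import Tokenizer

tok = Tokenizer.from_file("dataset/mamba_tokenizer/tokenizer.json")

# For a BPE model, Model.save() writes vocab.json and merges.txt into the
# target directory and returns their paths.
print(tok.model.save("dataset/mamba_tokenizer"))
```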
Hi, I think Mamba is using GPTSentencePieceTokenizer, as described in train.sh:
--tokenizer-type GPTSentencePieceTokenizer \
--tokenizer-model ${TOKENIZER_PATH} \
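So if you would rather not patch the code, TOKENIZER_PATH should point at a SentencePiece `.model` file rather than a `tokenizer.json`. A minimal sketch of training one with the `sentencepiece` package (the corpus path, vocab size, and output prefix are placeholders):

```python
# Sketch: produce a SentencePiece model that GPTSentencePieceTokenizer can load.
# Corpus path, vocab size, and output prefix below are placeholders.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="dataset/CleanData2/corpus.txt",            # placeholder raw-text corpus
    model_prefix="dataset/mamba_tokenizer/mamba_sp",  # writes mamba_sp.model / mamba_sp.vocab
    vocab_size=32000,
    model_type="bpe",
)
# Then pass .../mamba_sp.model as the TOKENIZER_PATH argument to train.sh.
```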
Marking as stale. No activity in 60 days.