NVIDIA / Megatron-LM

Ongoing research training transformer models at scale
https://docs.nvidia.com/megatron-core/developer-guide/latest/user-guide/index.html#quick-start

(Pre training mamba with train.sh) Error : GPT2BPETokenizer : assert args.vocab_file is not None #958

Open SkanderBS2024 opened 3 months ago

SkanderBS2024 commented 3 months ago

Hello,

After pre-processing the dataset with a BPE tokenizer, I get this error when I launch the train.sh script for Mamba:

[Screenshot: traceback ending in "GPT2BPETokenizer: assert args.vocab_file is not None"]

The script says the tokenizer path has to be specified, so I passed the path to tokenizer.json, but inspecting the code shows it expects vocab.json and merges.txt files. Does anyone have an idea which path I should give as an argument?

This is the command that launches the training script:

/workspace/megatron/examples/mamba# ./train.sh /workspace/megatron/examples/mamba/dataset/CleanData2/preprocessed2_text_document /workspace/megatron/examples/mamba/dataset/mamba_tokenizer/tokenizer.json

For context: CleanData2 contains the preprocessed .bin and .idx data, and mamba_tokenizer contains:

[Screenshot: listing of the mamba_tokenizer directory]

tbsxxxH commented 3 months ago

> After pre-processing the dataset with a BPE tokenizer, I get this error when I launch the train.sh script for Mamba: [...] Does anyone have an idea which path I should give as an argument?

Have you solved it? I also have this problem.

SkanderBS2024 commented 3 months ago

@tbsxxxH

I've temporarily hard-coded the paths here:

https://github.com/NVIDIA/Megatron-LM/blob/c873429cbaa43257d4d4fc01df2a7a50453b7984/megatron/training/tokenizer/tokenizer.py#L38-L40

tbsxxxH commented 3 months ago

> @tbsxxxH I've temporarily hard-coded the paths here: megatron/training/tokenizer/tokenizer.py#L38-L40

How do you do that specifically? I changed it to the following but still get an error:

        assert args.vocab_file is not None
        assert args.merge_file is not None
        tokenizer = _GPT2BPETokenizer('/home/eva.liu/Megatron-LM/vocab.json', '/home/eva.liu/Megatron-LM/merges.txt')

SkanderBS2024 commented 3 months ago

Remove the assertions, declare a variable for each path, and pass them as parameters to _GPT2BPETokenizer.
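
For reference, a minimal sketch of what that looks like in the GPT2BPETokenizer branch of build_tokenizer() in megatron/training/tokenizer/tokenizer.py (the paths below are placeholders for your own files; this is a paraphrase, not the exact upstream code):

    elif args.tokenizer_type == 'GPT2BPETokenizer':
        # Temporary workaround: drop the --vocab-file / --merge-file assertions
        # and hard-code the tokenizer files instead.
        vocab_file = '/path/to/vocab.json'   # placeholder path
        merge_file = '/path/to/merges.txt'   # placeholder path
        tokenizer = _GPT2BPETokenizer(vocab_file, merge_file)

A cleaner alternative, if you stay with GPT2BPETokenizer, is to pass --vocab-file and --merge-file on the command line instead of patching the tokenizer code.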

tbsxxxH commented 3 months ago

Thank you, the problem is solved.

zixianwang2022 commented 2 months ago

Hi, I think Mamba uses GPTSentencePieceTokenizer, as set in train.sh:

       --tokenizer-type GPTSentencePieceTokenizer \
       --tokenizer-model ${TOKENIZER_PATH} \
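
For reference, that branch of build_tokenizer() only needs the single SentencePiece .model file given via --tokenizer-model, so the vocab/merge assertions never run there; roughly (a sketch, not the exact upstream code):

    elif args.tokenizer_type == 'GPTSentencePieceTokenizer':
        # SentencePiece path: a single .model file is required,
        # and --vocab-file / --merge-file are not used.
        assert args.tokenizer_model is not None
        tokenizer = _GPTSentencePieceTokenizer(args.tokenizer_model)

So if the assertion in the screenshot fires, the run was presumably started with --tokenizer-type GPT2BPETokenizer (e.g. via a modified train.sh), which is the branch that requires vocab.json and merges.txt.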

github-actions[bot] commented 5 days ago

Marking as stale. No activity in 60 days.