cl-tohoku / bert-japanese

BERT models for Japanese text.
Apache License 2.0

Fine tune #8

Closed nuwanq closed 4 years ago

nuwanq commented 4 years ago

Do you have a guide for fine-tuning bert-japanese?

I tried to fine-tune it, and the results were not good. It seems like I did something wrong. Since GPU training is a bit expensive, I would like your opinion before fine-tuning again.

Do I need to separate words using mecab-neologd? Do I need to do anything to the tokenizer before fine-tuning?

singletongue commented 4 years ago

Are you using the correct tokenizer defined in our repository? MecabBertTokenizer, defined in tokenization.py, handles basic tokenization with MeCab.

It would also be helpful if you included some examples of the inputs and outputs you have tried. Thank you.
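
For example, a minimal sketch of using it directly (this assumes tokenization.py is on your path and that MecabBertTokenizer keeps BertTokenizer's from_pretrained interface; the model directory path is a placeholder):

# Minimal sketch: tokenize a sentence with the MeCab-aware tokenizer.
from tokenization import MecabBertTokenizer

# "./your_model_dir" is a placeholder for a directory containing vocab.txt
tokenizer = MecabBertTokenizer.from_pretrained("./your_model_dir")
print(tokenizer.tokenize("日本語のテキストを解析します。"))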

nuwanq commented 4 years ago

Are you using the correct tokenizer defined in our repository? MecabBertTokenizer, defined in tokenization.py, handles basic tokenization with MeCab.

It would also be helpful if you included some examples of the inputs and outputs you have tried. Thank you.

This is the code I used to fine-tune:

import subprocess

cmd = """
python3 run_language_modeling.py
    --train_data_file ./train
    --eval_data_file ./valid
    --output_dir ./test
    --model_type bert
    --model_name_or_path ./org_model
    --mlm
    --config_name ./org_model
    --tokenizer_name ./org_model
    --do_train
    --do_eval
    --line_by_line
    --learning_rate 5e-5
    --num_train_epochs 5
    --save_total_limit 20
    --save_steps 5000
    --per_gpu_train_batch_size 8
    --warmup_steps=5000
    --logging_steps=100
    --gradient_accumulation_steps=4
    --mlm_probability=0.15
    --seed 666
    --block_size=512
""".replace("\n", " ")

# run the assembled command (execution via subprocess is assumed here)
subprocess.run(cmd, shell=True)

and the output gives:

03/27/2020 00:35:43 - INFO - transformers.tokenization_utils -   Model name './org_model' not found in model shortcut name list (bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese, bert-base-german-cased, bert-large-uncased-whole-word-masking, bert-large-cased-whole-word-masking, bert-large-uncased-whole-word-masking-finetuned-squad, bert-large-cased-whole-word-masking-finetuned-squad, bert-base-cased-finetuned-mrpc, bert-base-german-dbmdz-cased, bert-base-german-dbmdz-uncased, bert-base-finnish-cased-v1, bert-base-finnish-uncased-v1, bert-base-dutch-cased). Assuming './org_model' is a path, a model identifier, or url to a directory containing tokenizer files.
03/27/2020 00:35:43 - INFO - transformers.tokenization_utils -   Didn't find file ./org_model/added_tokens.json. We won't load it.
03/27/2020 00:35:43 - INFO - transformers.tokenization_utils -   Didn't find file ./org_model/special_tokens_map.json. We won't load it.
03/27/2020 00:35:43 - INFO - transformers.tokenization_utils -   Didn't find file ./org_model/tokenizer_config.json. We won't load it.
03/27/2020 00:35:43 - INFO - transformers.tokenization_utils -   loading file ./org_model/vocab.txt
03/27/2020 00:35:43 - INFO - transformers.tokenization_utils -   loading file None
03/27/2020 00:35:43 - INFO - transformers.tokenization_utils -   loading file None
03/27/2020 00:35:43 - INFO - transformers.tokenization_utils -   loading file None

This means I didn't use MecabBertTokenizer. I searched a bit but couldn't find how to use it. How can I use tokenization.py to fine-tune?

Thank you very much for the help.

singletongue commented 4 years ago

Could you try again with --model_name_or_path bert-base-japanese --config_name bert-base-japanese --tokenizer_name bert-base-japanese?

As you may know, our models and tokenizers are included in Hugging Face's Transformers, and they are easily available by names like bert-base-japanese: https://huggingface.co/transformers/pretrained_models.html
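
For example, a minimal sketch of loading by name (this assumes a Transformers version recent enough to register the bert-base-japanese name; later releases use the cl-tohoku/bert-base-japanese identifier instead):

# Sketch: load the tokenizer and masked-LM model by shortcut name.
from transformers import BertForMaskedLM, BertJapaneseTokenizer

tokenizer = BertJapaneseTokenizer.from_pretrained("bert-base-japanese")
model = BertForMaskedLM.from_pretrained("bert-base-japanese")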

nuwanq commented 4 years ago

Could you try again with --model_name_or_path bert-base-japanese --config_name bert-base-japanese --tokenizer_name bert-base-japanese?

As you may know, our models and tokenizers are included in Hugging Face's Transformers, and they are easily available by names like bert-base-japanese: https://huggingface.co/transformers/pretrained_models.html

Thank you for the guidance. I will try it.

nuwanq commented 4 years ago

I tried with your parameters, and it seems it cannot find the "bert-base-japanese" tokenizer:

import subprocess

cmd = """
python3 run_language_modeling.py
    --train_data_file ./train.txt
    --eval_data_file ./valid.txt
    --output_dir ./test2
    --model_type bert
    --model_name_or_path bert-base-japanese
    --mlm
    --config_name bert-base-japanese
    --tokenizer_name bert-base-japanese
    --do_train
    --do_eval
    --line_by_line
    --learning_rate 5e-5
    --num_train_epochs 5
    --save_total_limit 20
    --save_steps 5000
    --per_gpu_train_batch_size 8
    --warmup_steps=5000
    --logging_steps=100
    --gradient_accumulation_steps=4
    --mlm_probability=0.15
    --seed 666
    --block_size=512
""".replace("\n", " ")

# run the assembled command (execution via subprocess is assumed here)
subprocess.run(cmd, shell=True)

03/31/2020 05:09:43 - INFO - transformers.tokenization_utils -   Model name 'bert-base-japanese' not found in model shortcut name list (bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese, bert-base-german-cased, bert-large-uncased-whole-word-masking, bert-large-cased-whole-word-masking, bert-large-uncased-whole-word-masking-finetuned-squad, bert-large-cased-whole-word-masking-finetuned-squad, bert-base-cased-finetuned-mrpc, bert-base-german-dbmdz-cased, bert-base-german-dbmdz-uncased, bert-base-finnish-cased-v1, bert-base-finnish-uncased-v1, bert-base-dutch-cased). Assuming 'bert-base-japanese' is a path, a model identifier, or url to a directory containing tokenizer files.
Traceback (most recent call last):
  File "run_language_modeling.py", line 799, in <module>
    main()
  File "run_language_modeling.py", line 706, in main
    tokenizer = tokenizer_class.from_pretrained(args.tokenizer_name, cache_dir=args.cache_dir)
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/transformers/tokenization_utils.py", line 393, in from_pretrained
    return cls._from_pretrained(*inputs, **kwargs)
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/transformers/tokenization_utils.py", line 496, in _from_pretrained
    list(cls.vocab_files_names.values()),
OSError: Model name 'bert-base-japanese' was not found in tokenizers model name list (bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese, bert-base-german-cased, bert-large-uncased-whole-word-masking, bert-large-cased-whole-word-masking, bert-large-uncased-whole-word-masking-finetuned-squad, bert-large-cased-whole-word-masking-finetuned-squad, bert-base-cased-finetuned-mrpc, bert-base-german-dbmdz-cased, bert-base-german-dbmdz-uncased, bert-base-finnish-cased-v1, bert-base-finnish-uncased-v1, bert-base-dutch-cased). We assumed 'bert-base-japanese' was a path, a model identifier, or url to a directory containing vocabulary files named ['vocab.txt'] but couldn't find such vocabulary files at this path or url.

Sorry for the trouble. What should I do next?

singletongue commented 4 years ago

Sorry, could you try again with --model_type bert-japanese? If that fails, try updating transformers to the latest version.
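
A quick way to confirm which version is actually installed (a minimal sketch):

# Sketch: print the installed transformers version to confirm the update.
import transformers
print(transformers.__version__)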

nuwanq commented 4 years ago

Sorry again; I tried it and it gave this error:

03/31/2020 05:34:45 - WARNING - __main__ -   Process rank: -1, device: cuda, n_gpu: 1, distributed training: False, 16-bits training: False
Traceback (most recent call last):
  File "run_language_modeling.py", line 799, in <module>
    main()
  File "run_language_modeling.py", line 696, in main
    config_class, model_class, tokenizer_class = MODEL_CLASSES[args.model_type]
KeyError: 'bert-japanese'

I have already updated transformers.

Sorry for the trouble and thank you.

singletongue commented 4 years ago

Maybe the example script you're running is old (I assume it's from the Transformers repo). Could you obtain the latest version and try again?

nuwanq commented 4 years ago

Thank you for the reply. Previously, I had only downloaded run_language_modeling.py separately and installed transformers via pip, and I did not run the command inside the transformers directory.

This time I installed it from the repo. Now I run the script inside transformers/examples and it works. I'm not sure which change fixed the issue.

It may be that the example script was old, as you said. Thank you very much for your time and for the help.
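
For completeness, a sketch of the setup that ended up working (the exact commands are assumptions reconstructed from the description above, not taken from the thread):

# Sketch (commands assumed): install Transformers from source and run
# the example script from inside the repo.
import subprocess

subprocess.run("git clone https://github.com/huggingface/transformers.git", shell=True)
subprocess.run("pip install -e ./transformers", shell=True)
# run_language_modeling.py is then run from within transformers/examples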