cl-tohoku / bert-japanese

BERT models for Japanese text.
Apache License 2.0

strange tokenizer results with self-pretrained model #32

Open lightercs opened 2 years ago

lightercs commented 2 years ago

Hi, I trained a new vocab and BERT model with my own datasets following your scripts, with the MeCab dictionary changed. But when I test it, quite strange results are returned every time. Would you please help me check on this and give me some advice?

Details are below. My code:

import torch
from transformers import BertJapaneseTokenizer, BertForMaskedLM

model_name_or_path = "/content/drive/MyDrive/bert/new_bert/"
# Load the self-trained tokenizer, pointing MeCab at the custom UniDic directory.
tokenizer = BertJapaneseTokenizer.from_pretrained(model_name_or_path, mecab_kwargs={"mecab_option": "-d /content/drive/MyDrive/UniDic"})

model = BertForMaskedLM.from_pretrained(model_name_or_path)
# Encode a sentence containing a [MASK] token and show how it was tokenized.
input_ids = tokenizer.encode(f"青葉山で{tokenizer.mask_token}の研究をしています。", return_tensors="pt")
print(tokenizer.convert_ids_to_tokens(input_ids[0].tolist()))

masked_index = torch.where(input_ids == tokenizer.mask_token_id)[1][0].tolist()
print(masked_index)

result = model(input_ids)
pred_ids = result[0][:, masked_index].topk(5).indices.tolist()[0]
for pred_id in pred_ids:
    output_ids = input_ids.tolist()[0]
    output_ids[masked_index] = pred_id
    print(tokenizer.decode(output_ids))

the result:

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'BertTokenizer'. 
The class this function is called from is 'BertJapaneseTokenizer'.
Some weights of the model checkpoint at /content/drive/MyDrive/bert/new_bert/ were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
['[CLS]', '青葉', '##山', '##で', '[MASK]', 'の', '##研', '##究', '##を', '##して', '##いま', '##す', '。', '[SEP]']
4
[CLS] 青葉山で ヒダ の研究をしています 。 [SEP]
[CLS] 青葉山で 宿つ の研究をしています 。 [SEP]
[CLS] 青葉山で 法外 の研究をしています 。 [SEP]
[CLS] 青葉山で 頑丈 の研究をしています 。 [SEP]
[CLS] 青葉山で弱 の研究をしています 。 [SEP]

The tokenization result, repeated below, is quite odd in the first place, and then the prediction results are strange as well.

['[CLS]', '青葉', '##山', '##で', '[MASK]', 'の', '##研', '##究', '##を', '##して', '##いま', '##す', '。', '[SEP]']

But when I switch to your pre-trained bert-base-v2 tokenizer (still using my model), the results change a lot.

Some weights of the model checkpoint at /content/drive/MyDrive/kindai_bert/kindai_bert/ were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
['[CLS]', '青葉', '山', 'で', '[MASK]', 'の', '研究', 'を', 'し', 'て', 'い', 'ます', '。', '[SEP]']
4
[CLS] 青葉 山 で 宮司 の 研究 を し て い ます 。 [SEP]
[CLS] 青葉 山 で 飛翔 の 研究 を し て い ます 。 [SEP]
[CLS] 青葉 山 で 旧来 の 研究 を し て い ます 。 [SEP]
[CLS] 青葉 山 で 生野 の 研究 を し て い ます 。 [SEP]
[CLS] 青葉 山 でד の 研究 を し て い ます 。 [SEP]

My local BERT folder looks like this: (image)

Thank you in advance.

singletongue commented 2 years ago

Hi, @lightercs.

Would you please show me the content of your config.json file? There may be some misconfiguration in how the tokenizer is set up.
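
(If it helps, here is a minimal sketch for dumping both files from Colab, assuming the checkpoint path used above; tokenizer_config.json is worth checking too, since the warning about 'BertTokenizer' typically comes from its tokenizer_class field:)

import json

# Hypothetical path: the checkpoint directory mentioned earlier in this thread.
checkpoint_dir = "/content/drive/MyDrive/bert/new_bert/"
for name in ("config.json", "tokenizer_config.json"):
    try:
        with open(checkpoint_dir + name) as f:
            print(name, json.dumps(json.load(f), indent=2))
    except FileNotFoundError:
        print(name, "not found")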

Thank you.

lightercs commented 2 years ago

@singletongue, thank you for your reply! As far as I remember, I didn't change any parameters and just used yours; the config.json file is as in the picture below: (image)

Which parameter should I change when training? (Only the datasets and the MeCab dictionary were changed; the vocab_size remains the same.)

singletongue commented 2 years ago

OK then, what about initializing the tokenizer as follows?

tokenizer = BertJapaneseTokenizer.from_pretrained(
    model_name_or_path,
    do_lower_case=False,
    word_tokenizer_type="mecab",
    subword_tokenizer_type="wordpiece",
    mecab_kwargs={"mecab_option": "-d /content/drive/MyDrive/UniDic"}
)

You may have to modify some of the values for your configuration.
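
(Once the arguments above work, one option, just a sketch assuming write access to the checkpoint directory, is to save them back into the checkpoint so that a plain from_pretrained call picks them up next time:)

# Sketch: persist the working tokenizer settings (written to tokenizer_config.json)
# and run a quick tokenization sanity check.
tokenizer.save_pretrained(model_name_or_path)
print(tokenizer.tokenize("青葉山で研究をしています。"))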

Thank you.

lightercs commented 2 years ago

@singletongue, thank you for the suggestion! I modified the configuration as you suggested, and the tokenizer.convert_ids_to_tokens(input_ids[0].tolist()) call now behaves normally (below), but the prediction results still look quite different from when applying cl-tohoku/bert-base-v2.

Local self-trained model, with the local model's tokenizer:

['[CLS]', '青葉', '山', 'で', '[MASK]', 'の', '研究', 'を', 'し', 'て', 'います', '。', '[SEP]']
4
[CLS] 青葉 山 で ヒダ の 研究 を し て います 。 [SEP]
[CLS] 青葉 山 で 宿つ の 研究 を し て います 。 [SEP]
[CLS] 青葉 山 で 赤裸 の 研究 を し て います 。 [SEP]
[CLS] 青葉 山 で 石 の 研究 を し て います 。 [SEP]
[CLS] 青葉 山 で迹 の 研究 を し て います 。 [SEP] 

Local self-trained model, with the tohoku/bert-base-v2 tokenizer:

['[CLS]', '青葉', '山', 'で', '[MASK]', 'の', '研究', 'を', 'し', 'て', 'いま', '##す', '。', '[SEP]']
4
[CLS] 青葉 山 で 宮司 の 研究 を し て います 。 [SEP]
[CLS] 青葉 山 で 飛翔 の 研究 を し て います 。 [SEP]
[CLS] 青葉 山 で稽 の 研究 を し て います 。 [SEP]
[CLS] 青葉 山 で IBM の 研究 を し て います 。 [SEP]
[CLS] 青葉 山 で 高かっ の 研究 を し て います 。 [SEP]

Another interesting thing: I can understand that these two tokenizers are interchangeable to some degree because their vocab sizes are the same. But when I try the pattern of my local model + your bert-base-japanese (version 1, with vocab size 32000), even though the vocab sizes don't match, the result below seems quite reasonable and much better than the above two.


4
[CLS] 青葉 山 で ダイヤモンド の 研究 を し て います 。 [SEP]
[CLS] 青葉 山 で 粘膜 の 研究 を し て います 。 [SEP]
[CLS] 青葉 山 で蝶 の 研究 を し て います 。 [SEP]
[CLS] 青葉 山 で 師範 の 研究 を し て います 。 [SEP]
[CLS] 青葉 山 で 原動力 の 研究 を し て います 。 [SEP]
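
(A quick way to check the vocab-size compatibility mentioned above, as a sketch reusing the tokenizer and model variables from the earlier snippet:)

# Sketch: the tokenizer's vocab size and the model's configured vocab size should
# match; a mismatch means some token ids hit embeddings they were never trained
# with, or fall out of range entirely.
print(len(tokenizer), model.config.vocab_size)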

What's your opinion on this? It also reminds me of one more detail: when training the tokenizer, in order to give MeCab the address of the new dictionary, I specified the MeCab dictionary option as "-d /content/drive/MyDrive/UniDic" and added it as a new mecab_option = UniDic. Is there any relation?

Thank you.

singletongue commented 2 years ago

Could you show me the full command you executed when you trained the tokenizer? (i.e., python train_tokenizer.py <all_the_options_you_specified>)

Thank you.

lightercs commented 2 years ago

Hi @singletongue, sorry for missing your update and responding late. I added os.environ["TOKENIZERS_PARALLELISM"] = "false" to the train_tokenizer.py script, changed from transformers.tokenization_bert_japanese import BertJapaneseTokenizer
to from transformers import BertJapaneseTokenizer due to an import error, and below is my command:

python train_tokenizer.py --input_files=D:\Data\BERT\Data\total.txt --output_dir=D:\Data\BERT\Model\bert\Vocab\ --tokenizer_type=wordpiece --mecab_dic_type=unidic --vocab_size=32768 --limit_alphabet=6129 --num_unused_tokens=10

Thank you greatly!

singletongue commented 2 years ago

Thank you for the information. I understand that, when training the tokenizer, you did not specify the MeCab path (/content/drive/MyDrive/UniDic) which you first mentioned. Then, would you try initializing the tokenizer as follows?

tokenizer = BertJapaneseTokenizer.from_pretrained(
    model_name_or_path,
    do_lower_case=False,
    word_tokenizer_type="mecab",
    subword_tokenizer_type="wordpiece",
    mecab_kwargs={"mecab_dic": "unidic"},
)
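
(Note, as an assumption about how transformers resolves this setting: mecab_dic="unidic" makes the tokenizer use the dictionary shipped with the unidic pip package rather than an arbitrary local path, roughly as in this sketch:)

# Sketch: where transformers looks when mecab_dic="unidic" is given;
# this directory is then passed to MeCab via the -d option.
import unidic
print(unidic.DICDIR)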

lightercs commented 2 years ago

Hi @singletongue, thank you for your reply, and sorry for my poor explanation. I actually registered the new MeCab path as a new MeCab dictionary ("unidic") in the pre-tokenizers.py file and in some other places related to the MeCab options. The code is as shown in the picture: (image)

And when I tried to change mecab_dic to unidic at evaluation time, ValueError: Invalid mecab_dic is specified. was returned.

Is it because I changed mecab_option and mecab_dic_type in my local pre-tokenizers.py file when training the tokenizer, so that even though I changed mecab_kwargs and specified the new dictionary path, the tokenizer still cannot match the setup I used for training, since the BertJapaneseTokenizer.from_pretrained method in the unmodified Transformers library stays the same when I test in Colab?

If so, what should I do if I want to train a new vocab with a new dictionary and use it afterwards?

Thanks in advance.

singletongue commented 2 years ago

Yes, it seems that the inconsistent tokenizer configuration between your modified version and ours (Hugging Face's) is causing the problem.

Could you show me the full traceback you get when the ValueError: Invalid mecab_dic is specified error is raised? And could you specify which version of the transformers library you're using?

lightercs commented 2 years ago

@singletongue, thank you greatly for your reply! Then should I modify the transformers-related dist-packages on Colab, such as the tokenization file, just as I did for local training?

The transformers library I'm using for testing is transformers==4.18.0, and I pasted the traceback from the error below. (Since naming the new mecab_dic unidic would clash with another library, I replaced it with Unidic.)

The tokenizer class you load from this checkpoint is 'BertTokenizer'. 
The class this function is called from is 'BertJapaneseTokenizer'.
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
[<ipython-input-15-e6e9378d9665>](https://localhost:8080/#) in <module>
      7     word_tokenizer_type="mecab",
      8     subword_tokenizer_type="wordpiece",
----> 9     mecab_kwargs={"mecab_dic":"Unidic"}
     10 )
     11 

3 frames
[/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_base.py](https://localhost:8080/#) in from_pretrained(cls, pretrained_model_name_or_path, *init_inputs, **kwargs)
   1785             use_auth_token=use_auth_token,
   1786             cache_dir=cache_dir,
-> 1787             **kwargs,
   1788         )
   1789 

[/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_base.py](https://localhost:8080/#) in _from_pretrained(cls, resolved_vocab_files, pretrained_model_name_or_path, init_configuration, use_auth_token, cache_dir, *init_inputs, **kwargs)
   1913         # Instantiate tokenizer.
   1914         try:
-> 1915             tokenizer = cls(*init_inputs, **init_kwargs)
   1916         except OSError:
   1917             raise OSError(

[/usr/local/lib/python3.7/dist-packages/transformers/models/bert_japanese/tokenization_bert_japanese.py](https://localhost:8080/#) in __init__(self, vocab_file, do_lower_case, do_word_tokenize, do_subword_tokenize, word_tokenizer_type, subword_tokenizer_type, never_split, unk_token, sep_token, pad_token, cls_token, mask_token, mecab_kwargs, **kwargs)
    150             elif word_tokenizer_type == "mecab":
    151                 self.word_tokenizer = MecabTokenizer(
--> 152                     do_lower_case=do_lower_case, never_split=never_split, **(mecab_kwargs or {})
    153                 )
    154             else:

[/usr/local/lib/python3.7/dist-packages/transformers/models/bert_japanese/tokenization_bert_japanese.py](https://localhost:8080/#) in __init__(self, do_lower_case, never_split, normalize_text, mecab_dic, mecab_option)
    278 
    279             else:
--> 280                 raise ValueError("Invalid mecab_dic is specified.")
    281 
    282             mecabrc = os.path.join(dic_dir, "mecabrc")

ValueError: Invalid mecab_dic is specified. 
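
(For reference, a simplified sketch of the check that raises here, paraphrasing the transformers 4.x source shown in the frames above: only the dictionary names bundled with the library are accepted, so a custom name such as "Unidic" always falls through to the error:)

def resolve_mecab_dic(mecab_dic):
    # Simplified sketch of MecabTokenizer's dictionary resolution; the real
    # source differs in details, but only these names are valid for mecab_dic.
    if mecab_dic == "ipadic":
        import ipadic
        return ipadic.DICDIR
    elif mecab_dic == "unidic_lite":
        import unidic_lite
        return unidic_lite.DICDIR
    elif mecab_dic == "unidic":
        import unidic
        return unidic.DICDIR
    raise ValueError("Invalid mecab_dic is specified.")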

Should I change the tokenizers and transformers libraries to these? --> tokenizers==0.9.2 transformers==3.4.0. I tried it and the error traceback is as below:

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'BertTokenizer'. 
The class this function is called from is 'BertJapaneseTokenizer'.
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
[<ipython-input-23-e6e9378d9665>](https://localhost:8080/#) in <module>
      7     word_tokenizer_type="mecab",
      8     subword_tokenizer_type="wordpiece",
----> 9     mecab_kwargs={"mecab_dic":"Unidic"}
     10 )
     11 

3 frames
[/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_base.py](https://localhost:8080/#) in from_pretrained(cls, pretrained_model_name_or_path, *init_inputs, **kwargs)
   1785             return obj
   1786 
-> 1787         # add_type_field=True to allow dicts in the kwargs / differentiate from AddedToken serialization
   1788         tokenizer_config = convert_added_tokens(tokenizer_config, add_type_field=True)
   1789         with open(tokenizer_config_file, "w", encoding="utf-8") as f:

[/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_base.py](https://localhost:8080/#) in _from_pretrained(cls, resolved_vocab_files, pretrained_model_name_or_path, init_configuration, use_auth_token, cache_dir, *init_inputs, **kwargs)
   1913         """
   1914         Find the correct padding/truncation strategy with backward compatibility
-> 1915         for old arguments (truncation_strategy and pad_to_max_length) and behaviors.
   1916         """
   1917         old_truncation_strategy = kwargs.pop("truncation_strategy", "do_not_truncate")

/usr/local/lib/python3.7/dist-packages/transformers/models/bert_japanese/tokenization_bert_japanese.py in __init__(self, vocab_file, do_lower_case, do_word_tokenize, do_subword_tokenize, word_tokenizer_type, subword_tokenizer_type, never_split, unk_token, sep_token, pad_token, cls_token, mask_token, mecab_kwargs, **kwargs)

/usr/local/lib/python3.7/dist-packages/transformers/models/bert_japanese/tokenization_bert_japanese.py in __init__(self, do_lower_case, never_split, normalize_text, mecab_dic, mecab_option)

ValueError: Invalid mecab_dic is specified.

Thank you for your time.

singletongue commented 2 years ago

Thank you for the information, @lightercs. Yes, it seems that you should use your custom tokenization files as you did when you performed training, since the transformers library does not know anything about your customization of the tokenization scripts.
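
(For what it's worth, here is a sketch of one way to point the stock tokenizer at a custom dictionary directory without editing the library, assuming transformers 4.x behavior where mecab_dic=None makes mecab_option be passed through as-is; the mecabrc path is an assumption about the dictionary layout:)

from transformers import BertJapaneseTokenizer

# Sketch: with mecab_dic=None, the stock MecabTokenizer hands mecab_option
# straight to fugashi, so the custom UniDic directory is used without any
# modification of transformers. The -r path is an assumption.
tokenizer = BertJapaneseTokenizer.from_pretrained(
    "/content/drive/MyDrive/bert/new_bert/",
    do_lower_case=False,
    word_tokenizer_type="mecab",
    subword_tokenizer_type="wordpiece",
    mecab_kwargs={
        "mecab_dic": None,
        "mecab_option": "-d /content/drive/MyDrive/UniDic -r /content/drive/MyDrive/UniDic/mecabrc",
    },
)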

lightercs commented 2 years ago

Hi @singletongue (Masatoshi Suzuki), I tried to change the mecab_dic options in Colab's transformers/tokenization_utils_base.py script following your last advice, but it seems transformers still cannot accept the new mecab_dic option I added. Did I change the wrong place? If so, what is the right way to specify a new mecab_dic when training the tokenizer, and then use it after the BERT model is trained? (By the way, I tried to contact you by email a few days ago, but your Tohoku University email address seems to be no longer available :(.)