lightercs opened this issue 2 years ago
Hi, @lightercs.
Would you please show me the content of your config.json file?
There could be some misconfiguration in how the tokenizer is being used.
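For example, something like this minimal sketch would dump the relevant files (here `model_dir` is just a placeholder for your local model directory):
```python
import json
from pathlib import Path

model_dir = Path("path/to/your/local/model")  # placeholder: your local model directory

# Print config.json and, if present, tokenizer_config.json
for name in ["config.json", "tokenizer_config.json"]:
    path = model_dir / name
    if path.exists():
        print(f"=== {name} ===")
        print(json.dumps(json.loads(path.read_text(encoding="utf-8")),
                         indent=2, ensure_ascii=False))
```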
Thank you.
@singletongue, thank you for your reply!
I remember I didn't change any parameters and just used yours; the config.json file is as in the picture below.
Which parameter should I change when training?
(Only the datasets and the MeCab dictionary were changed, since the vocab_size remains the same.)
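As a quick double-check that the vocab and the config actually agree, I could run something like the following rough sketch (`model_dir` is a placeholder):
```python
import json
from pathlib import Path

model_dir = Path("path/to/local/model")  # placeholder

# Compare the number of entries in vocab.txt with vocab_size in config.json
config = json.loads((model_dir / "config.json").read_text(encoding="utf-8"))
vocab_entries = (model_dir / "vocab.txt").read_text(encoding="utf-8").splitlines()
print("config.json vocab_size:", config["vocab_size"])
print("vocab.txt entries     :", len(vocab_entries))
```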
OK then, what about initializing the tokenizer with the following code?
tokenizer = BertJapaneseTokenizer.from_pretrained(
model_name_or_path,
do_lower_case=False,
word_tokenizer_type="mecab",
subword_tokenizer_type="wordpiece",
mecab_kwargs={"mecab_option": "-d /content/drive/MyDrive/UniDic"}
)
You may have to modify some of the values for your configuration.
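A quick sanity check after loading could look like the sketch below (the sentence is arbitrary, and cl-tohoku/bert-base-japanese-v2 is used only as a reference point; adjust the arguments to your setup):
```python
from transformers import BertJapaneseTokenizer

text = "青葉山で植物の研究をしています。"  # arbitrary example sentence

local_tokenizer = BertJapaneseTokenizer.from_pretrained(
    "path/to/your/local/model",  # placeholder
    do_lower_case=False,
    word_tokenizer_type="mecab",
    subword_tokenizer_type="wordpiece",
)
reference_tokenizer = BertJapaneseTokenizer.from_pretrained("cl-tohoku/bert-base-japanese-v2")

# The two token sequences should only differ where the vocabularies differ.
print("local     :", local_tokenizer.tokenize(text))
print("reference :", reference_tokenizer.tokenize(text))
```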
Thank you.
@singletongue, thank you for the suggestion!
I modified the config as you suggested, and the tokenizer.convert_ids_to_tokens(input_ids[0].tolist()) output below looks normal, but the prediction results still appear quite different from those obtained with cl-tohoku/bert-base-v2.
Local self-trained model, with the local model's tokenizer:
['[CLS]', '青葉', '山', 'で', '[MASK]', 'の', '研究', 'を', 'し', 'て', 'います', '。', '[SEP]']
4
[CLS] 青葉 山 で ヒダ の 研究 を し て います 。 [SEP]
[CLS] 青葉 山 で 宿つ の 研究 を し て います 。 [SEP]
[CLS] 青葉 山 で 赤裸 の 研究 を し て います 。 [SEP]
[CLS] 青葉 山 で 石 の 研究 を し て います 。 [SEP]
[CLS] 青葉 山 で迹 の 研究 を し て います 。 [SEP]
Local self-trained model, with the cl-tohoku/bert-base-v2 tokenizer:
['[CLS]', '青葉', '山', 'で', '[MASK]', 'の', '研究', 'を', 'し', 'て', 'いま', '##す', '。', '[SEP]']
4
[CLS] 青葉 山 で 宮司 の 研究 を し て います 。 [SEP]
[CLS] 青葉 山 で 飛翔 の 研究 を し て います 。 [SEP]
[CLS] 青葉 山 で稽 の 研究 を し て います 。 [SEP]
[CLS] 青葉 山 で IBM の 研究 を し て います 。 [SEP]
[CLS] 青葉 山 で 高かっ の 研究 を し て います 。 [SEP]
Another interesting thing is that I can understand these two tokenizers being interchangeable to some degree, because their vocab sizes are the same. But when I try the pattern "my local model + your bert-base-japanese (1st version, with vocab size 32000)", although the vocab sizes do not match, the result below seems quite reasonable and much better than the two above.
4
[CLS] 青葉 山 で ダイヤモンド の 研究 を し て います 。 [SEP]
[CLS] 青葉 山 で 粘膜 の 研究 を し て います 。 [SEP]
[CLS] 青葉 山 で蝶 の 研究 を し て います 。 [SEP]
[CLS] 青葉 山 で 師範 の 研究 を し て います 。 [SEP]
[CLS] 青葉 山 で 原動力 の 研究 を し て います 。 [SEP]
What's your opinion on this? It also reminds me of one more detail: when training the tokenizer, in order to give a new dictionary path to MeCab, I set the mecab dict to ```"-d /content/drive/MyDrive/UniDic"``` and added it as a new ```mecab_option = UniDic```. Is there any relation?
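(For context, the top-5 predictions above come from a fill-mask style loop roughly like the sketch below; the path and variable names are placeholders, not my exact script:)
```python
import torch
from transformers import BertForMaskedLM, BertJapaneseTokenizer

model_dir = "path/to/local/model"  # placeholder
tokenizer = BertJapaneseTokenizer.from_pretrained(model_dir)
model = BertForMaskedLM.from_pretrained(model_dir)
model.eval()

text = "青葉山で[MASK]の研究をしています。"
inputs = tokenizer(text, return_tensors="pt")
input_ids = inputs["input_ids"]
print(tokenizer.convert_ids_to_tokens(input_ids[0].tolist()))

# Position of the [MASK] token (printed as "4" in the outputs above)
mask_pos = (input_ids[0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0].item()
print(mask_pos)

with torch.no_grad():
    logits = model(**inputs).logits

# Fill the mask with each of the top-5 candidates and decode the sentence
for token_id in logits[0, mask_pos].topk(5).indices.tolist():
    filled = input_ids[0].clone()
    filled[mask_pos] = token_id
    print(tokenizer.decode(filled))
```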
Thank you.
Could you show me the full command you executed when you trained the tokenizer?
(i.e., python train_tokenizer.py <all_the_options_you_specified>)
Thank you.
Hi @singletongue, sorry for missing your update and responding late.
I added os.environ["TOKENIZERS_PARALLELISM"] = "false" to the train_tokenizer.py script,
changed from transformers.tokenization_bert_japanese import BertJapaneseTokenizer to from transformers import BertJapaneseTokenizer due to an import error, and below is my command:
python train_tokenizer.py --input_files=D:\Data\BERT\Data\total.txt --output_dir=D:\Data\BERT\Model\bert\Vocab\ --tokenizer_type=wordpiece --mecab_dic_type=unidic --vocab_size=32768 --limit_alphabet=6129 --num_unused_tokens=10
Thank you greatly!
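(For reference, the import difference between transformers versions can be handled with a small fallback like this sketch, rather than editing the script per version:)
```python
# The module path transformers.tokenization_bert_japanese only exists in older
# (3.x) releases; the top-level import works in recent versions.
try:
    from transformers import BertJapaneseTokenizer
except ImportError:
    from transformers.tokenization_bert_japanese import BertJapaneseTokenizer
```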
Thank you for the information. I understand that, when training the tokenizer, you did not specify the mecab path (/content/drive/MyDrive/UniDic) which you first mentioned.
Then, would you try initializing the tokenizer with the following code?
tokenizer = BertJapaneseTokenizer.from_pretrained(
model_name_or_path,
do_lower_case=False,
word_tokenizer_type="mecab",
subword_tokenizer_type="wordpiece",
mecab_kwargs={"mecab_dic": "unidic"},
)
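(As far as I know, mecab_dic="unidic" resolves the dictionary directory through the unidic pip package, so a quick check along these lines may help; this is just a sketch:)
```python
# The unidic package is installed with:
#   pip install unidic && python -m unidic download
import os
import unidic

print(unidic.DICDIR)                 # the dictionary directory MeCab will be pointed at
print(os.path.isdir(unidic.DICDIR))  # should be True after the download step
```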
Hi @singletongue, thank you for your reply.
Sorry for my poor explanation. I actually specified the new mecab path as a new mecab dic ("unidic") in the pre-tokenizers.py file and in some other places related to the mecab options. The code is as shown in the pic.
And when I tried changing mecab_dic to unidic while testing, ValueError: Invalid mecab_dic is specified. was returned.
Is it because I changed mecab_option and mecab_dic_type in my local pre-tokenizers.py files when training the tokenizer, so that even though I change mecab_kwargs and specify the new dictionary path, the tokenizer still cannot behave the same as during training, since the BertJapaneseTokenizer.from_pretrained method in the transformers library remains unmodified when I test in Colab?
If so, what should I do if I want to train a new vocab with a new dictionary and then use it afterwards?
Thanks in advance.
Yes, it seems that the inconsistent tokenizer configuration between your modified version and the stock (Hugging Face) one is causing the problem.
Could you show me the full traceback you get when ValueError: Invalid mecab_dic is specified is raised?
And could you specify which version of the transformers library you're using?
@singletongue, thank you greatly for your reply!
Then should I change the transformers-related dist-packages on Colab, like the tokenizer.py file, just as I did for local training?
The transformers library I'm using when testing is transformers==4.18.0.
The traceback for the error is pasted below. (Since naming the new mecab_dic unidic would clash with another library, I replaced it with Unidic.)
The tokenizer class you load from this checkpoint is 'BertTokenizer'.
The class this function is called from is 'BertJapaneseTokenizer'.
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-15-e6e9378d9665> in <module>
7 word_tokenizer_type="mecab",
8 subword_tokenizer_type="wordpiece",
----> 9 mecab_kwargs={"mecab_dic":"Unidic"}
10 )
11
3 frames
/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_base.py in from_pretrained(cls, pretrained_model_name_or_path, *init_inputs, **kwargs)
1785 use_auth_token=use_auth_token,
1786 cache_dir=cache_dir,
-> 1787 **kwargs,
1788 )
1789
/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_base.py in _from_pretrained(cls, resolved_vocab_files, pretrained_model_name_or_path, init_configuration, use_auth_token, cache_dir, *init_inputs, **kwargs)
1913 # Instantiate tokenizer.
1914 try:
-> 1915 tokenizer = cls(*init_inputs, **init_kwargs)
1916 except OSError:
1917 raise OSError(
/usr/local/lib/python3.7/dist-packages/transformers/models/bert_japanese/tokenization_bert_japanese.py in __init__(self, vocab_file, do_lower_case, do_word_tokenize, do_subword_tokenize, word_tokenizer_type, subword_tokenizer_type, never_split, unk_token, sep_token, pad_token, cls_token, mask_token, mecab_kwargs, **kwargs)
150 elif word_tokenizer_type == "mecab":
151 self.word_tokenizer = MecabTokenizer(
--> 152 do_lower_case=do_lower_case, never_split=never_split, **(mecab_kwargs or {})
153 )
154 else:
/usr/local/lib/python3.7/dist-packages/transformers/models/bert_japanese/tokenization_bert_japanese.py in __init__(self, do_lower_case, never_split, normalize_text, mecab_dic, mecab_option)
278
279 else:
--> 280 raise ValueError("Invalid mecab_dic is specified.")
281
282 mecabrc = os.path.join(dic_dir, "mecabrc")
ValueError: Invalid mecab_dic is specified.
Should I change the tokenizers and transformers libraries to tokenizers==0.9.2 and transformers==3.4.0? I tried it, and the error traceback is as below:
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
The tokenizer class you load from this checkpoint is 'BertTokenizer'.
The class this function is called from is 'BertJapaneseTokenizer'.
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-23-e6e9378d9665> in <module>
7 word_tokenizer_type="mecab",
8 subword_tokenizer_type="wordpiece",
----> 9 mecab_kwargs={"mecab_dic":"Unidic"}
10 )
11
3 frames
/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_base.py in from_pretrained(cls, pretrained_model_name_or_path, *init_inputs, **kwargs)
1785 return obj
1786
-> 1787 # add_type_field=True to allow dicts in the kwargs / differentiate from AddedToken serialization
1788 tokenizer_config = convert_added_tokens(tokenizer_config, add_type_field=True)
1789 with open(tokenizer_config_file, "w", encoding="utf-8") as f:
/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_base.py in _from_pretrained(cls, resolved_vocab_files, pretrained_model_name_or_path, init_configuration, use_auth_token, cache_dir, *init_inputs, **kwargs)
1913 """
1914 Find the correct padding/truncation strategy with backward compatibility
-> 1915 for old arguments (truncation_strategy and pad_to_max_length) and behaviors.
1916 """
1917 old_truncation_strategy = kwargs.pop("truncation_strategy", "do_not_truncate")
/usr/local/lib/python3.7/dist-packages/transformers/models/bert_japanese/tokenization_bert_japanese.py in __init__(self, vocab_file, do_lower_case, do_word_tokenize, do_subword_tokenize, word_tokenizer_type, subword_tokenizer_type, never_split, unk_token, sep_token, pad_token, cls_token, mask_token, mecab_kwargs, **kwargs)
/usr/local/lib/python3.7/dist-packages/transformers/models/bert_japanese/tokenization_bert_japanese.py in __init__(self, do_lower_case, never_split, normalize_text, mecab_dic, mecab_option)
ValueError: Invalid mecab_dic is specified.
Thank you for your time.
Thank you for the information, @lightercs.
Yes, it seems that you should use your custom tokenization files as you did when you performed training, since the transformers library does not know anything about your customization of the tokenization scripts.
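One possible alternative, only as a sketch and assuming the stock transformers 4.x MecabTokenizer, is to avoid modifying the library at all: leave mecab_dic unset (None) so the built-in dictionary check is skipped, and pass the dictionary directory you mentioned earlier directly through mecab_option:
```python
from transformers import BertJapaneseTokenizer

# Sketch: point MeCab at a custom dictionary without editing the library code.
# Depending on the environment, a "-r /path/to/mecabrc" option may also be needed.
tokenizer = BertJapaneseTokenizer.from_pretrained(
    "path/to/your/local/model",  # placeholder
    do_lower_case=False,
    word_tokenizer_type="mecab",
    subword_tokenizer_type="wordpiece",
    mecab_kwargs={
        "mecab_dic": None,
        "mecab_option": "-d /content/drive/MyDrive/UniDic",
    },
)
```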
Hi, @Masatoshi Suzuki,
I tried changing the mecab_dic options in Colab's transformers/tokenization_utils_base.py script since your last advice, but it seems transformers still cannot accept the new mecab_dic option I added.
Did I change the wrong place? If so, what is the right way to specify a new mecab_dic when training the tokenizer and to use it after the BERT model is trained? (Btw, I tried to contact you by email a few days ago, but your Tohoku University email seems to be no longer available :( )
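(For what it's worth, the tracebacks above show the error being raised from transformers/models/bert_japanese/tokenization_bert_japanese.py rather than tokenization_utils_base.py; a minimal check of which copy of that module is actually being imported:)
```python
# Print the path of the module that Python actually loads, to confirm whether
# the edited copy on Colab is the one in use.
from transformers.models.bert_japanese import tokenization_bert_japanese

print(tokenization_bert_japanese.__file__)
```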
Hi, I trained a new vocab and BERT model on my own datasets following your scripts, with the MeCab dictionary changed, but when I test it, quite strange results are returned every time. Would you please help me check this and give me some advice?
Details as below. My code:
The result:
The tokenization result is firstly quite odd, as shown below, and then the prediction results follow.
But when I change to your pre-trained tokenizer bert-base-v2 (still using my model), the results change a lot.
My local bert folder looks like:
Thank you in advance.