Open hccho2 opened 2 years ago
@hccho2 -san
I think you read the Korean version of the book. I'm glad my book is being read outside of Japan.
I appreciate your valuable suggestion.
As you mentioned, the code is a bit wasteful and confusing. I agree, and I commented on it in the code above this section as follows (translated from the original Japanese):

Ideally we would simply set TEXT.vocab.stoi = vocab_bert (stoi means string_to_ID, a dictionary from words to IDs), but until build_vocab is run once, the TEXT object has no vocab member variable and raises the error 'Field' object has no attribute 'vocab'. So we first create a throwaway vocabulary with build_vocab, and then overwrite it with BERT's vocabulary.
My comment means that we want to execute
TEXT.vocab.stoi = vocab_bert
without executing TEXT.build_vocab(),
because the vocabulary built by build_vocab is immediately overridden by vocab_bert anyway.
So your observation that
The [UNK] token ids before and after are different.
is right, and moreover, the ids of the other word tokens are different as well.
The reason I nevertheless executed TEXT.build_vocab(train_ds, min_freq=1)
is that the TEXT object has no vocab member variable initially,
and we need to create that member variable in some way.
I chose to run the wasteful call TEXT.build_vocab(train_ds, min_freq=1)
not to build the correct vocabulary, but simply to create the vocab member variable
on the TEXT object.
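The pattern above can be illustrated without torchtext at all. The following is a minimal pure-Python sketch (the Field and Vocab classes here are stand-ins, not the real torchtext classes): assigning to TEXT.vocab.stoi before build_vocab fails with AttributeError, and the throwaway build_vocab call exists only to create the attribute so it can be overwritten.

```python
class Vocab:
    """Stand-in for torchtext's Vocab: just holds the stoi dictionary."""
    def __init__(self, stoi):
        self.stoi = stoi

class Field:
    """Stand-in for torchtext.legacy.data.Field (illustration only):
    the `vocab` attribute is created only inside build_vocab()."""
    def build_vocab(self, tokens):
        self.vocab = Vocab({t: i for i, t in enumerate(dict.fromkeys(tokens))})

TEXT = Field()
try:
    TEXT.vocab.stoi = {'[UNK]': 100}            # fails: no vocab attribute yet
except AttributeError as e:
    print(e)                                    # ... has no attribute 'vocab'

TEXT.build_vocab(['any', 'throwaway', 'text'])  # throwaway vocabulary
TEXT.vocab.stoi = {'[UNK]': 100, '[CLS]': 101}  # now the overwrite works
print(TEXT.vocab.stoi)
```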
Sincerely, Yutaro Ogawa
@YutaroOgawa Thank you for replying. That's not what I'm pointing out.
import collections
import torchtext

samples = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
TEXT = torchtext.data.Field(sequential=True, batch_first=True, include_lengths=False, unk_token='[UNK]')
fields = [('text', TEXT)]
sequences = []
for s in samples:
    sequences.append(torchtext.data.Example.fromlist([s], fields))
for s in sequences:
    print(s.text)
mydataset = torchtext.data.Dataset(sequences, fields)  # Examples -> create Dataset
TEXT.build_vocab(mydataset, min_freq=1, max_size=10000)
print(TEXT.vocab.stoi)
mydataset = torchtext.data.Iterator(dataset=mydataset, batch_size=3, shuffle=False)
for d in mydataset:
    print(d.text.numpy())
my_vocab = {'[UNK]': 100, '<pad>': 1, 'document': 2, 'the': 3, '.': 4, 'is': 5, 'This': 6, 'first': 7, 'this': 8}
mode = 0
if mode == 0:  # Error
    TEXT.vocab.stoi = my_vocab
else:  # Good!!
    my_vocab_dict = collections.defaultdict(lambda: 100)
    my_vocab_dict.update(my_vocab)
    TEXT.vocab.stoi = my_vocab_dict
for d in mydataset:
    print(d.text.numpy())
If mode == 0, an error occurs (a plain dict raises KeyError for any token not in my_vocab); if mode == 1, there are no errors.
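The difference between the two modes comes down to how Python dictionaries handle missing keys: numericalization looks each token up in vocab.stoi, so a plain dict raises KeyError for any token absent from my_vocab, while a defaultdict silently maps it to the [UNK] id. A standalone sketch of that mechanism, without torchtext:

```python
import collections

my_vocab = {'[UNK]': 100, '<pad>': 1, 'document': 2, 'the': 3, '.': 4,
            'is': 5, 'This': 6, 'first': 7, 'this': 8}

tokens = ['This', 'is', 'the', 'second', 'document', '.']  # 'second' is unseen

# mode == 0: plain dict -- looking up an unseen token raises KeyError
try:
    ids = [my_vocab[t] for t in tokens]
except KeyError as e:
    print('KeyError:', e)              # KeyError: 'second'

# mode == 1: defaultdict -- unseen tokens fall back to the [UNK] id (100)
my_vocab_dict = collections.defaultdict(lambda: 100)
my_vocab_dict.update(my_vocab)
ids = [my_vocab_dict[t] for t in tokens]
print(ids)                             # [6, 5, 3, 100, 2, 4]
```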
@hccho2 -san
I appreciate your very specific explanation with code.
I understand that executing torchtext code such as
TEXT = torchtext.legacy.data.Field(sequential=True, batch_first=True, include_lengths=False, unk_token='[UNK]')
assigns the [UNK] token to ID 0.
But in bert-base-uncased-vocab.txt, the [UNK] token's ID is 100, and ID 101 is [CLS].
https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt
I understand your point. My answer is that in 8-2-3_bert_base.ipynb and 8-4_bert_IMDb, I tokenized the documents only after overriding torchtext's vocab.stoi as below.
Thus, torchtext's original vocab.stoi is completely ignored.
Your example code in the comment above does not re-tokenize with the new vocab object after overriding the vocab.
Could this answer help you?
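The order of operations in the notebooks can be sketched as follows. This is a hedged illustration, not the notebook code itself: [UNK]=100 and [CLS]=101 match bert-base-uncased-vocab.txt as stated above, but the word ids in vocab_bert here are made up.

```python
import collections

# Illustrative BERT-style vocabulary (word ids are invented for this sketch)
vocab_bert = {'[PAD]': 0, '[UNK]': 100, '[CLS]': 101, '[SEP]': 102,
              'this': 2023, 'is': 2003}

# Wrap it in a defaultdict so unknown tokens map to the [UNK] id,
# then this replaces TEXT.vocab.stoi
stoi = collections.defaultdict(lambda: vocab_bert['[UNK]'])
stoi.update(vocab_bert)

# Tokenizing AFTER the overwrite means every lookup goes through the
# BERT vocab; the throwaway torchtext vocab never influences the ids.
tokens = ['[CLS]', 'this', 'is', 'sparta', '[SEP]']   # 'sparta' is unknown
ids = [stoi[t] for t in tokens]   # what numericalization does internally
print(ids)                        # [101, 2023, 2003, 100, 102]
```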
Thanks for the explanation.
Converting new sentences containing unknown words (tokens) with the tokenizer
causes no problems.
But converting new sentences containing unknown words (tokens) with TEXT.numericalize
can cause errors.
Anyway, I understood your explanation.
Thank you.
Thank you for giving us a lot of important information and feedback.
Sincerely,
TEXT.vocab.stoi is overwritten by vocab_bert. The [UNK] token ids before and after are different.
suggestion: