Open hccho2 opened 2 years ago
@hccho2 -san
I think you read the Korean version of the book. I'm glad my book is being read outside of Japan.
I appreciate your valuable suggestion.
As you mentioned, the code is a bit wasteful and confusing. I agree, and I commented on it in the code above this section as follows (translated from the original Japanese):

Ideally we would simply set TEXT.vocab.stoi = vocab_bert (stoi means string_to_ID, a dictionary from words to IDs), but until build_vocab is run once, the TEXT object has no vocab member variable and raises the error 'Field' object has no attribute 'vocab'. So we first create a throwaway vocabulary with build_vocab, and then overwrite it with BERT's vocabulary.
My comment means that we want to execute
TEXT.vocab.stoi = vocab_bert
without executing TEXT.build_vocab(),
because the vocabulary built by build_vocab is immediately overridden by vocab_bert anyway.
So your observation that
The [UNK] token ids before and after are different.
is right, and moreover, the ids of the other word tokens are different as well.
The reason I nevertheless executed TEXT.build_vocab(train_ds, min_freq=1)
is that the TEXT object has no vocab member variable initially,
and we need to create that member variable in some way.
I chose to run the wasteful call TEXT.build_vocab(train_ds, min_freq=1)
not to build the correct vocabulary, but simply to create the vocab member variable
on the TEXT object.
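The pattern above can be illustrated without torchtext at all. The following is a minimal pure-Python sketch (the Field and Vocab classes here are stand-ins, not the real torchtext classes): assigning to TEXT.vocab.stoi before build_vocab fails with AttributeError, and the throwaway build_vocab call exists only to create the attribute so it can be overwritten.

```python
class Vocab:
    """Stand-in for torchtext's Vocab: just holds the stoi dictionary."""
    def __init__(self, stoi):
        self.stoi = stoi

class Field:
    """Stand-in for torchtext.legacy.data.Field (illustration only):
    the `vocab` attribute is created only inside build_vocab()."""
    def build_vocab(self, tokens):
        self.vocab = Vocab({t: i for i, t in enumerate(dict.fromkeys(tokens))})

TEXT = Field()
try:
    TEXT.vocab.stoi = {'[UNK]': 100}            # fails: no vocab attribute yet
except AttributeError as e:
    print(e)                                    # ... has no attribute 'vocab'

TEXT.build_vocab(['any', 'throwaway', 'text'])  # throwaway vocabulary
TEXT.vocab.stoi = {'[UNK]': 100, '[CLS]': 101}  # now the overwrite works
print(TEXT.vocab.stoi)
```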
Sincerely, Yutaro Ogawa
@YutaroOgawa Thank you for replying. That's not what I'm pointing out.
import collections
import torchtext

samples = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
TEXT = torchtext.data.Field(sequential=True, batch_first=True, include_lengths=False, unk_token='[UNK]')
fields = [('text', TEXT)]
sequences = []
for s in samples:
    sequences.append(torchtext.data.Example.fromlist([s], fields))
for s in sequences:
    print(s.text)
mydataset = torchtext.data.Dataset(sequences, fields)  # Examples -> create Dataset
TEXT.build_vocab(mydataset, min_freq=1, max_size=10000)
print(TEXT.vocab.stoi)
mydataset = torchtext.data.Iterator(dataset=mydataset, batch_size=3, shuffle=False)
for d in mydataset:
    print(d.text.numpy())
my_vocab = {'[UNK]': 100, '<pad>': 1, 'document': 2, 'the': 3, '.': 4, 'is': 5, 'This': 6, 'first': 7, 'this': 8}
mode = 0
if mode == 0:  # Error
    TEXT.vocab.stoi = my_vocab
else:  # Good!!
    my_vocab_dict = collections.defaultdict(lambda: 100)
    my_vocab_dict.update(my_vocab)
    TEXT.vocab.stoi = my_vocab_dict
for d in mydataset:
    print(d.text.numpy())
If mode == 0, an error occurs (a plain dict raises KeyError for any token not in my_vocab); if mode == 1, there are no errors.
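The difference between the two modes comes down to how Python dictionaries handle missing keys: numericalization looks each token up in vocab.stoi, so a plain dict raises KeyError for any token absent from my_vocab, while a defaultdict silently maps it to the [UNK] id. A standalone sketch of that mechanism, without torchtext:

```python
import collections

my_vocab = {'[UNK]': 100, '<pad>': 1, 'document': 2, 'the': 3, '.': 4,
            'is': 5, 'This': 6, 'first': 7, 'this': 8}

tokens = ['This', 'is', 'the', 'second', 'document', '.']  # 'second' is unseen

# mode == 0: plain dict -- looking up an unseen token raises KeyError
try:
    ids = [my_vocab[t] for t in tokens]
except KeyError as e:
    print('KeyError:', e)              # KeyError: 'second'

# mode == 1: defaultdict -- unseen tokens fall back to the [UNK] id (100)
my_vocab_dict = collections.defaultdict(lambda: 100)
my_vocab_dict.update(my_vocab)
ids = [my_vocab_dict[t] for t in tokens]
print(ids)                             # [6, 5, 3, 100, 2, 4]
```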
@hccho2 -san
I appreciate your very specific explanation with code.
I understand that executing torchtext code such as
TEXT = torchtext.legacy.data.Field(sequential=True, batch_first=True, include_lengths=False, unk_token='[UNK]')
assigns the [UNK] token to ID 0.
But in bert-base-uncased-vocab.txt, the [UNK] token's ID is 100, and ID 101 is [CLS].
https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt
I understand your point. My answer is that in 8-2-3_bert_base.ipynb and 8-4_bert_IMDb, I tokenized the documents only after overriding torchtext's vocab.stoi as below.
Thus, torchtext's original vocab.stoi is completely ignored.
Your example code in the comment above does not re-tokenize with the new vocab object after overriding the vocab.
Could this answer help you?
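The order of operations in the notebooks can be sketched as follows. This is a hedged illustration, not the notebook code itself: [UNK]=100 and [CLS]=101 match bert-base-uncased-vocab.txt as stated above, but the word ids in vocab_bert here are made up.

```python
import collections

# Illustrative BERT-style vocabulary (word ids are invented for this sketch)
vocab_bert = {'[PAD]': 0, '[UNK]': 100, '[CLS]': 101, '[SEP]': 102,
              'this': 2023, 'is': 2003}

# Wrap it in a defaultdict so unknown tokens map to the [UNK] id,
# then this replaces TEXT.vocab.stoi
stoi = collections.defaultdict(lambda: vocab_bert['[UNK]'])
stoi.update(vocab_bert)

# Tokenizing AFTER the overwrite means every lookup goes through the
# BERT vocab; the throwaway torchtext vocab never influences the ids.
tokens = ['[CLS]', 'this', 'is', 'sparta', '[SEP]']   # 'sparta' is unknown
ids = [stoi[t] for t in tokens]   # what numericalization does internally
print(ids)                        # [101, 2023, 2003, 100, 102]
```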
Thanks for the explanation.
Converting new sentences containing unknown words (tokens) with the tokenizer
causes no problems.
But converting new sentences containing unknown words (tokens) with TEXT.numericalize
can cause errors.
Anyway, I understood your explanation.
Thank you.
Thank you for giving us a lot of important information and feedback.
Sincerely,
TEXT.vocab.stoi is overwritten by vocab_bert. The [UNK] token ids before and after are different.
suggestion: