ChiuHsin opened this issue 5 years ago
Hi, there are some special tokens in the vocabulary (for example, BOS stands for Beginning Of Sentence), and we can put them either at the beginning of the lookup table (embedding) or at the end. I decided to put them at the beginning. As for the UNUSED_COUNT, you can check the vocab files of the pretrained BERT models.
Ah, you might be confused about their usage, right? Say you want to feed a sentence into your network: you have to add the BOS and EOS tokens to your sentence, so you need to know where they live in the embedding table.
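As a rough illustration (a minimal sketch only: placing the special rows at vocab_size + OFFSET is assumed from the load_google_bert line quoted below, and wrap_with_specials is a hypothetical helper, not part of BERT-keras):

```python
# Hypothetical sketch: wrap a tokenized sentence with BOS/EOS ids so the
# model can look up their rows in the embedding table. Placing the special
# rows at vocab_size + OFFSET is an assumption based on the indexing quoted
# later in this thread.
BOS_OFFSET = 2
EOS_OFFSET = 4

def wrap_with_specials(token_ids, vocab_size):
    bos_id = vocab_size + BOS_OFFSET  # assumed row of the BOS embedding
    eos_id = vocab_size + EOS_OFFSET  # assumed row of the EOS embedding
    return [bos_id] + token_ids + [eos_id]

# e.g. wrap_with_specials([10, 42, 7], vocab_size=30000)
# -> [30002, 10, 42, 7, 30004]
```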
I see, but when I load a Google BERT model with load_google_bert, the vocab size is computed as vocab_size = vocab_size - TextEncoder.BERT_SPECIAL_COUNT - TextEncoder.BERT_UNUSED_COUNT, and the indices don't match when w_id == 2: the line 'weights[w_id][vocab_size + TextEncoder.EOS_OFFSET] = saved[3 + TextEncoder.BERT_UNUSED_COUNT]' cannot load the weight.
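(For context, in Google's standard uncased vocab.txt the rows are laid out as [PAD] = 0, [unused0]..[unused98] = 1-99, [UNK] = 100, [CLS] = 101, [SEP] = 102, [MASK] = 103, so saved[3 + BERT_UNUSED_COUNT] is the [SEP] row. The mapping below is a sketch inferred from those constants, not the repository's actual loader code.)

```python
# Assumed layout of the relevant rows in Google's uncased vocab.txt and
# which TextEncoder special each one would feed (sketch only, not the
# actual load_google_bert code).
BERT_UNUSED_COUNT = 99  # [unused0] .. [unused98] occupy rows 1..99

google_rows = {
    0: "[PAD]",                       # -> PAD (assumed)
    2 + BERT_UNUSED_COUNT: "[CLS]",   # row 101 -> BOS (assumed)
    3 + BERT_UNUSED_COUNT: "[SEP]",   # row 102 -> EOS (the line quoted above)
    4 + BERT_UNUSED_COUNT: "[MASK]",  # row 103 -> MSK (assumed)
}
```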
@ChiuHsin I guess you are right, and it seems that you were able to solve it (based on the other issue you posted). Can you please send a pull request to fix this problem? Thanks!
When I use BERT-keras, I don't understand this part:
```python
class TextEncoder:
    PAD_OFFSET = 0
    MSK_OFFSET = 1
    BOS_OFFSET = 2
    DEL_OFFSET = 3  # delimiter
    EOS_OFFSET = 4
    SPECIAL_COUNT = 5
    NUM_SEGMENTS = 2
    BERT_UNUSED_COUNT = 99  # bert pretrained models
    BERT_SPECIAL_COUNT = 4  # they don't have DEL
```
Why is it set up like this? And what are BERT_UNUSED_COUNT = 99 and BERT_SPECIAL_COUNT = 4 used for in load_google_bert?
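(As a rough sketch of how those two constants enter load_google_bert, based only on the formula quoted earlier in this thread; google_vocab_size = 30522 is the uncased BERT-Base vocab and is used here purely for illustration.)

```python
# Sketch only: drop the first block of [unusedN] rows and the 4 BERT special
# rows from the checkpoint's vocab count, per the formula quoted above.
BERT_UNUSED_COUNT = 99   # first block of [unusedN] rows in Google's vocab.txt
BERT_SPECIAL_COUNT = 4   # BERT has 4 of the 5 TextEncoder specials (no DEL)

google_vocab_size = 30522  # uncased BERT-Base, for illustration
vocab_size = google_vocab_size - BERT_SPECIAL_COUNT - BERT_UNUSED_COUNT
print(vocab_size)  # 30419; the TextEncoder specials are then addressed as
                   # vocab_size + OFFSET (e.g. vocab_size + EOS_OFFSET above)
```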