Separius / BERT-keras

Keras implementation of BERT with pre-trained weights
GNU General Public License v3.0

Possible inconsistency in how ids are used for text encoders #11

Closed · teoh closed this 5 years ago

teoh commented 5 years ago

The pad_id as defined in data/vocab.py seems inconsistent with BERT's pad id.

Let's say I run:

bert_text_encoder = BERTTextEncoder(vocab_file = './google_bert/model/uncased_L-12_H-768_A-12/vocab.txt')

(In this example, the vocab file lives inside the google_bert submodule, under a directory called model.)

I can get pad_id with:

bert_text_encoder.pad_id

which gives me 30522.

BUT if I dig into the BERT tokenizer I get something else:

bert_text_encoder.tokenizer.inv_vocab[0]

which gives me '[PAD]', suggesting that the real id of the pad token is 0.
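
For reference, here is the whole check in one place (a small sketch; it assumes BERTTextEncoder is importable from data/vocab.py, as referenced above, and uses the same vocab path as my example):

    from data.vocab import BERTTextEncoder  # assumed import path, per data/vocab.py referenced above

    encoder = BERTTextEncoder(vocab_file='./google_bert/model/uncased_L-12_H-768_A-12/vocab.txt')

    # The TextEncoder base class reports one pad id ...
    print(encoder.pad_id)                  # 30522

    # ... while the wrapped BERT tokenizer maps id 0 to the pad token.
    print(encoder.tokenizer.inv_vocab[0])  # '[PAD]'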

In short, the id definitions in the TextEncoder base class look like they disagree with the BERT tokenizer's use of ids.

DUTANGx commented 5 years ago

also, all ids are shifted in data/vocab.py

Separius commented 5 years ago

Hi @teoh, if I remember correctly I handled the inconsistency in data/vocab.py with the standardize_ids() method; did you check that? But I totally agree that this hack is not great and breaks the abstraction; maybe I should make pad_id private (_pad_id)?

Separius commented 5 years ago

hmm, I read my code again and I think the pad_id is right; yeah it's different from the tf bert model but it's consistent with the corresponding embedding layer

teoh commented 5 years ago

Hi @Separius

I handled the inconsistency in data/vocab.py with the standardize_ids() method, did you check that?

In standardize_ids(), I think you're referring to this line:

ids[i] -= self.bert_msk_id

In the example at the top, bert_text_encoder.bert_msk_id is 103. If '[PAD]' is passed into standardize_ids(), then, since its original id (from BERT's tokenizer) is 0, standardizing it will give a negative id.
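
Spelled out, the arithmetic on the pad token looks like this (a toy illustration of the line above, not the repo's actual code path):

    bert_msk_id = 103              # bert_text_encoder.bert_msk_id in the example at the top
    bert_pad_id = 0                # id of '[PAD]' according to BERT's own tokenizer

    standardized_pad_id = bert_pad_id - bert_msk_id
    print(standardized_pad_id)     # -103, a negative (and therefore invalid) id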

Anyways, looking at this closer, I see two larger issues:

  1. I don't think '[PAD]' even makes it into the output of encode().
  2. encode() doesn't encode a sentence according to Google BERT's convention. I'm referring to this comment.

Here's the definition of encode() for ease of reference:

def encode(self, sent: str) -> List[int]:
    return self.standardize_ids(self.tokenizer.convert_tokens_to_ids(self.tokenizer.tokenize(sent)))

If I want to encode 'my dog is hairy.', I can call bert_text_encoder.encode('my dog is hairy.'), which returns [1923, 3796, 1900, 15789, 909]. I'm guessing this represents the token list ['my', 'dog', 'is', 'hairy', '.'].

If we follow the convention from Google BERT (example code) and use their FullTokenizer, we would be encoding ['[CLS]', 'my', 'dog', 'is', 'hairy', '.', '[SEP]'], which gives us [101, 2026, 3899, 2003, 15892, 1012, 102]. Note that the tokenize() method of Google's FullTokenizer does not automatically add the special tokens '[CLS]' and '[SEP]'.
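
For comparison, this is roughly what that convention looks like with Google's tokenizer (a sketch; the import path assumes the google_bert submodule layout from my example above, with FullTokenizer coming from Google's tokenization.py):

    from google_bert import tokenization  # assumed path to Google's tokenization.py in the submodule

    tokenizer = tokenization.FullTokenizer(
        vocab_file='./google_bert/model/uncased_L-12_H-768_A-12/vocab.txt',
        do_lower_case=True)

    # FullTokenizer.tokenize() does not add the special tokens, so we add them by hand.
    tokens = ['[CLS]'] + tokenizer.tokenize('my dog is hairy.') + ['[SEP]']
    print(tokenizer.convert_tokens_to_ids(tokens))
    # [101, 2026, 3899, 2003, 15892, 1012, 102]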

From this I have a few questions:

  1. Do you plan to include the '[CLS]' and '[SEP]' tokens in the BERTTextEncoder class?
  2. Which encoding (i.e. mapping of tokens to ids) is the correct one to use for your Keras implementation of BERT? Your standardized ones, or the ones used in the google_bert repo?
  3. If it's your standardized ones, then how does the Embedding layer account for this? You mention that:

    yeah it's different from the tf bert model but it's consistent with the corresponding embedding layer

but given that (i) we're loading the weights from the Google BERT models directly into the model (transformer/model), and (ii) I can't seem to find any special code in transformer/embedding.py that maps the standardized encoding back to the BERT encoding, I'm not sure these standardized encodings really are consistent with the corresponding embedding layer.
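
In other words, if the encoder hands out standardized ids, then somewhere the pretrained embedding rows would have to be reordered to match them. Purely as an illustration of that idea (hypothetical code, not from this repo; the function name and the new_to_old_id mapping are made up):

    import numpy as np

    def remap_embedding_rows(bert_embeddings: np.ndarray, new_to_old_id) -> np.ndarray:
        """Reorder pretrained BERT embedding rows so that row i corresponds to standardized id i.

        bert_embeddings: (vocab_size, hidden_size) matrix taken from the Google checkpoint.
        new_to_old_id:   hypothetical mapping where new_to_old_id[i] is the original BERT id
                         behind standardized id i.
        """
        return bert_embeddings[np.asarray(new_to_old_id)]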

Thanks!

Separius commented 5 years ago

Hey @teoh, I guess you are right and I might have missed some tokens; I will check that again soon. Regarding including the [CLS] token: that's part of the LMGenerator class; both the single-sentence and double-sentence generators add the special tokens. And Embedding does not handle the transformation; it's handled in the loading method here. Thanks again for your clear explanation, it's really helpful :+1:

teoh commented 5 years ago

I also forgot to mention: thank you for putting this repo together!! I couldn't find many other BERT Keras implementations that would let me do fine-tuning, so I think this will be really useful to a lot of people.

I might have missed some tokens; I will check that again soon.

Got it, thanks!

Regarding including the [CLS] token: that's part of the LMGenerator class; both the single-sentence and double-sentence generators add the special tokens.

I'll take a closer look at this and will let you know if I have any questions.

And Embedding does not handle the transformation; it's handled in the loading method here.

I see that you're referring to transformer.load.load_google_bert(), which looks really similar to transformer.model.load_bert() (and which was originally the one I was looking into, motivating question (3) above). What is the difference between these two functions?

Thanks again for your clear explanation, it's really helpful 👍

Thanks! I spent half an hour writing that because the time difference (I'm writing this from San Francisco) makes every message count.

I'm sure you're already really busy maintaining this project and others. I have some suggestions that I think would really help people using this repo -- feel free to assign levels of priority that you feel are appropriate.

  1. More documentation on the functions and classes in the transformer module, and appropriate uses, e.g. transformer.load.load_google_bert() vs. transformer.model.load_bert(). I know you have a tutorial.ipynb to show this, but the BERT-specific things are only clear when you spend a long time digging through your module files. Most people using this repo will only be interested in the BERT stuff -- after all, your project is called "BERT-keras" ;)
  2. Have an option for loading BERT using embeddings that are BERT-specific, without any standardization. I realize the standardization makes it easier to compare with the other models (OpenAI, etc), but assuming people come to this repo just for BERT, this is an opportunity to make things simpler.

Thank you for being receptive to all the feedback and answering my questions. Keep up the great work!

Separius commented 5 years ago

You are absolutely right, I have to add a lot of documentation for the BERT part. I think I will have some time tomorrow to do that. Once again, thanks for your complete explanations.

Separius commented 5 years ago

Hey @teoh, I removed the unused loading function that was confusing you, added some comments to the BERT loading function, and added a link to the BERT notebook. I hope it helps!