dmlc / gluon-nlp

NLP made easy
https://nlp.gluon.ai/
Apache License 2.0

Difference between embeddings Gluon and Huggingface #1203

Open evah88 opened 4 years ago

evah88 commented 4 years ago

We have a BERT model that we trained from scratch on a proprietary dataset using Huggingface. I'm trying to port it to the GluonNLP version of BERT and roughly followed the conversion script: specifically, we found the matching parameter names and then copied the model weights over. The output of the converted Gluon model is different from our original Huggingface model, so I'm trying to debug where they diverge.

To simplify, I calculated the embeddings of the sentence "Hello, my dog is cute" using the pretrained BERT models from GluonNLP and Huggingface, and those encodings are different as well.

Code to calculate GluonNLP embeddings:

import gluonnlp as nlp
import mxnet as mx

# pretrained BERT base from the GluonNLP model zoo
model, vocab = nlp.model.get_model('bert_12_768_12', dataset_name='book_corpus_wiki_en_uncased',
                                   use_classifier=False, use_decoder=False)
tokenizer = nlp.data.BERTTokenizer(vocab, lower=True)
transform = nlp.data.BERTSentenceTransform(tokenizer, max_seq_length=512, pair=False, pad=False)
sample = transform(['Hello, my dog is cute'])
words, valid_len, segments = mx.nd.array([sample[0]]), mx.nd.array([sample[1]]), mx.nd.array([sample[2]])
seq_encoding, cls_encoding = model(words, segments, valid_len)

Code to calculate Huggingface embeddings:

from transformers import BertModel, BertTokenizer
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute", add_special_tokens=True)).unsqueeze(0)  # batch size 1
outputs = model(input_ids)

last_hidden_states = outputs[0]

The result is that seq_encoding and last_hidden_states are very different. Any suggestions on what we're missing?
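For reference, this is roughly how I compare the two outputs (a sketch, assuming both snippets above were run in the same Python session so that seq_encoding and last_hidden_states are both in scope):

import numpy as np

gluon_out = seq_encoding.asnumpy()[0]               # GluonNLP output, shape (8, 768)
torch_out = last_hidden_states.detach().numpy()[0]  # Huggingface output, shape (8, 768)
print('max abs difference:', np.abs(gluon_out - torch_out).max())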

szhengac commented 4 years ago

I think GluonNLP's vocab has a different ordering from Huggingface's, so it will be problematic if we simply copy the entire embedding matrix. @eric-haibin-lin.
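Something along these lines would be needed when copying the weights over (a rough sketch; hf_embedding and hf_token_to_id are hypothetical names for the Huggingface word-embedding matrix and its token-to-id dict, and vocab is the GluonNLP vocab):

import numpy as np

# build the Gluon-ordered embedding matrix row by row from the HF rows
reordered = np.zeros_like(hf_embedding)
for gluon_id, token in enumerate(vocab.idx_to_token):
    hf_id = hf_token_to_id[token]   # assumes every Gluon token string exists in the HF vocab
    reordered[gluon_id] = hf_embedding[hf_id]
# `reordered` is what should go into the Gluon word-embedding parameter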

evah88 commented 4 years ago

The word embeddings of the inputs are identical, so the difference happens somewhere downstream.

eric-haibin-lin commented 4 years ago

For the bert base uncased model, the vocab mapping is different, so the embedding weights need to be shuffled accordingly. @evah88 did you print the word ids in the batch and compare the two?
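For example (a sketch reusing sample and input_ids from the two snippets above):

print('Gluon ids:', list(sample[0]))
print('HF ids:   ', input_ids[0].tolist())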

devsentient commented 4 years ago

@eric-haibin-lin please take a look here: https://colab.research.google.com/drive/1eZAHtpAP5bzz4PA_gX5HeJyiB0GtaPWX#scrollTo=-5X58oHMFu9P

It looks like the token IDs are the same for Gluon and HuggingFace.

eric-haibin-lin commented 4 years ago

The bos and eos token ids are different:

HF:    [101, 7592, 1010, 2026, 3899, 2003, 10140, 102]
Gluon: [  2, 7592, 1010, 2026, 3899, 2003, 10140,   3]
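They can be looked up directly in the two vocabs (a sketch; hf_tokenizer stands for the Huggingface tokenizer from the earlier snippet):

print(vocab['[CLS]'], vocab['[SEP]'])                            # Gluon: expected 2, 3
print(hf_tokenizer.vocab['[CLS]'], hf_tokenizer.vocab['[SEP]'])  # HF: expected 101, 102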

devsentient commented 4 years ago

Yes, but these are the same tokens, and we pass the sentence through the Gluon and HF tokenizers respectively, so shouldn't the output be the same? Why would the same sentence encode differently?

szhengac commented 4 years ago

I think @eric-haibin-lin means that embedding_gluonnlp[2] != embedding_hf[101], since you simply copied the embedding matrix without reordering it.
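A quick way to check that on a converted model (a sketch; gluon_embed and hf_embed are hypothetical names for the two word-embedding matrices as NumPy arrays):

import numpy as np

# row 2 in the Gluon vocab and row 101 in the HF vocab should both hold the [CLS] embedding,
# which is only true if the rows were reordered when the matrix was copied
print(np.allclose(gluon_embed[2], hf_embed[101], atol=1e-6))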

devsentient commented 4 years ago

In this example we didn't copy the matrix. We loaded the pretrained models from each framework directly.

In our proprietary use case the token mapping is the same between HF and Gluon, and the parameter matrix is copied, but the output is still different.

carter54 commented 4 years ago

@devsentient @evah88 I'm also trying to transfer a model from Huggingface to MXNet by matching parameter names, but in my case it's GPT-2. Initially I also got different results from Huggingface and MXNet, and it took me two days to figure out the reason... In the GPT-2 model, Huggingface uses a Conv1D layer to do the matrix projection (x*weight + bias), while GluonNLP uses a Dense layer. Thus, the weight matrix has to be transposed when transferring the parameters. In GPT-2, I did this for the following mapping:

'transformer.h.(\d+).attn.c_proj.weight': '_self_attention_layers.{}._out_proj.weight'.

There might be a similar trick in the BERT model.
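For reference, the copy for that projection looked roughly like this (a sketch; hf_state_dict and gluon_params are hypothetical names for the PyTorch state dict and the collected MXNet parameters, shown for layer 0):

import mxnet as mx

# HF's Conv1D stores the weight as (in_features, out_features), while a GluonNLP
# Dense layer expects (out_features, in_features), hence the transpose
hf_weight = hf_state_dict['transformer.h.0.attn.c_proj.weight'].numpy()
gluon_params['_self_attention_layers.0._out_proj.weight'].set_data(mx.nd.array(hf_weight.T))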

andreas-solti commented 3 years ago

Any updates on this? @carter54 @evah88 I'm also struggling with using the weights of BERT in Gluon. I found this approach for converting from DistilBERT: https://nlp.gluon.ai/v0.9.x/model_zoo/conversion_tools/index.html and adjusted the mapping for BERT.

Does this mapping seem to be right (Mapping proposal from Gluon to Pytorch BERT)? Which weight layers need to be transposed? Do some need to be reversed?

szha commented 3 years ago

@andreas-solti the mapping looks correct. I don't think there's a need to transpose the weights. The embedding weight indices need to be shuffled because:

The bos and eos token ids are different:

HF:    [101, 7592, 1010, 2026, 3899, 2003, 10140, 102]
Gluon: [  2, 7592, 1010, 2026, 3899, 2003, 10140,   3]

andreas-solti commented 3 years ago

@szha Thanks a lot for your feedback! Could you please elaborate on the shuffling of the weights? I've tried swapping positions 2 and 101, and 3 and 102, respectively. The embedding result is almost accurate, but not quite. The masked language model head also produces a different ordering among the lower-probability results.

Would be really helpful!

leezu commented 3 years ago

The embedding result is almost accurate, but not quite.

What's the largest difference? Is it larger than 1E-3, 1E-4 or 1E-5?

andreas-solti commented 3 years ago

It is larger than 1E-3 (thanks @leezu for asking!): https://gist.github.com/andreas-solti/43db715d33cb0157b2c535b41dd4573c

The classification layer on top amplifies these differences further. While mapping the parameter layers I found that two layers in BERT are identical: word_embed.0.weight == decoder.3.weight. Do both need to be "swapped"?
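If they are tied, I assume that whatever reordering is applied to one also has to be applied to the other, roughly like this (a sketch; reordered is the reindexed embedding matrix and params the collected Gluon parameters, hypothetical names):

import mxnet as mx

# keep the tied parameters consistent: both get the same reordered matrix
params['word_embed.0.weight'].set_data(mx.nd.array(reordered))
params['decoder.3.weight'].set_data(mx.nd.array(reordered))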

andreas-solti commented 3 years ago

Here is a reproducible example notebook that translates a German BERT to MXNet:

https://gist.github.com/andreas-solti/4222c389b8be139e597eccc8350c034b

The output classes look fine in terms of ordering. With the weights and inputs being exactly the same, I wonder where the smaller and larger differences come from.