allenai / allennlp

An open-source NLP research library, built on PyTorch.
http://www.allennlp.org
Apache License 2.0

Add BERT #1901

Closed robbine closed 5 years ago

robbine commented 5 years ago

Google proposed a pre-training method called BERT (https://arxiv.org/pdf/1810.04805.pdf). I think it would be a good idea to implement it and compare it with ELMo.

schmmd commented 5 years ago

Thanks--we've had this request internally as well. PRs welcome ;-)

threefoldo commented 5 years ago

It would be better to support multiple languages. I could help with Chinese.

matt-gardner commented 5 years ago

@threefoldo, if you want to add pre-trained language models for other languages that are compatible with our code, we'd be super happy to host them alongside our English ELMo model and highlight the work that you did. It's not our priority to train models in other languages just for the sake of supporting other languages, but we try to make our tools language-agnostic so other people can use them in their favorite language, and if you want to contribute back what you've done with it, all the better.

threefoldo commented 5 years ago

Thanks, @matt-gardner. I will let you know when I have something.

hzeng-otterai commented 5 years ago

A BERT implementation in pytorch. https://github.com/codertimo/BERT-pytorch

matt-gardner commented 5 years ago

@handsomezebra: We could do that level of implementation quite simply in AllenNLP - we just need a dataset reader that constructs data with the particular sampling, and a very simple model that has the correct losses. The more important "BERT implementation" will be actually loading the trained model that Google provides - actually training the model is the piece of work that no one wants to replicate themselves (as it's ~$30k just to train the thing once). If anyone does come up with some pytorch code that can load the model once Google releases it, that'd be great to know about.
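
For concreteness, here is a rough, hypothetical PyTorch sketch of what "a very simple model that has the correct losses" could look like: the masked-LM and next-sentence-prediction heads on top of an arbitrary encoder's output. The class and argument names are made up for illustration, not taken from AllenNLP or the BERT code.

```python
import torch
import torch.nn as nn

class BertPretrainingHeads(nn.Module):
    """Masked-LM and next-sentence losses on top of an arbitrary encoder (hypothetical sketch)."""
    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.mlm_head = nn.Linear(hidden_size, vocab_size)   # predict masked tokens
        self.nsp_head = nn.Linear(hidden_size, 2)             # next-sentence prediction
        self.loss_fn = nn.CrossEntropyLoss(ignore_index=-100)

    def forward(self, sequence_output, mlm_labels, nsp_labels):
        # sequence_output: (batch, seq_len, hidden); mlm_labels: (batch, seq_len) with -100
        # at unmasked positions; nsp_labels: (batch,)
        mlm_logits = self.mlm_head(sequence_output)
        nsp_logits = self.nsp_head(sequence_output[:, 0])      # first ([CLS]) token
        mlm_loss = self.loss_fn(mlm_logits.view(-1, mlm_logits.size(-1)), mlm_labels.view(-1))
        nsp_loss = self.loss_fn(nsp_logits, nsp_labels)
        return mlm_loss + nsp_loss
```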

robbine commented 5 years ago

I think it's still worth adding the particular dataset reader to AllenNLP, even though training BERT from scratch is not encouraged. I find that there are 10 different ways to implement the multi-head attention computation, and a very simple model is not likely to perform well, since "the devil is in the details".

matt-gardner commented 5 years ago

@robbine, we'd be happy to take contributions for a BERT dataset reader. It should be pretty straightforward for anyone that wants to add it (with the most complex parts being sure that you actually get high enough throughput for training to be practical, and making the sampling different at every epoch, if that matters). I very much doubt that we will do it ourselves, however - we're focusing on other things at this point.

You're right about there being a ton of ways to implement a transformer architecture, though, which is what I was talking about in my comment above. This is part of the model, not the dataset reader, and we have to wait until Google releases their model and their code to know exactly which one they used, so we can load their pre-trained model. Getting this part into AllenNLP will certainly be high priority for us, though hopefully we'll rely on someone else doing the initial port to pytorch (like the HuggingFace team did with the OpenAI model) - we're a small team and can't do everything ourselves.

HarshTrivedi commented 5 years ago

@robbine btw, the original transformer encoder is already present in allennlp (link). So, in case you need multi-headed self-attention for now, you can find it there.
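
As a rough usage sketch of that encoder, assuming the StackedSelfAttentionEncoder from the AllenNLP 0.x seq2seq_encoders module; the exact constructor arguments may differ between releases, so treat this as illustrative rather than exact.

```python
import torch
from allennlp.modules.seq2seq_encoders import StackedSelfAttentionEncoder

# Argument names follow the 0.x series and may not match other releases.
encoder = StackedSelfAttentionEncoder(
    input_dim=128,
    hidden_dim=128,
    projection_dim=128,
    feedforward_hidden_dim=512,
    num_layers=2,
    num_attention_heads=8,
)
tokens = torch.randn(4, 20, 128)        # (batch, seq_len, input_dim)
mask = torch.ones(4, 20).long()         # 1 = real token, 0 = padding
contextualized = encoder(tokens, mask)  # (batch, seq_len, hidden_dim)
```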

WrRan commented 5 years ago

Google has released their model and code at https://github.com/google-research/bert.

robbine commented 5 years ago

After reading https://github.com/google-research/bert/blob/master/modeling.py, I am trying to write a similar model, but I am confused by the following pooling part, which converts a tensor of shape [batch_size, seq_length, hidden_size] to a tensor of shape [batch_size, hidden_size]:

  with tf.variable_scope("pooler"):
    # We "pool" the model by simply taking the hidden state corresponding
    # to the first token. We assume that this has been pre-trained
    first_token_tensor = tf.squeeze(self.sequence_output[:, 0:1, :], axis=1)
    self.pooled_output = tf.layers.dense(
        first_token_tensor,
        config.hidden_size,
        activation=tf.tanh,
        kernel_initializer=create_initializer(config.initializer_range))

This pooler mechanism only works if the model has already been pre-trained, because it just takes the first token's representation in each sentence. What if I want to train from scratch? Is there a pooling logic I can resort to?

DeNeutoy commented 5 years ago

This question would be better asked on the BERT repository, but in answer to your question: TensorFlow uses a graph-based approach to running neural networks. This means that before anything happens, you build a graph containing all of your possible operations (including the pooling step, which does not happen at training time). During training, this part of the graph is not computed, because the outputs that are used for training (i.e. the loss) do not depend on it. This is a substantially different way of running computations than, for instance, PyTorch, where everything is evaluated line by line (imperative execution). In general, when reading TensorFlow code, it is helpful to remember that the code is just building a graph (not executing any operations) until the tf.Session().run(outputs, inputs) call, at which point only the operations required to compute the outputs are actually run. Hopefully that helps you understand what's happening a bit more!
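
To make that concrete, here is a small TF 1.x-style sketch (not from the BERT repo; the tensors and shapes are made up) showing that only the ops needed for the fetched output are executed:

```python
import numpy as np
import tensorflow as tf  # TF 1.x graph-mode API, as used in the original BERT code

# Build the graph: a toy encoder output plus a BERT-style "pooler" branch.
sequence_output = tf.placeholder(tf.float32, shape=[None, 5, 8], name="sequence_output")
first_token_tensor = tf.squeeze(sequence_output[:, 0:1, :], axis=1)
pooled_output = tf.layers.dense(first_token_tensor, 8, activation=tf.tanh)  # pooler branch
loss = tf.reduce_mean(tf.square(sequence_output))  # stand-in training output; no pooler needed

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    batch = np.random.randn(2, 5, 8).astype(np.float32)
    # Only the ops needed to compute `loss` run here; the pooler branch is never executed.
    print(sess.run(loss, feed_dict={sequence_output: batch}))
```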

ethanjperez commented 5 years ago

Can extracted BERT features be used with AllenNLP directly, in the same way ELMo representations can? Or do e.g. Word Piece embeddings and [CLS] or [SEP] tokens make using BERT embeddings not straightforward?

In particular, I'm interested in using extracted BERT features for SQuAD training in AllenNLP. Any help appreciated - thanks!

robbine commented 5 years ago

> This question would be better asked on the BERT repository, but in answer to your question: TensorFlow uses a graph-based approach to running neural networks. [...]

Thanks! It seems that self.pooled_output is not used for fine-tuning purposes. For example, in run_squad.py the final hidden output (model.get_sequence_output()) is used to compute the logits and losses.
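
For illustration, a hypothetical PyTorch sketch of that kind of span head: a single linear layer over the final hidden states that produces start and end logits. The names here are made up, not taken from run_squad.py.

```python
import torch
import torch.nn as nn

class SpanHead(nn.Module):
    """Maps final hidden states to start/end logits for extractive QA (hypothetical sketch)."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.qa_outputs = nn.Linear(hidden_size, 2)  # one logit each for span start and end

    def forward(self, sequence_output: torch.Tensor):
        # sequence_output: (batch, seq_len, hidden_size)
        logits = self.qa_outputs(sequence_output)            # (batch, seq_len, 2)
        start_logits, end_logits = logits.split(1, dim=-1)   # two (batch, seq_len, 1) tensors
        return start_logits.squeeze(-1), end_logits.squeeze(-1)

head = SpanHead(hidden_size=768)
start, end = head(torch.randn(2, 384, 768))  # e.g. a 384-token SQuAD window
```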

robbine commented 5 years ago

> Can extracted BERT features be used with AllenNLP directly, in the same way ELMo representations can? [...]

I think the idea of BERT is to use a different training logic (masked LM and next-sentence prediction); as for the fine-tuning part, it does not require any special tokens.

ethanjperez commented 5 years ago

Doesn't the AllenNLP SQuAD code use word-level tokenization, instead of word-piece (sub-word) tokenization? Is there a way to incorporate contextual subword embeddings from elsewhere into an AllenNLP SQuAD model?

ethanjperez commented 5 years ago

@matt-gardner Hugging Face released a PyTorch version of BERT that loads the BERT weights: https://github.com/huggingface/pytorch-pretrained-BERT

Is AllenNLP looking into incorporating this into the repo?
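
For anyone curious, a minimal usage sketch of that package; the API shown matches its early releases and may have changed since.

```python
import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

tokens = ["[CLS]"] + tokenizer.tokenize("Add BERT to AllenNLP.") + ["[SEP]"]
input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])

with torch.no_grad():
    # Early releases return per-layer hidden states and the pooled [CLS] vector.
    encoded_layers, pooled_output = model(input_ids)
```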

joelgrus commented 5 years ago

we are looking into it

ethanjperez commented 5 years ago

Great, thank you so much!

thomwolf commented 5 years ago

Hi guys, I have been able to reproduce the TensorFlow repo's results with our PyTorch BERT, both with BERT-base and BERT-large, so I am now confident in providing it as a PyPI package.

I would be happy if it could fit nicely with AllenNLP as Matt was asking. I have a question regarding weight loading.

What do you think is the most convenient: loading Google's TensorFlow checkpoint directly (which would add a TensorFlow dependency), or distributing converted PyTorch dumps of the weights?

If you have an opinion, I'd be happy to hear about it :-)

joelgrus commented 5 years ago

I am pretty sure that we would rather have PyTorch dumps of the weights than have to take an extra dependency on tensorflow.
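
Purely as illustration of that option (names and file path made up, with a stand-in module rather than BERT): a "PyTorch dump of the weights" just means a saved state_dict that consumers can load without any TensorFlow dependency.

```python
import torch
import torch.nn as nn

model = nn.Linear(768, 768)  # stand-in for a BERT model already converted to PyTorch
torch.save(model.state_dict(), "bert_weights.pt")      # the "dump", produced once
model.load_state_dict(torch.load("bert_weights.pt"))   # consumers need only PyTorch
```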

susht3 commented 5 years ago

How do I add a BERT model?