allenai / allennlp

An open-source NLP research library, built on PyTorch.
http://www.allennlp.org
Apache License 2.0

How to generate embeddings using BERT? #2140

Closed Arjunsankarlal closed 5 years ago

Arjunsankarlal commented 5 years ago

Since BERT is included in the new release, I am trying to generate embeddings with it the same way we do with ELMo for contextual representations.

While working with ELMo, it was easy to generate embeddings, since there are dedicated APIs such as embed_sentence(), embed_sentences(), embed_batch(), etc.
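For reference, a minimal sketch of that ELMo workflow (assuming allennlp's ElmoEmbedder with its default pre-trained weights; the sample sentence is illustrative):

```python
from allennlp.commands.elmo import ElmoEmbedder

# Downloads the default pre-trained biLM weights on first use.
elmo = ElmoEmbedder()

# embed_sentence takes a pre-tokenized sentence and returns a numpy
# array of shape (3, num_tokens, 1024): one 1024-dim vector per token
# for each of the three biLM layers.
vectors = elmo.embed_sentence(["I", "ate", "an", "apple"])
print(vectors.shape)  # (3, 4, 1024)
```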

In the case of BERT, I downloaded the pre-trained models from Google's BERT repo and loaded the model with

BertModel.from_pretrained('bert-base-uncased')

and then tokenized the input sentence with BertTokenizer. The embedding-generation step, however, is quite confusing, and I could not find any clear documentation about it.

If I am not approaching the problem the right way, please correct me. Also, how can BERT and ELMo be used in combination to get more contextualised embeddings?

Any lead would be helpful.

schmmd commented 5 years ago

@Arjunsankarlal we use an implementation from huggingface to provide BERT embeddings as a part of a model architecture. If you're just interested in running BERT on some sample input to get word vectors, I recommend you take a look at their library directly.

BertModel is a class in their library, and their README documents how to get word vectors with it.
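For reference, a minimal sketch along the lines of that README (using the pytorch-pretrained-BERT package as it existed at the time; the sample sentence is illustrative):

```python
import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()

# BERT expects WordPiece tokens wrapped in [CLS] ... [SEP].
text = "[CLS] the quick brown fox [SEP]"
tokens = tokenizer.tokenize(text)
token_ids = tokenizer.convert_tokens_to_ids(tokens)

tokens_tensor = torch.tensor([token_ids])
segments_tensor = torch.zeros_like(tokens_tensor)  # single-sentence input

with torch.no_grad():
    # encoded_layers: list of 12 tensors, one per layer,
    # each of shape (batch_size, seq_len, 768)
    encoded_layers, pooled = model(tokens_tensor, segments_tensor)

# A common choice for per-token "word vectors" is the last layer.
word_vectors = encoded_layers[-1]
```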

search4mahesh commented 5 years ago

@schmmd Could you please point to an exact example? Sorry, I am confused here.

Thanks, Mahesh

schmmd commented 5 years ago

Looks like huggingface documents this here: https://github.com/huggingface/pytorch-pretrained-BERT#bert