flairNLP / flair

A very simple framework for state-of-the-art Natural Language Processing (NLP)
https://flairnlp.github.io/flair/

Comparison between BERT, ELMo, and Flair embeddings #308

Closed tabergma closed 4 years ago

tabergma commented 5 years ago

We want to collect experiments here that compare BERT, ELMo, and Flair embeddings. So if you have any findings on which embedding type works best on what kind of task, we would be more than happy if you shared your results. We are also going to run some experiments and share our results here.

minh-agent commented 5 years ago

We are trying to evaluate BERT/Flair for product attribute extraction (named entity extraction) in Japanese, so I would like to know how we can build our own model/embedding for Japanese.

Japanese does not have spaces between words, e.g. 私はお寿司を食べたい (I would like to eat sushi). Is it possible to do the same thing as for English once we tokenize the Japanese sentences (e.g. 私はお寿司を食べたい -> 私 は お寿司 を 食べたい)?

JoanEspasa commented 5 years ago

@minh-agent Do you already have a labeled dataset? That is the first step. If I recall correctly, the closest dataset could be OntoNotes: it has a PRODUCT category, but it is not in Japanese. If you don't have a dataset, you could kickstart by translating automatically and using that to build your first annotated dataset.

Regarding the spacing, note that spaces are typically used only for tokenization. For Japanese you may use any language-specific tokenizer, for example https://pypi.org/project/tinysegmenter/.
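
For example (a sketch, assuming tinysegmenter's TinySegmenter().tokenize API; any Japanese tokenizer that returns a token list would work the same way):

from flair.data import Sentence
import tinysegmenter

# tokenize the Japanese text first
segmenter = tinysegmenter.TinySegmenter()
tokens = segmenter.tokenize('私はお寿司を食べたい')

# join the tokens with spaces so that flair's whitespace tokenization applies
sentence = Sentence(' '.join(tokens))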

alanakbik commented 5 years ago

Release 0.4, which includes BERT and ELMo embeddings, went live two days ago. For instance, simply do:

from flair.data import Sentence
from flair.embeddings import BertEmbeddings

# init embedding
embedding = BertEmbeddings()

# create a sentence
sentence = Sentence('The grass is green .')

# embed words in sentence
embedding.embed(sentence)

to get started with BERT embeddings. Check out this tutorial for more info on ELMo, Flair and BERT embeddings.
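
The other embedding classes from the tutorial follow the same pattern; a minimal sketch (class names as in the 0.4 release; ELMo requires the allennlp package and downloads the pre-trained model on first use):

from flair.data import Sentence
from flair.embeddings import ELMoEmbeddings, FlairEmbeddings

# init ELMo embeddings
elmo_embedding = ELMoEmbeddings()

# Flair embeddings come as separate forward and backward language models
flair_embedding = FlairEmbeddings('news-forward')

# embed words in a sentence
sentence = Sentence('The grass is green .')
elmo_embedding.embed(sentence)
flair_embedding.embed(sentence)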

stefan-it commented 5 years ago

From #329 here are some results for a Basque language model:

| Model | Final Accuracy |
| ----- | -------------- |
| FastText Wikipedia embeddings + Flair language model | 97.17 |
| FastText Wikipedia embeddings + ELMo | 95.50 |
| FastText Wikipedia embeddings + BERT multilingual (`layers='-1,-2,-3,-4'`) | 95.15 |
| FastText Wikipedia embeddings + BERT multilingual (`layers='-4'`) | 95.65 |
| FastText Wikipedia embeddings + BERT multilingual (`layers='-1'`) | 95.97 |
| FastText Wikipedia embeddings + ELMo Transformer | 95.80 |

The ELMo model was trained from scratch with bilm-tf on the same training/validation/test data as the flair language models. I'm currently trying to figure out how to train and test a PoS tagging model with allennlp. The current master version of allennlp also provides code for training a transformer-based ELMo model (but so far there is only code for training an NER model with a transformer-based ELMo model...).
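
For context, combinations like those in the table are built with StackedEmbeddings, which concatenates the per-token vectors; a sketch, assuming FastText Wikipedia embeddings are available in flair under the 'eu' id (the 'eu-forward'/'eu-backward' Flair model ids are hypothetical placeholders):

from flair.embeddings import WordEmbeddings, FlairEmbeddings, StackedEmbeddings

# FastText Wikipedia embeddings for Basque
fasttext_embedding = WordEmbeddings('eu')

# stack them with forward and backward Flair language models
stacked_embedding = StackedEmbeddings([
    fasttext_embedding,
    FlairEmbeddings('eu-forward'),
    FlairEmbeddings('eu-backward'),
])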

Yugioh1984 commented 5 years ago

I'm trying to compare the Flair LM with ELMo and BERT on the NER task, but I get very poor results for ELMo and BERT when I use them inside the flair framework. Any best practices I should follow in order to get better results?

stefan-it commented 5 years ago

I'll present a per-layer analysis (for BERT) in a few hours here :D

stefan-it commented 5 years ago

Here's a per-layer analysis for PoS tagging on Universal Dependencies for Basque, using the BERT multilingual model:

[figure: per-layer accuracy of BERT multilingual on UD Basque]

Please mind the y-axis (it ranges from 94% to 96%)!

As you can see, using the last layer achieved an accuracy of 95.97%.

The analysis is inspired by Peters et al. (https://arxiv.org/abs/1808.08949).

I'll try to reproduce that experiment on other languages in the near future :)

stefan-it commented 5 years ago

I'm not quite sure if we really should use a concatenation of the last four layers (as suggested in the original BERT paper for NER), because this only led to an accuracy of 95.15%.
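
In flair, the layer selection is controlled via the layers parameter of BertEmbeddings, so both variants can be expressed like this (a sketch; '-1,-2,-3,-4' is the default):

from flair.embeddings import BertEmbeddings

# concatenation of the last four layers (default, as suggested in the BERT paper)
bert_last_four = BertEmbeddings('bert-base-multilingual-cased', layers='-1,-2,-3,-4')

# last layer only (the best-scoring variant in the experiments above)
bert_last_layer = BertEmbeddings('bert-base-multilingual-cased', layers='-1')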

Yugioh1984 commented 5 years ago

That's very interesting. It will be nice to see the results for other languages as well. By the way, did you use the flair framework for these experiments, or did you implement the code from scratch? I'm trying to run similar experiments on the NER task for many languages using flair.

alanakbik commented 5 years ago

Wow thanks for this analysis - very interesting. If this holds up for other tasks, we should consider changing the default to last layer only!

(FYI @jacobdevlin-google - BERT analyzed for Basque)

stefan-it commented 5 years ago

Here's a per-layer analysis of PoS tagging for German (on Universal Dependencies):

[figure: per-layer accuracy of BERT multilingual on UD German]

Yugioh1984 commented 5 years ago

Very interesting. It's a little different from Basque, but it seems that using the last layer gives the best results. By the way, what max_epochs did you use for your experiments, and at which point did you get the best results?

stefan-it commented 5 years ago

@Yugioh1984 Thanks for your questions :) I trained the PoS tagging models with the flair library (and not from scratch) using the multilingual model (more details about training data can be found here).

Basque

I used an annealing learning rate with a patience of 3. Thus, for Basque the number of epochs varied from 113 to 157. The training time (on an RTX 2070) was between 1:06h and 1:31h (per layer).

Detailed results can be found in my experiments repository (including hyperparameters).

German

I also used an annealing learning rate with a patience of 3 for German. The number of epochs varied from 155 to 211. The training time (also on an RTX 2070) was longer than for the Basque dataset, between 3:17h and 4:22h (per layer).

Detailed results can also be found in my experiments repository.

Note: I used use_crf=False. Turning it on would likely boost performance, but I simply forgot to do so ;)
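
For reference, the training setup described above corresponds roughly to the following flair configuration (a sketch; `corpus` and `embeddings` are assumed to be loaded already, and the paths and hyperparameters are illustrative):

from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# build the tag dictionary from the corpus
tag_dictionary = corpus.make_tag_dictionary(tag_type='upos')

# sequence tagger without a CRF layer, as in the experiments above
tagger = SequenceTagger(hidden_size=256,
                        embeddings=embeddings,
                        tag_dictionary=tag_dictionary,
                        tag_type='upos',
                        use_crf=False)

# annealing learning rate with a patience of 3
trainer = ModelTrainer(tagger, corpus)
trainer.train('resources/taggers/ud-pos',
              learning_rate=0.1,
              anneal_factor=0.5,
              patience=3,
              max_epochs=200)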

alanakbik commented 5 years ago

@stefan-it how long did it take to compute the ELMo-tf models compared to the Flair LMs? Is it difficult to compute an ELMo model this way and integrate it into Flair? Do you have a feeling of how transformer and LSTM ELMo compare?

@Yugioh1984 what results are you seeing with ELMo on NER? Back when we did experiments for the paper, we got pretty good results for ELMo across all English tasks, but haven't experimented with it much since.

Yugioh1984 commented 5 years ago

@alanakbik Actually, my understanding was that to fine-tune LMs like BERT/ELMo for a specific dataset I should just train for a few epochs (<=5), so I was surprised when I got poor results and thought I was missing something. I reran the experiment using BERT for 50 epochs and now get better results. I will do the same for ELMo to see if it behaves similarly.

stefan-it commented 5 years ago

@alanakbik For the (bi) Transformer ELMo model, one epoch took about 2 hours. One direction of a Flair embeddings model took about 1:40h (so 3:20h for the forward + backward models).

Training an LSTM ELMo model took about 8-12 hours for one epoch. One very big disadvantage is that the TensorFlow implementation of the LSTM ELMo model needs a lot of RAM (PC RAM, not GPU RAM): in my experiments training took over 32 GB (!) of RAM, and my workstation only had 32 GB, so it was swapping a bit.

I only ran experiments on Basque, with ~37,778,224 tokens. I wanted to train my own model for Dutch, but with the LSTM ELMo TensorFlow implementation I ran out of memory (both PC RAM and GPU RAM). I haven't tried the Transformer implementation yet.

I'm currently trying to get the word embeddings from the Transformer ELMo model, see this issue in allennlp. When this is possible with the allennlp library we could also integrate the Transformer ELMo model in flair :)

Another comparison candidate: the upcoming release of pytorch-pretrained-BERT will also include methods to get embeddings from an OpenAI GPT Transformer model, see this pull request. Once this is released, I think we could easily add support for the OpenAI GPT model in flair :)

alanakbik commented 5 years ago

@stefan-it thanks for sharing your experience! It would be great if we could integrate the OpenAI transformer, so hoping we can do this soon :)

stefan-it commented 5 years ago

I wrote a new embedding class ELMoTransformerEmbeddings and was able to load my own trained ELMo transformer-based model 😊

On UD Basque an accuracy of 95.80% could be achieved (I also updated the result in the table above).

I'll open a PR for introducing the new ELMoTransformerEmbeddings class soon 🙃

alanakbik commented 5 years ago

Wow this is great!! Really look forward to playing around with ELMoTransformerEmbeddings!

Could you perhaps, in a separate PR after this, also add some instructions on how to train an ELMo transformer LM and use it in Flair?

iamyihwa commented 5 years ago

Hi guys, I have tried different embeddings for a sentiment classification task: [SemEval-2018 Affect in Tweets, Task E-c](https://competitions.codalab.org/competitions/17751#learn_the_details-overview), Detecting Emotions (multi-label classification). Given a tweet, the task is to classify it with one or more emotions, e.g. anger, joy, etc. An example training instance formatted for flair looks like this:

__label__anticipation __label__optimism __label__trust "Worry is a down payment on a problem you may never have". Joyce Meyer. #motivation #leadership #worry

I have used DocumentLSTMEmbeddings for this task, with different combinations of embeddings (results were posted as images):

1) GloVe, Flair and BERT
2) GloVe, Flair
3) [GloVe-Twitter](http://nlp.stanford.edu/data/glove.twitter.27B.zip), Flair
4) BERT

There were no significant differences among them. This could be due to the dataset being quite small, and sometimes a bit ambiguous (at least it seems so to me).

The current best result for this task is about 0.1 higher in F1 than the results I get above (the state-of-the-art leaderboard was posted as an image, taken from the competition page). My test data was not quite the same as theirs, though: since I didn't have labels for the test data, and the competition is already finished, I split the validation set into 70% validation and 30% test.

I have a couple of doubts.

I wonder if I am using BERT in the correct way. I guess right now the only way to use BERT within the flair framework is as an embedding.

Any ideas on handling classification tasks better? It is quite nice that with just a little training (less than an hour or so) I get a good result, but it is still a bit far from the state of the art. Any suggestions/ideas on improving this (using deep learning models / flair for text classification)?
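
For reference, my setup is roughly the following (a sketch of variant 2 above; `label_dictionary` is assumed to come from the loaded corpus, and the hyperparameters are illustrative):

from flair.embeddings import WordEmbeddings, FlairEmbeddings, DocumentLSTMEmbeddings
from flair.models import TextClassifier

# token-level embeddings to combine (variant 2: GloVe + Flair)
word_embeddings = [WordEmbeddings('glove'),
                   FlairEmbeddings('news-forward'),
                   FlairEmbeddings('news-backward')]

# an LSTM pools the token embeddings into a single document vector
document_embeddings = DocumentLSTMEmbeddings(word_embeddings,
                                             hidden_size=512,
                                             reproject_words=True)

# multi_label=True since a tweet can carry several emotions at once
classifier = TextClassifier(document_embeddings,
                            label_dictionary=label_dictionary,
                            multi_label=True)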

stefan-it commented 5 years ago

@alanakbik I opened a PR for the transformer-based ELMo integration in flair. It would be awesome if you or others could fix the strange pickle error (it only occurs when the trained model is loaded and used for prediction; during training no errors are thrown). See #399 :)

In that PR you can also find a link to the pre-trained transformer-based ELMo model that I used in my experiments, and a detailed description of how to train a model for PoS tagging :)

alanakbik commented 5 years ago

@iamyihwa thanks for posting these results - very interesting! You could probably try some hyperparameter optimization to get better results. But without the test data, it will be hard to compare against the results posted on the competition page. Perhaps when they release the test data, we can add it to the data loader to encourage experiments on this task!

@stefan-it Thanks very much for the PR! We'll take a look and see if we can fix this error!

stefan-it commented 5 years ago

@alanakbik Thanks :) I'll also write a documentation section for training a transformer-based ELMo model from scratch, so it can be used within flair.

minh-agent commented 5 years ago

With support from @alanakbik, we evaluated NER in Japanese & English. Both Flair & BERT are excellent! (the results were posted as images)

Yugioh1984 commented 5 years ago

Concerning ELMo: when I test it on the CoNLL-03 dataset I get good results, but on a Twitter NER dataset I get poor results compared to Flair/BERT. Any recommendations on how I should fine-tune my model to get better results?

levi1996-zz commented 5 years ago

@minh-agent did you use a pre-trained Flair model to achieve that result in Japanese? I tried to train a Flair language model and then use it to train NER, but the results were very bad! Can you tell me how you achieved that result in Japanese? Thank you very much!

hendriksc commented 5 years ago

@minh-agent Could you go into the details of how you achieved these results on the English CoNLL-03 using BERT? Did you use a traditional word embedding in addition to BERT? What were the hyperparameters? Did you fine-tune the pre-trained BERT, or make some other variations to the architecture? Thanks in advance!

minh-agent commented 5 years ago

Hi! We just used the pre-trained model multi_cased_L-12_H-768_A-12 provided by the authors of BERT, without any additional word embedding. The architecture is the same as in the paper. Here is how we set the hyperparameters:

python3 BERT_NER.py \
  --task_name="NER" \
  --do_train=True \
  --do_eval=True \
  --do_predict=True \
  --data_dir=NERdata/en \
  --column_sep=" " \
  --vocab_file=./multi_cased_L-12_H-768_A-12/vocab.txt \
  --bert_config_file=./multi_cased_L-12_H-768_A-12/bert_config.json \
  --init_checkpoint=./multi_cased_L-12_H-768_A-12/bert_model.ckpt \
  --max_seq_length=128 \
  --train_batch_size=32 \
  --learning_rate=2e-5 \
  --num_train_epochs=6 \
  --save_checkpoints_steps=1000 \
  --output_dir=./output/result_dir/

stefan-it commented 5 years ago

I did some comparisons between ELMo and the ELMo Transformer model on CoNLL-2003 for NER.

With the original ELMo model I could achieve an F1-score of 92.02%; with the ELMo Transformer model, an F1-score of 90.57% (no word embeddings were used in either case).

Btw: the ELMo Transformer model for English is now available on the AllenNLP page :) The model can be directly loaded in flair using the ELMoTransformerEmbeddings class (more information here) :)
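
Loading it looks roughly like this (a sketch; the path to the trained model archive is a placeholder):

from flair.data import Sentence
from flair.embeddings import ELMoTransformerEmbeddings

# init with a trained transformer-based ELMo model archive
embedding = ELMoTransformerEmbeddings('path/to/elmo_transformer_model.tar.gz')

sentence = Sentence('The grass is green .')
embedding.embed(sentence)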

alanakbik commented 5 years ago

@stefan-it awesome, thanks for sharing these results. Interesting that the transformer fares less well than the original ELMo model. Have you tried stacking the ELMo transformer with GloVe embeddings? I could be wrong but I seem to remember that the ELMo LM initializes with standard word embeddings, so they are implicitly included here.

stefan-it commented 5 years ago

Hm, I ran an experiment with GloVe + ELMo transformer and the result is worse than using ELMo transformer embeddings only 🤔 (90.38% vs. 90.57%)

alanakbik commented 5 years ago

Strange, but this could indicate that GloVe does not add significant information over the transformer so they perform roughly the same in the downstream task. Still, interesting that the original ELMo seems to work so much better here.

Das-Boot commented 5 years ago

Hi guys, I have tried different embeddings for the causality extraction task; the results (precision, recall and F1-score) were posted as an image. You can also find detailed evaluations and discussion in my paper, Causality Extraction based on Self-Attentive BiLSTM-CRF with Transferred Embeddings.

alanakbik commented 5 years ago

Hello @Das-Boot thanks very much for sharing these results and the paper - looks really interesting!

alanakbik commented 5 years ago

@Das-Boot in the paper you note that the multi headed self attention is giving a boost in sequence labeling quality. Could you share your implementation, or consider adding it to Flair as a PR? I'd be curious if this also improves other sequence labeling tasks such as NER.

Das-Boot commented 5 years ago

Sorry, I am not very familiar with the PyTorch version of multi-head self-attention (MHSA). In the paper I used the Keras version of MHSA.

Yugioh1984 commented 5 years ago

@stefan-it: the new PyTorch 1.1 release supports it out of the box via a new module called nn.MultiheadAttention; you can give it a try: https://github.com/pytorch/pytorch/releases/tag/v1.1.0
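
A minimal sketch of that module (self-attention, so query, key and value are the same tensor; inputs are shaped (seq_len, batch, embed_dim)):

import torch
import torch.nn as nn

# 8 attention heads over a 256-dimensional token representation
mhsa = nn.MultiheadAttention(embed_dim=256, num_heads=8)

# a batch of 4 sequences, each 20 tokens long
x = torch.randn(20, 4, 256)  # (seq_len, batch, embed_dim)
attn_output, attn_weights = mhsa(x, x, x)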

Shandilya21 commented 5 years ago

@stefan-it Hi, have you tried [flair + BERT embeddings] for an encoder-attend-decoder model? How was the performance? I am working on conversational systems; I use GloVe and ELMo, but the results are not satisfactory.

alanakbik commented 5 years ago

@Shandilya21 Flair embeddings might be helpful here, but I am not sure if anyone has used them yet in such a context. If you have interesting results here, please let us know!

Dragon615 commented 4 years ago

Hi,

I have a question regarding combining ELMo with FastText or Flair embeddings. It's not clear to me which embedding should be used when training a model. If a word like "play" appears in multiple contexts, ELMo will produce a different embedding for it in each context, while FastText produces its embedding from the sub-word information of "play". Can someone explain to me briefly how to use both FastText and ELMo embeddings together (see the sketch below for what I mean)? Also, are you fine-tuning the pre-trained embeddings while training your models?
Sorry if my questions look silly :|
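
For concreteness, would the combined usage look roughly like this (a sketch using flair's StackedEmbeddings, which simply concatenates both vectors for every token)?

from flair.embeddings import WordEmbeddings, ELMoEmbeddings, StackedEmbeddings

# FastText vectors (static, sub-word based) + ELMo vectors (contextual):
# the stacked embedding concatenates both per token
stacked_embedding = StackedEmbeddings([
    WordEmbeddings('en'),   # FastText embeddings for English
    ELMoEmbeddings(),
])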

Thanks in advance for your help!

qiuwei commented 4 years ago

> With support from @alanakbik, we evaluated NER in Japanese & English. Both Flair & BERT are excellent!

Hi @minh-agent, could you give some rough numbers on the inference speed of flair for Japanese? We are not satisfied with BERT's speed, but are not sure whether we should try flair for Chinese.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.