UKPLab / elmo-bilstm-cnn-crf

BiLSTM-CNN-CRF architecture for sequence tagging using ELMo representations.
Apache License 2.0

How to use allennlp.modules.elmo.Elmo class in keras? #25


ghost commented 5 years ago

This notebook shows how ELMo embeddings from the allennlp library can be used in a Keras model: https://github.com/UKPLab/elmo-bilstm-cnn-crf/blob/master/Keras_ELMo_Tutorial.ipynb

In this example, the allennlp.commands.elmo.ElmoEmbedder class is used, which returns three vectors for each word, each vector corresponding to one layer of the ELMo LSTM output.

However, learning a weighted average of the ELMo vectors is not possible with the allennlp.commands.elmo.ElmoEmbedder class; it is only possible with allennlp.modules.elmo.Elmo (as per the information given in https://github.com/allenai/allennlp/blob/master/tutorials/how_to/elmo.md).

I'm unable to understand how to use the allennlp.modules.elmo.Elmo class in a Keras model to get ELMo embeddings.

nreimers commented 5 years ago

In order to compute a weighted average, these weights must be updated during gradient descent. However, we get the gradient only for the Keras layers, not for any layers from PyTorch (i.e., not for any allennlp layers).

Hence, you cannot use allennlp.modules.elmo.Elmo with Keras.

You can implement a weighting layer in Keras in order to compute a weighted average: https://github.com/UKPLab/elmo-bilstm-cnn-crf/blob/master/neuralnets/keraslayers/WeightedAverage.py
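
For illustration, here is a minimal sketch of such a layer, assuming the three ELMo layer outputs are stacked to shape (batch, tokens, 3, 1024); it is not necessarily identical to the linked WeightedAverage.py:

```python
# Minimal sketch of a learned weighted average over the three ELMo layers.
# Assumes input of shape (batch, tokens, 3, 1024); illustrative only.
from keras import backend as K
from keras.layers import Layer

class WeightedAverage(Layer):
    def build(self, input_shape):
        n_layers = input_shape[2]
        # One unconstrained trainable weight per ELMo layer.
        self.layer_weights = self.add_weight(name='layer_weights',
                                             shape=(n_layers,),
                                             initializer='uniform',
                                             trainable=True)
        super(WeightedAverage, self).build(input_shape)

    def call(self, x):
        # Weighted sum over the layer axis -> (batch, tokens, 1024)
        w = K.reshape(self.layer_weights, (1, 1, -1, 1))
        return K.sum(x * w, axis=2)

    def compute_output_shape(self, input_shape):
        return (input_shape[0], input_shape[1], input_shape[3])
```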

Also have a look at my paper: https://arxiv.org/pdf/1904.02954.pdf

In most cases, computing a fixed average leads to a similar performance as computing a learned weighted average.
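
A fixed (unweighted) average needs no trainable parameters at all; assuming the (3, num_tokens, 1024) layout returned by ElmoEmbedder.embed_sentence, it is just a mean over the layer axis:

```python
import numpy as np

def fixed_average(elmo_layers):
    # elmo_layers: (3, num_tokens, 1024) array from ElmoEmbedder.embed_sentence
    return np.mean(elmo_layers, axis=0)  # -> (num_tokens, 1024)
```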

ghost commented 5 years ago

Thanks and nice explanation.

In the original ELMo paper, the authors include 4 task-specific weights (one weight for scaling and three softmax-normalized layer weights).

I would like to know whether the learned weighted average (https://github.com/UKPLab/elmo-bilstm-cnn-crf/blob/master/neuralnets/keraslayers/WeightedAverage.py) includes all four task-specific weights or only three. Are these weights softmax-normalized?

ghost commented 5 years ago

"Alternative Weighting Schemes for ELMo Embeddings" (https://arxiv.org/pdf/1904.02954.pdf ) is an excellent work with a very good analysis.

I have read the paper completely, and I have one small doubt.

According to the paper "Deep Contextualized Word Representations", the ELMo embedding of a word is obtained as a weighted sum of the three layer outputs, followed by scaling.

In your paper (https://arxiv.org/pdf/1904.02954.pdf), you mention the same formula (in the Introduction section), but you use the term "weighted average" instead of "weighted sum".

Correct me, if I'm wrong.

nreimers commented 5 years ago

Hi @kalyanks0611, according to the Peters et al. paper: 'we allowed the task model to learn a weighted average of all biLM layers'.

I'm not sure where you found the phrase 'weighted sum'.

In the formula, and also in the AllenNLP implementation, there is a weighted sum of the three ELMo layers. However, the weights s_j sum up to 1 by construction, which makes this a weighted average.

A special property of the ELMo weighting is that the factors s_j are softmax-normalized, i.e., zero or even negative weights are not possible in the AllenNLP implementation / in the original paper by Peters et al.
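
For reference, the task-specific combination from Peters et al. (2018), writing the layer outputs as h_{k,j}, is:

```latex
% ELMo task-specific combination (Peters et al., 2018):
% the layer weights s_j are softmax-normalized (they sum to 1),
% and gamma is a separate task-specific scaling factor.
\mathrm{ELMo}_k^{task}
  = \gamma^{task} \sum_{j} s_j^{task} \, \mathbf{h}_{k,j}^{LM},
\qquad
  \mathbf{s}^{task} = \operatorname{softmax}(\mathbf{w}^{task})
```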

ghost commented 5 years ago

Thanks @nreimers

In the original ELMo paper, the authors include 4 task-specific weights (one weight for scaling and three softmax-normalized layer weights).

I would like to know whether the learned weighted average (https://github.com/UKPLab/elmo-bilstm-cnn-crf/blob/master/neuralnets/keraslayers/WeightedAverage.py) includes all four task-specific weights or only three. Are these weights softmax-normalized?

nreimers commented 5 years ago

The code you are linking to only learns one weight for each of the 3 ELMo layers (and no scaling factor). Further, the weights it learns are not softmax-normalized.

The formula in the paper is: \gamma \sum s_j h_j

If you include the \gamma into the sum, you can write it as:

\sum (\gamma s_j) h_j = \sum k_j h_j with k_j = \gamma s_j

As the implementation does not restrict the values for k_j, a scaling factor is not needed.

Further, the implementation provided in this repository allows individual weights to be zero or negative, which is not possible in the original ELMo implementation. I.e., this summing of the three vectors should be more powerful than the weighting shown in the ELMo paper.

Sadly, I didn't evaluate whether this has an impact on downstream tasks or whether a softmax-normalized weighting scheme is better. So if someone could do this evaluation, I would highly appreciate it.
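
If someone wants to run that comparison, a hypothetical softmax-normalized variant (including the \gamma scaling factor, mimicking Peters et al.) could look roughly like this in Keras; this is a sketch, not code from this repository:

```python
# Hypothetical softmax-normalized weighting with scaling factor gamma.
# Assumes the same (batch, tokens, 3, 1024) input layout as above.
from keras import backend as K
from keras.layers import Layer

class SoftmaxWeightedAverage(Layer):
    def build(self, input_shape):
        n_layers = input_shape[2]
        self.w = self.add_weight(name='w', shape=(n_layers,),
                                 initializer='zeros', trainable=True)
        self.gamma = self.add_weight(name='gamma', shape=(1,),
                                     initializer='ones', trainable=True)
        super(SoftmaxWeightedAverage, self).build(input_shape)

    def call(self, x):
        s = K.softmax(self.w)                     # s_j >= 0, summing to 1
        s = K.reshape(s, (1, 1, -1, 1))
        return self.gamma * K.sum(x * s, axis=2)  # gamma * sum_j s_j h_j

    def compute_output_shape(self, input_shape):
        return (input_shape[0], input_shape[1], input_shape[3])
```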

ghost commented 5 years ago

But in your paper (https://arxiv.org/pdf/1904.02954.pdf), in the description of Table 1, you mention: "W.-Avg.: Learned weighted average of the three layers, as proposed by Peters et al., W.-Avg. 1 & 2: Learned weighted average of the first and second layer of the biLM".

If we compute the learned weighted average of the three layers as proposed by Peters et al., then the weights must be softmax-normalized, and a scaling factor should also be included.

My doubt is whether you calculated the weighted average as proposed by Peters et al., or the weighted average with weights that are not softmax-normalized, as in this link (https://github.com/UKPLab/elmo-bilstm-cnn-crf/blob/master/neuralnets/keraslayers/WeightedAverage.py).

nreimers commented 5 years ago

Hi @kalyanks0611, thanks for pointing this out.

The upper half of the table is indeed averaged as you find it here in the repository.

The lower half of the table, which uses the AllenNLP system, uses softmax-normalized weights and scaling.

So right now there is an unfortunate mixing of the two averaging strategies in that table. I will update the paper to make this clearer.

ghost commented 5 years ago

How did you manage to get reproducible results using Keras for building the models and allennlp for the ELMo embeddings?

I tried all of the steps below in order to get reproducible results, but I couldn't:

```python
import numpy as np
import tensorflow as tf
import random as rn

# Seed Python's random module and NumPy
np.random.seed(42)
rn.seed(12345)

# Single-threaded execution to avoid non-deterministic op ordering (TF 1.x)
session_conf = tf.ConfigProto(intra_op_parallelism_threads=1,
                              inter_op_parallelism_threads=1)

from keras import backend as K

# Seed TensorFlow and attach the configured session to Keras
tf.set_random_seed(1234)

sess = tf.Session(graph=tf.get_default_graph(), config=session_conf)
K.set_session(sess)
```

What do I have to do further to get reproducible results?

nreimers commented 5 years ago

Hi, AllenNLP uses PyTorch, and sadly ELMo is not deterministic.

So you also need to seed torch: https://pytorch.org/docs/stable/notes/randomness.html
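
For example, the standard seeding calls from those notes look like this (a sketch; the cuDNN flags are only relevant on GPU and may vary between PyTorch versions):

```python
import torch

torch.manual_seed(1234)
torch.cuda.manual_seed_all(1234)            # only relevant when running on GPU
torch.backends.cudnn.deterministic = True   # trade speed for determinism
torch.backends.cudnn.benchmark = False
```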

However, relying on a single seed is usually a bad scientific setup. Instead, I recommend always experimenting with multiple seeds and averaging the results.

Best Nils Reimers

ghost commented 5 years ago

If I set the seed for PyTorch in addition to Python, NumPy, and TensorFlow, do I get reproducible results (with very little variation) for a given seed? I'm executing all the code on a GPU.

nreimers commented 5 years ago

When all seeds are set, you should get deterministic results. When running on a GPU, you might need to set the seed there as well (I can't remember, as I avoid setting fixed seeds).

ghost commented 5 years ago

Thanks, @nreimers, for clarifying all my doubts and sharing valuable information, which helped me finish my paper. I'm going to submit it to one of the EMNLP 2019 workshops, and I have cited your paper in it.

nreimers commented 5 years ago

Happy to hear that I was of help. Good luck with your paper!