allenai / bilm-tf

Tensorflow implementation of contextualized word representations from bi-directional language models
Apache License 2.0

Linear Combination of Embeddings #95

Closed mihirkale815 closed 6 years ago

mihirkale815 commented 6 years ago

Does the linear combination of embeddings mainly work because of the skip connections / highway layers in the model? Without them, the embeddings from different layers would live in different vector spaces and adding them would not make sense. If the biLM were trained without the skip/highway connections, I'd expect ELMo either to give bad results or to put all of the linear combination's weight on the most informative embedding layer, with the weights for the other layers going to zero. Is this intuition correct?
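For context, the combination in question is the scalar mix from the ELMo paper, ELMo_k = γ Σ_j s_j h_{k,j}, where the s_j are softmax-normalized per-layer weights and γ is a global scale. A minimal numpy sketch of that mix (shapes and weight values below are illustrative, not trained values):

```python
import numpy as np

# Three biLM layers (token embedding + two LSTM layers), each producing a
# [seq_len, dim] representation for one sentence; shapes are illustrative.
seq_len, dim = 5, 8
layers = [np.random.randn(seq_len, dim) for _ in range(3)]

# One raw scalar per layer plus a global scale gamma; in ELMo these are
# learned per downstream task (the values here are made up).
raw_weights = np.array([0.1, -0.3, 0.5])
gamma = 1.0

# Softmax-normalize so the per-layer weights sum to 1.
s = np.exp(raw_weights) / np.exp(raw_weights).sum()

# ELMo_k = gamma * sum_j s_j * h_{k,j}. The sum is only meaningful if the
# layers live in comparable vector spaces, which is exactly the concern
# about removing the skip/highway connections.
elmo = gamma * sum(w * h for w, h in zip(s, layers))
print(elmo.shape)  # (5, 8)
```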

matt-peters commented 6 years ago

Yes, that is my intuition too, but I'm not sure it's the case. We had some preliminary results using a biLM without residual connections and still found the scalar weighting to help. We never ran follow-up experiments with the model without residual connections to do a careful ablation of the scalar weighting's impact, as the model with residual connections that we released performed better.
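One cheap way to probe the "weight collapse" intuition on a trained model is to inspect whether the softmax-normalized layer weights concentrate on a single layer. A sketch with invented raw-weight values (illustrative only, not numbers from our experiments):

```python
import numpy as np

# Hypothetical raw scalar-mix weights from two biLMs; the values are
# invented for illustration.
with_residual = np.array([0.2, 0.1, -0.1])
without_residual = np.array([-2.0, 4.5, -1.5])

def normalized(w):
    # Numerically stable softmax over the per-layer weights.
    e = np.exp(w - w.max())
    return e / e.sum()

# With residual connections the mass could stay spread across layers;
# the intuition above predicts near-one-hot weights without them.
print(normalized(with_residual))     # roughly uniform
print(normalized(without_residual))  # almost all mass on layer 1
```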

iliaschalkidis commented 6 years ago

@matt-peters first of all, thanks a lot for publishing your code; it is great to have a practical companion to your research work. I'm currently porting your biLM code to Keras, and I had a question related to @mihirkale815's point.

I also have some more questions about the implementation that I would really like to discuss with you. Is the issue tracker the right place to ask them, or would you recommend an alternative?

matt-peters commented 6 years ago

We didn't start the ELMo project planning to use the linear combination of layers; that was something we discovered along the way through experimentation. As a result, the biLM architecture evolved independently of the linear-combination use case (and, e.g., using just the top layer is a perfectly sensible thing to do). The details of the residual connection between just the LSTM layers were borrowed from Melis et al., "On the State of the Art of Evaluation in Neural Language Models", https://arxiv.org/abs/1707.05589 (in particular, see Figure 1 in v1 of the paper, which was the version available when we ran the experiments and wrote the paper; v2 appeared later).
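For reference, a minimal sketch of that residual scheme, with skip connections between the stacked LSTM layers only (written with tf.keras for brevity, although bilm-tf itself uses the TF1 APIs; the function name and layer sizes are illustrative):

```python
import tensorflow as tf

def stacked_lstm_with_residual(inputs, num_layers=2, units=16):
    """Stack LSTM layers with residual connections between the LSTM
    layers only; the token embedding is not part of the skip path."""
    x = inputs
    for i in range(num_layers):
        h = tf.keras.layers.LSTM(units, return_sequences=True)(x)
        # Add the previous layer's output to this layer's output,
        # skipping the first layer, whose input dim may differ from
        # `units`.
        x = h + x if i > 0 else h
    return x

# Illustrative shapes: batch of 2 sentences, 5 tokens, 16-dim embeddings.
dummy = tf.random.normal((2, 5, 16))
out = stacked_lstm_with_residual(dummy)
print(out.shape)  # (2, 5, 16)
```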