Yes, that is my intuition too, but I'm not sure that's the case. We had some preliminary results using a biLM without residual connections and still found the scalar weighting to help. We never ran any follow-up experiments with the model without residual connections to do a careful ablation of the impact of scalar weighting, as the model with residual connections that we released performed better.
@matt-peters first of all, thanks a lot for publishing your code; it is great to have a practical reference implementation of your research work. I'm currently translating your biLM code into Keras and I have a question related to @mihirkale815's point.
Why did your group choose not to use a residual connection between the initial token embeddings (produced by either the character-based token encoder or the simple embedding layer) and the first LSTM layer? I'm referring to line 380 in bilm-tf/bilm/training.py.
Given that your final goal is to produce an ELMo embedding as a weighted average of the LSTM outputs and the initial token embedding, as described in your paper, wouldn't such an initial residual connection "force" the network to keep the representations aligned? Isn't it possible that, after training the biLM, there will be no correlation between the initial embedding and the later LSTM outputs, so that the addition (weighted average) no longer makes sense? The transformation inside the LSTM cell gives no such guarantee, and there is no other part of the network that directly enforces this alignment.
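To make the question concrete, here is a simplified, unidirectional Keras sketch of how I read the stack (hypothetical `dim`, no character CNN or projections, and not your released architecture), marking where the residual exists and where it does not:

```python
import tensorflow as tf

dim = 512  # hypothetical size; the released model uses its own dimensions/projections

# Token representations from the character CNN or a plain embedding lookup.
tokens = tf.keras.Input(shape=(None, dim))

# First LSTM layer: there is no residual connection from `tokens` here,
# which is the connection I am asking about.
lstm1 = tf.keras.layers.LSTM(dim, return_sequences=True)(tokens)

# Second LSTM layer, with the residual connection that bilm-tf does use
# (between the two LSTM layers).
lstm2 = tf.keras.layers.LSTM(dim, return_sequences=True)(lstm1)
lstm2 = tf.keras.layers.Add()([lstm1, lstm2])

# All three layers exposed so a downstream scalar mix can combine them.
bilm = tf.keras.Model(tokens, [tokens, lstm1, lstm2])
```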
I also have some more questions about the implementation that I would really like to discuss with you. Are the issues the right place to ask them, or would you recommend another channel?
We didn't start the ELMo project planning to use the linear combination of layers; that was something we discovered along the way through experimentation. As a result, the biLM architecture evolved independently of the linear-combination use case (and, e.g., using just the top layer is a perfectly sensible thing to do). The details of the residual connection between just the LSTM layers were borrowed from Melis et al., "On the State of the Art of Evaluation in Neural Language Models", https://arxiv.org/abs/1707.05589 (in particular, see Figure 1 in v1 of the paper, which was the version available when we ran the experiments and wrote the paper; v2 appeared later).
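For concreteness, the linear combination from the paper amounts to something like the following (a minimal numpy sketch, not the released bilm-tf/allennlp code):

```python
import numpy as np

def scalar_mix(layers, w, gamma=1.0):
    """layers: list of [num_tokens, dim] arrays (token embedding + LSTM outputs);
    w: one unnormalized weight per layer.
    Returns gamma * sum_j softmax(w)_j * layers[j]."""
    s = np.exp(np.asarray(w, dtype=float) - np.max(w))
    s /= s.sum()  # softmax over layers
    return gamma * sum(sj * h for sj, h in zip(s, layers))
```

Using just the top layer is the special case where the softmax puts essentially all of its mass on the last weight.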
Does the linear combination of embeddings mainly work because of the skip connections / highway layers in the model? Without these, the embeddings from different layers would live in different vector spaces, and addition would not make sense. If the biLM is trained without the skip/highway connections, I'd expect ELMo either to give bad results or the linear combination to put all of its weight on the most informative embedding layer, with the weights for the other layers going to zero. Is this intuition correct?
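As a toy illustration of one aspect of this intuition: if two layers' activations differ even just in scale, an equal-weight sum is already dominated by the larger one (hypothetical numbers, plain numpy, not from the paper or repo):

```python
import numpy as np

rng = np.random.default_rng(0)
h0 = rng.normal(scale=1.0, size=(5, 8))    # e.g. token-embedding layer
h2 = rng.normal(scale=10.0, size=(5, 8))   # e.g. top LSTM layer, larger scale

mixed = 0.5 * h0 + 0.5 * h2                # equal-weight combination

# Cosine similarity of the mix with each layer: the mix is almost entirely h2.
cos = lambda a, b: float((a * b).sum() / (np.linalg.norm(a) * np.linalg.norm(b)))
print(cos(mixed.ravel(), h0.ravel()), cos(mixed.ravel(), h2.ravel()))
```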