Closed Alaska47 closed 5 years ago
Hi, thanks for the very clear issue.
A couple of things:
I think the non-determinism you are seeing is actually due to dropout, since you are passing a value of 0.5; this makes every run non-deterministic. In general you need to call module.eval() to turn dropout off.
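To see why this matters, here is a minimal, dependency-free sketch of inverted dropout (not AllenNLP's or PyTorch's actual implementation, just the idea): in training mode each call randomly zeroes units, so repeated calls on the same input differ; in eval mode dropout is the identity, so outputs are repeatable.

```python
import random

def dropout(x, p=0.5, training=True):
    """Inverted dropout: zero each unit with probability p during training,
    scaling survivors by 1/(1-p); identity function in eval mode."""
    if not training:
        return list(x)
    return [0.0 if random.random() < p else v / (1 - p) for v in x]

x = [1.0, 2.0, 3.0, 4.0]
# Training mode: two calls on the same input will generally differ.
a = dropout(x, training=True)
b = dropout(x, training=True)
# Eval mode: dropout is a no-op, so the output is deterministic.
c = dropout(x, training=False)
d = dropout(x, training=False)
print(c == d)  # True
```

This is why calling `module.eval()` (which sets `training=False` on all submodules in PyTorch) makes repeated embedding computations return identical results.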
Your code snippet looks like it calls generate_embeddings_bilstm in both for loops, but I'm assuming that's not the case, because the outputs are quite different.
You are correct that the BidirectionalLanguageModelTokenEmbedder
computes a scalar mixture of the N layers of the model before returning the embeddings to you. This is annoying to fix because we haven't designed the API for getting embeddings super well. Here is one way you can do it:
```python
from typing import Dict

import torch

from allennlp.modules.token_embedders import BidirectionalLanguageModelTokenEmbedder
from allennlp.nn.util import (add_sentence_boundary_token_ids,
                              get_text_field_mask,
                              remove_sentence_boundaries)


class TransformerElmoWrapper(BidirectionalLanguageModelTokenEmbedder):
    def forward(self,  # type: ignore
                inputs: torch.Tensor) -> Dict[str, torch.Tensor]:
        """
        Parameters
        ----------
        inputs: ``torch.Tensor``
            Shape ``(batch_size, timesteps, ...)`` of token ids representing the current batch.
            These must have been produced using the same indexer the LM was trained on.

        Returns
        -------
        The bidirectional language model representations for the input sequence, shape
        ``(batch_size, timesteps, embedding_dim)``
        """
        # pylint: disable=arguments-differ
        if self._bos_indices is not None:
            mask = get_text_field_mask({"": inputs})
            inputs, mask = add_sentence_boundary_token_ids(
                inputs, mask, self._bos_indices, self._eos_indices
            )

        source = {self._token_name: inputs}
        result_dict = self._lm(source)

        # shape (batch_size, timesteps, embedding_size)
        noncontextual_token_embeddings = result_dict["noncontextual_token_embeddings"]
        contextual_embeddings = result_dict["lm_embeddings"]

        # Typically the non-contextual embeddings are smaller than the contextualized
        # embeddings. Since we're averaging all the layers we need to make their
        # dimensions match. Simply repeating the non-contextual embeddings is a crude,
        # but effective, way to do this.
        duplicated_character_embeddings = torch.cat(
            [noncontextual_token_embeddings] * self._character_embedding_duplication_count, -1
        )
        averaged_embeddings = self._scalar_mix(
            [duplicated_character_embeddings] + contextual_embeddings
        )
        all_embeddings = [duplicated_character_embeddings] + contextual_embeddings

        # Apply dropout to the averaged embeddings only.
        averaged_embeddings = self._dropout(averaged_embeddings)
        if self._remove_bos_eos:
            averaged_embeddings, _ = remove_sentence_boundaries(
                averaged_embeddings, result_dict["mask"]
            )
            # remove_sentence_boundaries returns (tensor, mask); keep only the tensor.
            all_embeddings = [remove_sentence_boundaries(x, result_dict["mask"])[0]
                              for x in all_embeddings]

        return {"averaged_embeddings": averaged_embeddings, "lm_embeddings": all_embeddings}
```
Basically it's just a wrapper around the existing token embedder which returns all of the embeddings (they will be a list of size N of (batch_size, sentence_length, lm_dim) tensors). Does that clear things up for you?
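The dimension-matching step in the middle of the wrapper can be illustrated in isolation (plain Python, with hypothetical sizes): if the non-contextual token embeddings have dimension d and the contextual layers have dimension k*d, concatenating the token embedding with itself k times makes the shapes line up for the scalar mix.

```python
# Hypothetical sizes: token embeddings are 4-dimensional, contextual
# layers are 8-dimensional, so the duplication count is 8 // 4 = 2.
token_embedding = [0.1, 0.2, 0.3, 0.4]
contextual_dim = 8
duplication_count = contextual_dim // len(token_embedding)

# Same idea as torch.cat([emb] * count, -1) in the wrapper above,
# applied along the last (embedding) dimension.
duplicated = token_embedding * duplication_count
print(len(duplicated))  # 8
```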
Thanks for the quick response.
I ended up coming to that same conclusion after reading this issue. Should I also disable dropout when I'm training my model? Specifically, the task I'm working on is named entity recognition. I'm using ELMo embeddings on top of a biLSTM CRF (using code from https://github.com/Hironsan/anago). When I'm training the model, should I get the embeddings using .eval(), or should I get the embeddings non-deterministically? Along the same lines, what is the benefit of enabling dropout to get non-deterministic embeddings when training? Wouldn't I want the embeddings to remain the same across epochs, so as not to confuse the network?
You are correct, it was a typo :)
What exactly is the difference between averaged_embeddings and lm_embeddings? The averaged embeddings seem to be a tensor of size (sentence_length, 1024). Is all_embeddings just a list of size N of the outputs from each layer in the transformer? Additionally, how does the biLSTM implementation of ELMo get its embeddings (is it doing a weighted average of the CNN and the two LSTMs, or is it just returning one layer's output)?
Other than that, thanks for the help! I'm just getting started with ELMo and your library and code has been very helpful in understanding it.
You might want to look at the allennlp tutorial here: https://allennlp.org/tutorials
and the training configuration here: https://github.com/allenai/allennlp/blob/master/training_config/ner_elmo.jsonnet
for NER within allennlp, as you'll find the integration with ELMo a lot easier. During training, you probably want to leave dropout on, assuming that you are regenerating the ELMo embeddings every epoch. Dropout is a form of regularisation - you can read about it in the original paper.
The averaged embeddings are simply a weighted combination of all the layers, where the weights are designed to be learned for a downstream task. The lm layers in the code snippet above are just the outputs of each layer in the transformer, as you said.
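As a rough sketch of what that weighted combination computes (pure Python, simplified from AllenNLP's ScalarMix): the per-layer scalar parameters are softmax-normalised into weights, the layers are summed with those weights, and the result is scaled by a learned gamma. In the real module the scalar parameters and gamma are trained along with the downstream task; here they are just fixed numbers.

```python
import math

def scalar_mix(layers, scalar_parameters, gamma=1.0):
    """Weighted average of per-layer vectors:
    gamma * sum_i softmax(s)_i * layer_i."""
    exps = [math.exp(s) for s in scalar_parameters]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(layers[0])
    return [gamma * sum(w * layer[j] for w, layer in zip(weights, layers))
            for j in range(dim)]

# Three hypothetical "layers" for one token, embedding dim 2.
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Equal scalar parameters -> uniform weights of 1/3 each.
mixed = scalar_mix(layers, [0.0, 0.0, 0.0])
print(mixed)  # ≈ [0.667, 0.667]
```

With equal parameters this reduces to a plain average of the layers; after training, the weights typically favour the layers most useful for the task.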
Question
So after reading a couple of issues regarding the statefulness and non-deterministic nature of ELMo models, I think I understand why the ELMo embeddings I calculate for a single sentence differ each time.
From this comment it seems like using the transformer implementation should give me more consistent results when I get embeddings for the same sentence.
However, when running the following code, I see the opposite result. After a couple runs, the biLSTM implementation seems to stabilize and provide the same embeddings for each run. The transformer implementation still continues to give varying embeddings for the same sentence without stabilizing.
Code:
Results:
I suspect the difference between the two lies in
allennlp/modules/token_embedders/language_model_token_embedder.py
, where it seems the embeddings are calculated by taking an average of all the layers. The biLSTM, however, seems to calculate the embeddings by taking some linear combination of the CNN and the two LSTM layers. I'm not quite sure how to get the same result with the transformer implementation. Is there a way to get embeddings from the ELMo Transformer with behavior similar to the biLSTM implementation? I want to calculate embeddings using the ELMo Transformer mainly for speed when dealing with sentences with large numbers of tokens.