Closed Alaska47 closed 5 years ago
Hi, thanks for the very clear issue.
A couple of things:
I think the non-determinism you are seeing is actually due to dropout, since you are passing a value of 0.5; this makes every run non-deterministic. In general you need to call module.eval() to turn dropout off.
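To see why this matters, here is a minimal, dependency-free sketch of inverted dropout (not AllenNLP's or PyTorch's actual implementation, just the idea): in training mode each call randomly zeroes units, so repeated calls on the same input differ; in eval mode dropout is the identity, so outputs are repeatable.

```python
import random

def dropout(x, p=0.5, training=True):
    """Inverted dropout: zero each unit with probability p during training,
    scaling survivors by 1/(1-p); identity function in eval mode."""
    if not training:
        return list(x)
    return [0.0 if random.random() < p else v / (1 - p) for v in x]

x = [1.0, 2.0, 3.0, 4.0]
# Training mode: two calls on the same input will generally differ.
a = dropout(x, training=True)
b = dropout(x, training=True)
# Eval mode: dropout is a no-op, so the output is deterministic.
c = dropout(x, training=False)
d = dropout(x, training=False)
print(c == d)  # True
```

This is why calling `module.eval()` (which sets `training=False` on all submodules in PyTorch) makes repeated embedding computations return identical results.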
Your code snippet looks like it calls generate_embeddings_bilstm in both for loops, but I'm assuming that's not the case, because the outputs are quite different.
You are correct that the BidirectionalLanguageModelTokenEmbedder
computes a scalar mixture of the N layers of the model before returning the embeddings to you. This is annoying to fix because we haven't designed the API for getting embeddings super well. Here is one way you can do it:
```python
from typing import Dict

import torch

from allennlp.modules.token_embedders import BidirectionalLanguageModelTokenEmbedder
from allennlp.nn.util import (add_sentence_boundary_token_ids,
                              get_text_field_mask,
                              remove_sentence_boundaries)


class TransformerElmoWrapper(BidirectionalLanguageModelTokenEmbedder):
    def forward(self,  # type: ignore
                inputs: torch.Tensor) -> Dict[str, torch.Tensor]:
        """
        Parameters
        ----------
        inputs: ``torch.Tensor``
            Shape ``(batch_size, timesteps, ...)`` of token ids representing the current batch.
            These must have been produced using the same indexer the LM was trained on.

        Returns
        -------
        The bidirectional language model representations for the input sequence, shape
        ``(batch_size, timesteps, embedding_dim)``
        """
        # pylint: disable=arguments-differ
        if self._bos_indices is not None:
            mask = get_text_field_mask({"": inputs})
            inputs, mask = add_sentence_boundary_token_ids(
                inputs, mask, self._bos_indices, self._eos_indices
            )

        source = {self._token_name: inputs}
        result_dict = self._lm(source)

        # shape (batch_size, timesteps, embedding_size)
        noncontextual_token_embeddings = result_dict["noncontextual_token_embeddings"]
        contextual_embeddings = result_dict["lm_embeddings"]

        # Typically the non-contextual embeddings are smaller than the contextualized
        # embeddings. Since we're averaging all the layers we need to make their
        # dimensions match. Simply repeating the non-contextual embeddings is a crude,
        # but effective, way to do this.
        duplicated_character_embeddings = torch.cat(
            [noncontextual_token_embeddings] * self._character_embedding_duplication_count, -1
        )
        averaged_embeddings = self._scalar_mix(
            [duplicated_character_embeddings] + contextual_embeddings
        )
        all_embeddings = [duplicated_character_embeddings] + contextual_embeddings

        # Apply dropout to the averaged embeddings only.
        averaged_embeddings = self._dropout(averaged_embeddings)
        if self._remove_bos_eos:
            averaged_embeddings, _ = remove_sentence_boundaries(
                averaged_embeddings, result_dict["mask"]
            )
            # remove_sentence_boundaries returns (tensor, mask); keep only the tensor.
            all_embeddings = [remove_sentence_boundaries(x, result_dict["mask"])[0]
                              for x in all_embeddings]

        return {"averaged_embeddings": averaged_embeddings, "lm_embeddings": all_embeddings}
```
Basically it's just a wrapper around the existing token embedder which returns all of the embeddings (they will be a list of size N of (batch_size, sentence_length, lm_dim) tensors). Does that clear things up for you?
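The dimension-matching step in the middle of the wrapper can be illustrated in isolation (plain Python, with hypothetical sizes): if the non-contextual token embeddings have dimension d and the contextual layers have dimension k*d, concatenating the token embedding with itself k times makes the shapes line up for the scalar mix.

```python
# Hypothetical sizes: token embeddings are 4-dimensional, contextual
# layers are 8-dimensional, so the duplication count is 8 // 4 = 2.
token_embedding = [0.1, 0.2, 0.3, 0.4]
contextual_dim = 8
duplication_count = contextual_dim // len(token_embedding)

# Same idea as torch.cat([emb] * count, -1) in the wrapper above,
# applied along the last (embedding) dimension.
duplicated = token_embedding * duplication_count
print(len(duplicated))  # 8
```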
Thanks for the quick response.
I ended up coming to that same conclusion after reading this issue. Should I also disable dropout when I'm training my model? Specifically, the task I'm working on is named entity recognition. I'm using ELMo embeddings on top of a biLSTM CRF (using code from https://github.com/Hironsan/anago). When I'm training the model, should I get the embeddings using .eval(), or should I get the embeddings non-deterministically? Along the same lines, what is the benefit of enabling dropout to get non-deterministic embeddings when training? Wouldn't I want the embeddings to remain the same across epochs, so as not to confuse the network?
You are correct, it was a typo :)
What exactly is the difference between averaged_embeddings and lm_embeddings? The averaged embeddings seem to be a tensor of size (sentence_length, 1024). Is all_embeddings just a list of size N of the outputs from each layer in the transformer? Additionally, how does the biLSTM implementation of ELMo get its embeddings (is it doing a weighted average of the CNN and the two LSTMs, or is it just returning one layer's output)?
Other than that, thanks for the help! I'm just getting started with ELMo and your library and code has been very helpful in understanding it.
You might want to look at the allennlp tutorial here: https://allennlp.org/tutorials
and the training configuration here: https://github.com/allenai/allennlp/blob/master/training_config/ner_elmo.jsonnet
for NER within allennlp, as you'll find the integration with ELMo a lot easier. During training, you probably want to leave dropout on, assuming that you are regenerating the ELMo embeddings every epoch. Dropout is a form of regularisation - you can read about it in the original paper.
The averaged embeddings are simply a weighted combination of all the layers, where the weights are designed to be learned for a downstream task. The lm layers in the code snippet above are just the outputs of each layer in the transformer, as you said.
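As a rough sketch of what that weighted combination computes (pure Python, simplified from AllenNLP's ScalarMix): the per-layer scalar parameters are softmax-normalised into weights, the layers are summed with those weights, and the result is scaled by a learned gamma. In the real module the scalar parameters and gamma are trained along with the downstream task; here they are just fixed numbers.

```python
import math

def scalar_mix(layers, scalar_parameters, gamma=1.0):
    """Weighted average of per-layer vectors:
    gamma * sum_i softmax(s)_i * layer_i."""
    exps = [math.exp(s) for s in scalar_parameters]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(layers[0])
    return [gamma * sum(w * layer[j] for w, layer in zip(weights, layers))
            for j in range(dim)]

# Three hypothetical "layers" for one token, embedding dim 2.
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Equal scalar parameters -> uniform weights of 1/3 each.
mixed = scalar_mix(layers, [0.0, 0.0, 0.0])
print(mixed)  # ≈ [0.667, 0.667]
```

With equal parameters this reduces to a plain average of the layers; after training, the weights typically favour the layers most useful for the task.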
Question
So after reading a couple of issues regarding the statefulness and non-deterministic nature of ELMo models, I think I understand why the ELMo embeddings I calculate for a single sentence differ each time.
From this comment it seems like using the transformer implementation should give me more consistent results when I get embeddings for the same sentence.
However, when running the following code, I see the opposite result. After a couple runs, the biLSTM implementation seems to stabilize and provide the same embeddings for each run. The transformer implementation still continues to give varying embeddings for the same sentence without stabilizing.
Code:
Results:
I suspect the difference between the two lies in
allennlp/modules/token_embedders/language_model_token_embedder.py
, where it seems the embeddings are calculated by taking an average of all the layers. The biLSTM, however, seems to calculate the embeddings by taking some linear combination of the CNN and the two LSTM layers. I'm not quite sure how to get the same result with the transformer implementation. Is there a way to get embeddings from the ELMo Transformer with behavior similar to the biLSTM implementation? I want to calculate embeddings using the ELMo Transformer mainly for speed when dealing with sentences with large numbers of tokens.