Zessay opened this issue 2 years ago
@AkshitaB this is just a friendly ping to make sure you haven't forgotten about this issue 😜
`PretrainedTransformerIndexer` doesn't get to see the vocab in its `__init__` method, so we can't add it there.
In principle, the way it should work is that AllenNLP discovers the vocab automatically. That's how it worked before standardized vocabularies became the norm. But it takes a long time to discover a vocab on a big dataset, and most of the time it's not necessary anymore, so we put in this hack to short-cut vocabulary discovery.
You are right that because of this hack, you don't see the whole vocab in `Model.__init__()`. Usually, we pass the name of the transformer model to `Model.__init__()`, so it can figure out what it needs from there. But it's not that elegant.
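A minimal sketch of that pattern, with hypothetical names rather than the actual AllenNLP code: instead of reading the output size from a `Vocabulary` namespace that may still be empty at construction time, the model takes the transformer's name and derives the size itself.

```python
class ToyModel:
    """Hypothetical sketch (not the real AllenNLP Model class): the model
    derives its output dimension from the transformer name instead of from
    a Vocabulary namespace that may still be empty in __init__."""

    # In real code this lookup would be something like
    # AutoTokenizer.from_pretrained(model_name).vocab_size; it's hard-coded
    # here so the sketch stays self-contained and offline.
    _VOCAB_SIZES = {"bert-base-uncased": 30522}

    def __init__(self, model_name: str):
        # e.g. the size of a final linear projection over the vocab
        self.output_dim = self._VOCAB_SIZES[model_name]


model = ToyModel("bert-base-uncased")
print(model.output_dim)  # 30522
```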
If you have a better idea for how to do it, feel free to make a PR. I'd be happy to review it!
I ran into a problem yesterday when I wanted to get the `vocab_size` of the vocabulary namespace `transformer_tags`, which I specified in `PretrainedTransformerIndexer`. I found that this namespace isn't defined yet when I try to use it in the `Model` init phase. To figure out this problem, I read that part of the source code. I found that `_add_encoding_to_vocabulary_if_needed` is called in `tokens_to_indices`, which isn't called until `_train_epoch` starts. That means the `transformer_tags` namespace I defined in the vocabulary can only be used in the `forward` pass and afterwards. This causes a problem: I can't define a linear layer that transforms the model output to `vocab_size` logits. Of course, I can use the `from_pretrained_transformer` constructor to get the same namespace for use in the init phase. But if this behavior is intended, I wonder what the purpose of `_add_encoding_to_vocabulary_if_needed` in `PretrainedTransformerIndexer` is. Why not call `_add_encoding_to_vocabulary_if_needed` in the `__init__` method of `PretrainedTransformerIndexer`, so that the specified namespace can be used from the `Model` init phase?

Looking forward to your kind reply ~ Thanks!
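The timing issue described above can be sketched with plain dicts (hypothetical classes, not the real AllenNLP ones): the namespace is empty when the model would be constructed, and is only populated the first time indexing runs.

```python
class ToyVocabulary:
    """Stand-in for a vocabulary with named namespaces."""

    def __init__(self):
        self.namespaces = {}

    def get_vocab_size(self, namespace):
        return len(self.namespaces.get(namespace, {}))

    def extend(self, namespace, tokens):
        ns = self.namespaces.setdefault(namespace, {})
        for tok in tokens:
            ns.setdefault(tok, len(ns))


class LazyIndexer:
    """Mimics the lazy behavior described above: the namespace is only
    filled on the first tokens_to_indices call, not in __init__."""

    def __init__(self, namespace="transformer_tags"):
        self.namespace = namespace
        self._added = False

    def tokens_to_indices(self, tokens, vocab):
        if not self._added:  # stands in for _add_encoding_to_vocabulary_if_needed
            vocab.extend(self.namespace, ["[PAD]", "[CLS]", "hello", "world"])
            self._added = True
        ns = vocab.namespaces[self.namespace]
        return [ns[t] for t in tokens]


vocab = ToyVocabulary()
indexer = LazyIndexer()
# Too late for Model.__init__: the namespace doesn't exist yet.
print(vocab.get_vocab_size("transformer_tags"))  # 0
indexer.tokens_to_indices(["hello", "world"], vocab)
# Only after indexing (i.e. once training starts) is the namespace populated.
print(vocab.get_vocab_size("transformer_tags"))  # 4
```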