allenai / allennlp

An open-source NLP research library, built on PyTorch.
http://www.allennlp.org
Apache License 2.0

MultiLabelField not being indexed correctly with pre-trained transformer #5430

Closed david-waterworth closed 3 years ago

david-waterworth commented 3 years ago

This is probably a user error, but I cannot find a vocabulary constructor for the config that works correctly with a MultiLabelField (i.e. a multi-label classifier).

I need to set the vocab's unk and pad tokens since I'm using a Hugging Face transformer, and of course I also need to index the labels.

When I use from_pretrained_transformer to construct my vocabulary there are two issues. First, when MultiLabelField.index is called, the vocab only contains a tokens namespace and no labels namespace. This causes index to crash. Oddly, vocab.get_token_index(label, self._label_namespace) returns 1 (one) for every label despite the namespace not existing; shouldn't it raise an error instead?

vocabulary: {
    type: "from_pretrained_transformer",
    model_name: "models/transformer",
}
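
Here's a minimal sketch of how I'm reproducing this, with "bert-base-uncased" standing in for my local "models/transformer" (the behaviour noted in the comments is what I observe):

from allennlp.data import Vocabulary
from allennlp.data.fields import MultiLabelField

# Build the vocab the same way the "from_pretrained_transformer" config entry does.
vocab = Vocabulary.from_pretrained_transformer(model_name="bert-base-uncased")
print(vocab.get_namespaces())                      # only "tokens" is present, no "labels"

field = MultiLabelField(["label_a", "label_b"])
print(vocab.get_token_index("label_a", "labels"))  # returns 1 here instead of raising
field.index(vocab)                                 # this is where I hit the crash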

Also, inspecting the vocab object I see

_oov_token: '<unk>'  _padding_token: '@@PADDING@@'

So it has failed to infer the padding token. From what I can see, from_pretrained_transformer has no padding_token argument?
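
For comparison, this is roughly how I'm checking the special tokens (same stand-in model name as above):

from transformers import AutoTokenizer
from allennlp.data import Vocabulary

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
vocab = Vocabulary.from_pretrained_transformer(model_name="bert-base-uncased")

print(tokenizer.pad_token, tokenizer.unk_token)  # the transformer's own special tokens
print(vocab._padding_token, vocab._oov_token)    # _padding_token stays at the default '@@PADDING@@'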

If I use 'from_instances' it indexes the labels correctly, but as far as I know it also re-indexes the tokens namespace, and that mapping ends up out of alignment with the transformer's original vocab.

My config is

vocabulary: {
    type: "from_pretrained_transformer",
    model_name: "models/transformer",
},
dataset_reader: {
    type: "multi_label",
    tokenizer: {
      type: "pretrained_transformer",
      model_name: "models/transformer"
    },
    token_indexers: {
        tokens: {
            type: "pretrained_transformer",
            model_name: "models/transformer",
            namespace: "tokens" 
        },
    },
},
model: {
    type: "multi_label",
    text_field_embedder: {
        token_embedders: {
            tokens: {
                type: "pretrained_transformer",
                model_name: "models/transformer"
            }
        },
    },
    seq2vec_encoder: {
        type: "bert_pooler",
        pretrained_model: "models/transformer",
        dropout: 0.1,
    },
},
david-waterworth commented 3 years ago

I got it working: rather than use from_pretrained_transformer, I used the default (i.e. no vocabulary entry in the config file).

It seems that it first indexes using from_instances (which I assume produces an incorrect token/id mapping), but then the PretrainedTransformerIndexer re-indexes the tokens namespace correctly using the tokenizer vocab.

I noticed this happens with from_pretrained_transformer as well: vocab.add_transformer_vocab gets called twice, once by from_pretrained_transformer and once by PretrainedTransformerIndexer._add_encoding_to_vocabulary_if_needed.
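
In code, the flow seems to be roughly equivalent to this sketch (toy instances and a public model name, just to illustrate the two steps):

from allennlp.data import Instance, Vocabulary
from allennlp.data.fields import MultiLabelField
from allennlp.data.token_indexers import PretrainedTransformerIndexer

# Toy instances standing in for the dataset reader's output, just to populate "labels".
instances = [
    Instance({"labels": MultiLabelField(["a", "b"])}),
    Instance({"labels": MultiLabelField(["b", "c"])}),
]

vocab = Vocabulary.from_instances(instances)  # builds the "labels" namespace
indexer = PretrainedTransformerIndexer(model_name="bert-base-uncased", namespace="tokens")
indexer._add_encoding_to_vocabulary_if_needed(vocab)  # (re)fills "tokens" from the tokenizer vocab

print(vocab.get_vocab_size("labels"))  # 3
print(vocab.get_vocab_size("tokens"))  # the transformer's full vocab size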

It seems to be ok; as far as I can tell it's using the correct index/token mapping for the text and special tokens, and by doing it this way no tokens.txt file is produced.