Closed: david-waterworth closed this issue 3 years ago
I got it working. Rather than use `from_pretrained_transformer`, I used the default constructor (i.e. no `vocabulary` entry in the jsonnet config). It seems the vocabulary is first built using `from_instances` (which I assume produces an incorrect token/id mapping), but then `PretrainedTransformerIndexer` re-indexes the `tokens` namespace correctly using the tokenizer vocab.
I noticed this happens with `from_pretrained_transformer` as well: `vocab.add_transformer_vocab` gets called twice, once by `from_pretrained_transformer` and once by `PretrainedTransformerIndexer._add_encoding_to_vocabulary_if_needed`.

It seems to be OK. As far as I can tell it's using the correct index/token mapping for the text and special tokens, and by doing it this way no tokens.txt file is produced.
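For reference, a minimal sketch of that working setup; the model name ("bert-base-uncased"), the example sentence, and the label names are just illustrative assumptions:

```python
from allennlp.data import Instance, Vocabulary
from allennlp.data.fields import MultiLabelField, TextField
from allennlp.data.token_indexers import PretrainedTransformerIndexer
from allennlp.data.tokenizers import PretrainedTransformerTokenizer

model_name = "bert-base-uncased"  # illustrative; any huggingface model name
tokenizer = PretrainedTransformerTokenizer(model_name)
indexer = PretrainedTransformerIndexer(model_name, namespace="tokens")

instances = [
    Instance({
        "tokens": TextField(tokenizer.tokenize("an example sentence"),
                            {"tokens": indexer}),
        "labels": MultiLabelField(["label_a", "label_b"]),
    })
]

# With no "vocabulary" entry in the config, the default from_instances
# constructor is used; it builds the "labels" namespace from the instances.
vocab = Vocabulary.from_instances(instances)

# The first time tokens are indexed, the indexer's
# _add_encoding_to_vocabulary_if_needed copies the tokenizer's token/id
# mapping into the "tokens" namespace, so text and special tokens line up.
for instance in instances:
    instance.index_fields(vocab)
```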
This is probably a user error, but I cannot find a vocabulary constructor (configurable from the jsonnet config) that works correctly with a `MultiLabelField` (i.e. a multi-label classifier). I need to set the vocab's `unk` and `pad` tokens because I'm using a huggingface transformer, and of course I also need to index the labels.
When I use `from_pretrained_transformer` to construct my vocabulary there are two issues. First, when `MultiLabelField.index` is called, the vocab only contains a `tokens` namespace and no `labels` namespace. This causes `index` to crash; oddly, `vocab.get_token_index(label, self._label_namespace)` returns 1 (one) for every label despite the namespace not existing. Should it not raise an error instead?
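A minimal sketch of this first issue (again, the model name and label names are only illustrative):

```python
from allennlp.data import Vocabulary
from allennlp.data.fields import MultiLabelField

vocab = Vocabulary.from_pretrained_transformer("bert-base-uncased")

# Only a "tokens" namespace gets created; there is no "labels" namespace
# for a MultiLabelField to index into.
print(vocab.get_namespaces())

# Querying the missing namespace reportedly returns 1 rather than raising:
print(vocab.get_token_index("label_a", "labels"))

# ...and indexing the field crashes, as described above.
MultiLabelField(["label_a", "label_b"]).index(vocab)
```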
Also, inspecting the vocab object I'm seeing:

`_oov_token: '<unk>'`
`_padding_token: '@@PADDING@@'`

So it has failed to infer the padding token. From what I can see, `from_pretrained_transformer` has no `padding_token` argument?
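A small sketch of this second issue, under the same illustrative model name:

```python
from allennlp.data import Vocabulary

vocab = Vocabulary.from_pretrained_transformer("bert-base-uncased")

# In my run _oov_token came out as '<unk>' while _padding_token stayed at the
# AllenNLP default '@@PADDING@@'; I can't see a padding_token argument that
# would let me override it.
print(vocab._oov_token)
print(vocab._padding_token)
```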
If I use `from_instances`, it indexes the labels correctly, but as far as I know it is re-indexing the original vocab and the mapping ends up out of alignment.
My model is