flairNLP / flair

A very simple framework for state-of-the-art Natural Language Processing (NLP)
https://flairnlp.github.io/flair/

[Question]: How do I work out what my custom tokenizer is missing? #3238

Open larsbun opened 1 year ago

larsbun commented 1 year ago

Question

Hi,

I am working on using embeddings from a pre-trained model which is not published. When I try to import it as a TransformerWordEmbedding, it fails with this error message:

Traceback (most recent call last):
  File "seqseq-single.py", line 111, in <module>
    trainer.train(experiment_root,
  File "flair3.8/lib/python3.8/site-packages/flair/trainers/trainer.py", line 304, in train
    context_stack.enter_context(
  File "/usr/lib/python3.8/contextlib.py", line 425, in enter_context
    result = _cm_type.__enter__(cm)
  File "/usr/lib/python3.8/contextlib.py", line 113, in __enter__
    return next(self.gen)
  File "flair3.8/lib/python3.8/site-packages/transformer_smaller_training_vocab/contextual_reduce.py", line 32, in reduce_train_vocab
    saved_vocab = reduce_tokenizer(tokenizer, used_tokens)
  File "flair3.8/lib/python3.8/site-packages/transformer_smaller_training_vocab/modify_tokenizer.py", line 14, in reduce_tokenizer
    set_vocab(tokenizer, reduced_vocab)
  File "flair3.8/lib/python3.8/site-packages/transformer_smaller_training_vocab/transformer_set_vocab/auto_set_vocab.py", line 35, in set_vocab
    set_vocab_function = get_set_vocab_function(tokenizer_cls)
  File "flair3.8/lib/python3.8/site-packages/transformer_smaller_training_vocab/transformer_set_vocab/auto_set_vocab.py", line 30, in get_set_vocab_function
    raise ValueError(f"type '{tokenizer_cls}' has no implementation for setting the vocabulary.")  # pragma: no cover
ValueError: type '<class 'Tokenizer'>' has no implementation for setting the vocabulary.

I tried looking at the code in the relevant places, but there were so many layers of abstraction that I was unable to work it out. I do suspect, however, that only a small detail is missing. How can I find out what this implementation for setting the vocabulary should look like (and thereby fix it)?
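For readers hitting the same error: the traceback suggests that transformer-smaller-training-vocab looks up a per-class `set_vocab` implementation keyed by the tokenizer's type, and raises when none is registered for your class. A minimal plain-Python sketch of that dispatch pattern (all names here are illustrative, not the library's actual API):

```python
# Illustrative sketch of class-keyed dispatch, loosely mirroring the
# get_set_vocab_function lookup seen in the traceback. Names are hypothetical.

SET_VOCAB_REGISTRY = {}

def register_set_vocab(tokenizer_cls):
    """Register a set_vocab implementation for a specific tokenizer class."""
    def decorator(fn):
        SET_VOCAB_REGISTRY[tokenizer_cls] = fn
        return fn
    return decorator

def get_set_vocab_function(tokenizer_cls):
    # Walk the MRO so subclasses of a supported tokenizer also match.
    for cls in tokenizer_cls.__mro__:
        if cls in SET_VOCAB_REGISTRY:
            return SET_VOCAB_REGISTRY[cls]
    raise ValueError(
        f"type '{tokenizer_cls}' has no implementation for setting the vocabulary."
    )

class BertLikeTokenizer:   # stand-in for a supported tokenizer class
    pass

class CustomTokenizer:     # stand-in for an unsupported custom tokenizer
    pass

@register_set_vocab(BertLikeTokenizer)
def set_bert_like_vocab(tokenizer, vocab):
    tokenizer.vocab = vocab

# A registered class resolves; the custom one raises, as in the reported error.
fn = get_set_vocab_function(BertLikeTokenizer)
try:
    get_set_vocab_function(CustomTokenizer)
except ValueError as e:
    print(e)
```

Under this reading, "making the tokenizer supported" means contributing a `set_vocab` implementation for its class to that registry in the library itself.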

helpmefindaname commented 1 year ago

Hi @larsbun, it looks to me like you are trying to use https://github.com/helpmefindaname/transformer-smaller-training-vocab with a tokenizer that is not supported.

That said, it should work if you don't set reduce_transformer_vocab=True on the trainer.train method.

If you still want to use that library, you can open an issue there and describe your tokenizer class to get support.
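To illustrate the suggestion: the vocab reduction is opt-in, and the trainer only enters the reduction context when the flag is set. A toy sketch (hypothetical names, not flair's internals) of how such a flag can gate a context manager via an ExitStack, as in the traceback's `context_stack.enter_context(...)`:

```python
from contextlib import ExitStack, contextmanager

events = []

@contextmanager
def reduce_train_vocab():
    # Stand-in for the vocab-reduction context from the traceback.
    events.append("vocab reduced")
    yield
    events.append("vocab restored")

def train(reduce_transformer_vocab=False):
    # Mirrors the gating pattern: the reduction context is only entered when
    # the flag is True, so an unsupported tokenizer is never touched otherwise.
    with ExitStack() as stack:
        if reduce_transformer_vocab:
            stack.enter_context(reduce_train_vocab())
        events.append("training")

train(reduce_transformer_vocab=False)
print(events)  # → ['training']
```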

larsbun commented 1 year ago

To be specific, it's the SequenceTagger in flair that calls reduce_transformer_vocab, and I am wondering what it takes to make this tokenizer supported. It is not at all clear to me what reduce_transformer_vocab actually does, why it is necessary, what is required to implement it, and so on. I tried setting reduce_transformer_vocab=False for trainer.train, but the result was the same.
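For background (based on the library's stated purpose; the code below is illustrative only): reduce_transformer_vocab temporarily shrinks the tokenizer vocabulary and the model's embedding matrix to just the tokens that occur in the training data, saving memory during fine-tuning, and restores the full vocabulary afterwards. A toy sketch of the keep-only-used-rows idea:

```python
# Toy illustration of vocabulary reduction: keep only the embedding rows for
# tokens that actually appear in the training corpus. All names hypothetical.

full_vocab = {"[PAD]": 0, "the": 1, "cat": 2, "dog": 3, "sat": 4}
embeddings = [[0.0], [0.1], [0.2], [0.3], [0.4]]  # one row per vocab entry

corpus_tokens = {"the", "cat", "sat"}  # tokens seen in the training data
keep = {"[PAD]"} | corpus_tokens       # special tokens are always kept

# Build the reduced vocab with new contiguous ids, and slice the embeddings.
reduced_vocab = {tok: new_id for new_id, tok in
                 enumerate(t for t in full_vocab if t in keep)}
reduced_embeddings = [embeddings[full_vocab[tok]] for tok in reduced_vocab]

print(reduced_vocab)       # → {'[PAD]': 0, 'the': 1, 'cat': 2, 'sat': 3}
print(reduced_embeddings)  # → [[0.0], [0.1], [0.2], [0.4]]
```

The "implementation for setting the vocabulary" the error refers to is the tokenizer-specific step that swaps the reduced vocab into the tokenizer, which is why it must be written per tokenizer class.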

helpmefindaname commented 1 year ago

Hi @larsbun, when you talk about specifics, it would help if you shared the version you are using and the code you were running. The SequenceTagger does not run anything itself; the trainer does, when the feature is activated. Assuming you are on the latest version, this is either a bug or an issue with the parameters.