dariush-bahrami / character-tokenizer

A character tokenizer for Hugging Face Transformers
MIT License

bug fix: NotImplementedError when constructing CharacterTokenizer #2

Closed. kkew3 closed this 4 months ago

kkew3 commented 4 months ago

Abstract

Under transformers==4.41.2, constructing CharacterTokenizer raises NotImplementedError.

Minimal reproducible example

Command to reproduce the error (from command line), plus the error output:

$ python3 -c "from charactertokenizer import CharacterTokenizer; _ = CharacterTokenizer('abc', 1024)"

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/Users/user/Documents/Projects/python3/proj/dariush-bahrami+character-tokenizer/charactertokenizer/core.py", line 44, in __init__
    super().__init__(
  File "/Users/user/Documents/Projects/python3/proj/dariush-bahrami+character-tokenizer/venv/lib/python3.9/site-packages/transformers/tokenization_utils.py", line 367, in __init__
    self._add_tokens(
  File "/Users/user/Documents/Projects/python3/proj/dariush-bahrami+character-tokenizer/venv/lib/python3.9/site-packages/transformers/tokenization_utils.py", line 467, in _add_tokens
    current_vocab = self.get_vocab().copy()
  File "/Users/user/Documents/Projects/python3/proj/dariush-bahrami+character-tokenizer/venv/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 1682, in get_vocab
    raise NotImplementedError()
NotImplementedError

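The error occurs because CharacterTokenizer does not implement the get_vocab() method that the superclass requires. As the traceback shows, the superclass __init__() calls _add_tokens(), which in turn calls get_vocab().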

My fix

To fix the error, I added the required get_vocab() method and adjusted several statements in __init__() so that get_vocab() works.
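
For readers who want to see the failure mode in isolation, here is a minimal sketch. It does not use transformers and is not the code from this PR. BaseTokenizer, BrokenCharTokenizer, and FixedCharTokenizer are hypothetical stand-ins, and the idea that the vocabulary has to be built before super().__init__() runs is an assumption about what "adjusting several statements in __init__()" involves.

```python
class BaseTokenizer:
    """Hypothetical stand-in for a library base class (not the real transformers API)."""

    def __init__(self, special_tokens):
        # As in the traceback, the base constructor consults the vocabulary
        # while it registers extra tokens.
        current_vocab = self.get_vocab().copy()
        self.added_tokens = [t for t in special_tokens if t not in current_vocab]

    def get_vocab(self):
        # Subclasses are expected to override this method.
        raise NotImplementedError()


class BrokenCharTokenizer(BaseTokenizer):
    """Fails on construction: get_vocab() is never overridden."""

    def __init__(self, characters):
        super().__init__(["[UNK]"])
        self._vocab = {ch: i for i, ch in enumerate(characters)}


class FixedCharTokenizer(BaseTokenizer):
    """Overrides get_vocab() and builds the vocabulary before the base constructor runs."""

    def __init__(self, characters):
        # Assumption: the vocabulary must exist before super().__init__(),
        # because the base constructor calls get_vocab() immediately.
        self._vocab = {ch: i for i, ch in enumerate(characters)}
        super().__init__(["[UNK]"])

    def get_vocab(self):
        return self._vocab
```

In this sketch, constructing BrokenCharTokenizer("abc") raises NotImplementedError from inside the base constructor, while FixedCharTokenizer("abc") constructs normally. Overriding get_vocab() alone would not be enough if the vocabulary were still assigned after super().__init__(), which may be why the constructor needed changes as well.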

dariush-bahrami commented 4 months ago

Thank you for fixing this issue