WorksApplications / SudachiTra

Japanese tokenizer for Transformers
Apache License 2.0

sudachitra and other custom tokenizers no longer compatible with transformers later than 4.34 #66

Closed mingboiz closed 8 months ago

mingboiz commented 9 months ago

System Info

transformers version: 4.34.0
Platform: linux
Python version: 3.9.18
sudachitra version: 0.1.8
sudachipy version: 0.6.7
sudachi-core version: 20230927

Upstream changes in transformers from PR https://github.com/huggingface/transformers/pull/23909 cause an error when running the example at https://huggingface.co/megagonlabs/transformers-ud-japanese-electra-base-discriminator. This happens for other custom tokenizers as well: https://github.com/huggingface/transformers/issues/26777

from sudachitra import ElectraSudachipyTokenizer
tokenizer = ElectraSudachipyTokenizer.from_pretrained("megagonlabs/transformers-ud-japanese-electra-base-discriminator")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/lib64/python3.9/site-packages/transformers/tokenization_utils_base.py", line 2045, in from_pretrained
    return cls._from_pretrained(
  File "/home/lib64/python3.9/site-packages/transformers/tokenization_utils_base.py", line 2256, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/home/lib64/python3.9/site-packages/sudachitra/tokenization_bert_sudachipy.py", line 155, in __init__
    super().__init__(
  File "/home/lib64/python3.9/site-packages/transformers/tokenization_utils.py", line 366, in __init__
    self._add_tokens(self.all_special_tokens_extended, special_tokens=True)
  File "/home/lib64/python3.9/site-packages/transformers/tokenization_utils.py", line 462, in _add_tokens
    current_vocab = self.get_vocab().copy()
  File "/home/lib64/python3.9/site-packages/sudachitra/tokenization_bert_sudachipy.py", line 218, in get_vocab
    return dict(self.vocab, **self.added_tokens_encoder)
AttributeError: 'ElectraSudachipyTokenizer' object has no attribute 'vocab'
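
For context, the traceback shows that since transformers 4.34 PreTrainedTokenizer.__init__ calls self.get_vocab() (via _add_tokens), so a subclass that only builds self.vocab after calling super().__init__() now fails. A minimal sketch of the kind of init-order fix this implies, with placeholder names (load_vocab and ExampleSudachipyTokenizer are not the actual SudachiTra code):

# Sketch only: the vocabulary must exist before super().__init__() runs,
# because PreTrainedTokenizer.__init__ now calls self.get_vocab() via _add_tokens().
import collections

from transformers import PreTrainedTokenizer


def load_vocab(vocab_file):
    # hypothetical helper: one token per line, mapped to its line index
    vocab = collections.OrderedDict()
    with open(vocab_file, encoding="utf-8") as f:
        for index, line in enumerate(f):
            vocab[line.rstrip("\n")] = index
    return vocab


class ExampleSudachipyTokenizer(PreTrainedTokenizer):
    def __init__(self, vocab_file, **kwargs):
        # build the vocab BEFORE calling super().__init__(), not after
        self.vocab = load_vocab(vocab_file)
        self.ids_to_tokens = collections.OrderedDict(
            (index, token) for token, index in self.vocab.items()
        )
        super().__init__(**kwargs)

    @property
    def vocab_size(self):
        return len(self.vocab)

    def get_vocab(self):
        return dict(self.vocab, **self.added_tokens_encoder)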

If it is OK, I would like to contribute and submit a PR to fix this issue.

eiennohito commented 9 months ago

PRs are very welcome! If you fix it, we will be very grateful.

mingboiz commented 9 months ago

Hi @eiennohito, I have a working draft PR, but 4/20 of the tests are failing (the ones that use the normalized-form-related word_form_type settings).

For example, here is one failing test:

def test_sudachipy_tokenizer_normalized_form(self):
    tokenizer = self.tokenizer_class(self.vocab_file, do_subword_tokenize=False,
                                     word_form_type='normalized')

    self.assertListEqual(
        tokenizer.tokenize("appleの辞書形はAppleで正規形はアップルである。"),
        ["アップル", "の", "辞書", "形", "は", "アップル", "で", "正規", "形", "は", "アップル", "だ", "有る", "。"]
    )

The difference was that Apple/apple gets normalized to アップル, which looks correct (see the sketch after the diff):

- ['Apple',
+ ['アップル',
...
-  'Apple',
+  'アップル',
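
With word_form_type='normalized' the token text comes from SudachiPy's normalized form rather than the surface string, so the expected strings depend on the installed dictionary. A purely hypothetical illustration of that kind of mapping (not SudachiTra's actual implementation):

# hypothetical mapping from word_form_type to the SudachiPy accessor used for token text
FORM_GETTERS = {
    "surface": lambda m: m.surface(),
    "dictionary": lambda m: m.dictionary_form(),
    "normalized": lambda m: m.normalized_form(),
}

def morpheme_to_token(morpheme, word_form_type="surface"):
    # pick the requested form for a single SudachiPy Morpheme
    return FORM_GETTERS[word_form_type](morpheme)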

I also re-ran the same tests on a branch forked from the latest master, and it failed the same 4/20 tests. The only changes there were pinning the transformers requirement to 4.33.3 (with the matching tokenizers version) and pinning the Python version to 3.9, which has prebuilt wheels, so the tests could run.

I don't have Japanese language knowledge, so could you please guide me on the failing tests? Is there a way for me to get the correct expected output? I read the source code and tried to tokenize with SudachiPy manually to get the normalized form, but I did not get the expected test results either:

(base)  ✘ yangming@ming  ~/Code/SudachiTra   check/test  /usr/bin/python3 -m pip freeze | grep -i sudachi
SudachiDict-core==20230927
SudachiPy==0.6.7
SudachiTra==0.1.8
(base)  yangming@ming  ~/Code/SudachiTra   check/test  /usr/bin/python3            
Python 3.9.6 (default, Aug 11 2023, 19:44:49) 
[Clang 15.0.0 (clang-1500.0.40.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from sudachipy import Dictionary, SplitMode
>>> tokenizer = Dictionary().create()
>>> text = "appleの辞書形はAppleで正規形はアップルである。"
>>> morphemes = tokenizer.tokenize(text)
>>> morphemes
<MorphemeList[
  <Morpheme(apple, 0:5, (0, 551))>,
  <Morpheme(の, 5:6, (0, 119137))>,
  <Morpheme(辞書形, 6:9, (0, 1533531))>,
  <Morpheme(は, 9:10, (0, 121601))>,
  <Morpheme(Apple, 10:15, (0, 551))>,
  <Morpheme(で, 15:16, (0, 101431))>,
  <Morpheme(正規, 16:18, (0, 524191))>,
  <Morpheme(形, 18:19, (0, 422564))>,
  <Morpheme(は, 19:20, (0, 121601))>,
  <Morpheme(アップル, 20:24, (0, 173324))>,
  <Morpheme(で, 24:25, (0, 101428))>,
  <Morpheme(ある, 25:27, (0, 12718))>,
  <Morpheme(。, 27:28, (0, 6912))>,
]>
>>> morphemes[0].normalized_form()
'Apple'
>>> morphemes[0].surface()
'apple'

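A compact way to compare the surface, normalized, and dictionary forms of every morpheme against the installed dictionary (assuming SudachiPy 0.6.x with SudachiDict-core installed) is something like:

from sudachipy import Dictionary

tokenizer = Dictionary().create()
text = "appleの辞書形はAppleで正規形はアップルである。"
for m in tokenizer.tokenize(text):
    # surface form alongside the normalized and dictionary forms
    print(m.surface(), m.normalized_form(), m.dictionary_form())
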
eiennohito commented 9 months ago

There were changes in the dictionary regarding these forms. Please submit the PR as a WIP, and I will think about what can be done.

mingboiz commented 9 months ago

@eiennohito thanks! I have opened a WIP PR here (it can't seem to link to this issue): https://github.com/WorksApplications/SudachiTra/pull/67