Closed mingboiz closed 8 months ago
PRs are super welcome! If you will fix it, we will be very grateful.
Hi @eiennohito, I have a working draft PR here: but 4/20 tests are failing (those that use word_form_type='normalized'-related forms).
For example, here is one failing test:
def test_sudachipy_tokenizer_normalized_form(self):
    tokenizer = self.tokenizer_class(self.vocab_file, do_subword_tokenize=False,
                                     word_form_type='normalized')
    self.assertListEqual(
        tokenizer.tokenize("appleの辞書形はAppleで正規形はアップルである。"),
        ["アップル", "の", "辞書", "形", "は", "アップル", "で", "正規", "形", "は", "アップル", "だ", "有る", "。"]
    )
The difference is that Apple/apple gets normalized to アップル, which looks correct:
- ['Apple',
+ ['アップル',
...
- 'Apple',
+ 'アップル',
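The mismatch can be reduced to a dependency-free sketch. The `pick_forms` helper below is hypothetical and only mimics what a word_form_type='normalized' dispatch presumably does inside SudachiTra; the surface/normalized pairs are taken from the SudachiPy REPL transcript later in this thread.

```python
def pick_forms(morphemes, word_form_type="surface"):
    """Hypothetical dispatch: morphemes is a list of (surface, normalized)
    string pairs standing in for real SudachiPy Morpheme objects."""
    if word_form_type == "normalized":
        return [normalized for _, normalized in morphemes]
    return [surface for surface, _ in morphemes]

# With SudachiPy 0.6.7 + SudachiDict-core 20230927, normalized_form() of
# both 'apple' and 'Apple' is 'Apple', not the 'アップル' the test expects:
observed = [("apple", "Apple"), ("Apple", "Apple"), ("アップル", "アップル")]
print(pick_forms(observed, "normalized"))  # ['Apple', 'Apple', 'アップル']
```

So the tokenizer logic itself may be fine, and the failures would come from the dictionary data producing a different normalized form than the one hard-coded in the test.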
I also reran the same tests on a branch forked from the latest master, which failed the same 4/20 tests. My only changes were pinning the transformers requirement to 4.33.3 (plus its matching tokenizers version) and pinning the Python version to 3.9, which has prebuilt wheels so the tests could run.
I don't have Japanese language knowledge, so could you please guide me on the failing tests? Is there a way for me to get the correct test output? I read the source code and tried tokenizing with SudachiPy manually to get the normalized forms, but did not get the expected test results either:
(base) ✘ yangming@ming ~/Code/SudachiTra check/test /usr/bin/python3 -m pip freeze | grep -i sudachi
SudachiDict-core==20230927
SudachiPy==0.6.7
SudachiTra==0.1.8
(base) yangming@ming ~/Code/SudachiTra check/test /usr/bin/python3
Python 3.9.6 (default, Aug 11 2023, 19:44:49)
[Clang 15.0.0 (clang-1500.0.40.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from sudachipy import Dictionary, SplitMode
>>> tokenizer = Dictionary().create()
>>> text = "appleの辞書形はAppleで正規形はアップルである。"
>>> morphemes = tokenizer.tokenize(text)
>>> morphemes
<MorphemeList[
<Morpheme(apple, 0:5, (0, 551))>,
<Morpheme(の, 5:6, (0, 119137))>,
<Morpheme(辞書形, 6:9, (0, 1533531))>,
<Morpheme(は, 9:10, (0, 121601))>,
<Morpheme(Apple, 10:15, (0, 551))>,
<Morpheme(で, 15:16, (0, 101431))>,
<Morpheme(正規, 16:18, (0, 524191))>,
<Morpheme(形, 18:19, (0, 422564))>,
<Morpheme(は, 19:20, (0, 121601))>,
<Morpheme(アップル, 20:24, (0, 173324))>,
<Morpheme(で, 24:25, (0, 101428))>,
<Morpheme(ある, 25:27, (0, 12718))>,
<Morpheme(。, 27:28, (0, 6912))>,
]>
>>> morphemes[0].normalized_form()
'Apple'
>>> morphemes[0].surface()
'apple'
There were changes in the dictionary with respect to these forms. Please submit the PR as a WIP; I will think about what can be done.
@eiennohito thanks! I have opened a WIP PR here (it can't seem to link to this issue): https://github.com/WorksApplications/SudachiTra/pull/67
System Info
transformers version: 4.34.0
Platform: linux
Python version: 3.9.18
sudachitra version: 0.1.8
sudachipy version: 0.6.7
sudachi-core version: 20230927
Upstream changes in transformers from PR https://github.com/huggingface/transformers/pull/23909 cause an error when running the example at https://huggingface.co/megagonlabs/transformers-ud-japanese-electra-base-discriminator. This happens for other custom tokenizers as well: https://github.com/huggingface/transformers/issues/26777
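As a stopgap until a fix lands, one possible workaround is pinning a pre-change transformers version in requirements (version taken from earlier in this thread; a sketch, not an endorsed pin set, and it does not fix the dictionary-related normalized-form test failures):

```
# Pin transformers to a release before huggingface/transformers#23909
# changed custom-tokenizer loading.
transformers==4.33.3
```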
If it is OK, I would like to contribute and submit a PR to fix this issue.