WorksApplications / SudachiPy

Python version of Sudachi, a Japanese tokenizer.
Apache License 2.0

Tokenizing Ellipsis creates empty tokens #120

Open polm opened 4 years ago

polm commented 4 years ago

While working on spaCy Japanese model support and integrating Sudachi, I ran into an issue where the one-character ellipsis (…) was causing errors. If you tokenize this ellipsis you get three tokens from SudachiPy, with surfaces like ['', '', '…'].

I assume this is a bug but wasn't able to track down where it's happening. I also checked ㍻, and while that is also normalized internally it seems to be output as a single character without issue.

sorami commented 4 years ago

@polm Thank you for the report.

We believe this can be fixed in the same way as this Java Sudachi PR https://github.com/WorksApplications/Sudachi/pull/118 .

Let us look into it.

sorami commented 4 years ago

Currently, the one-character ellipsis is analyzed as follows:

$ echo … | sudachipy
    補助記号,句点,*,*,*,* .
    補助記号,句点,*,*,*,* .
…   補助記号,句点,*,*,*,* .
EOS

With the fix already applied to Java Sudachi ( https://github.com/WorksApplications/Sudachi/pull/118 ), this will change to:

$ echo … | sudachipy
…   補助記号,句点,*,*,*,* .
    補助記号,句点,*,*,*,* .
    補助記号,句点,*,*,*,* .
EOS

And the empty (zero-length) morphemes will still be there. This is not a bug, but by specification, the expected behavior of Sudachi.

The original input has only one character, so Sudachi allots it to the first morpheme and makes the remaining morphemes zero-length. These empty morphemes still have the normalized form ".".

In the ㍻ case, the input is also only one character, but after normalization 平成 is a single morpheme as well, hence no empty morphemes.
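
To make the alignment concrete, here is a minimal sketch in plain Python. It does not call SudachiPy; the list of (surface, normalized form) pairs is a hypothetical stand-in that mirrors the post-fix output shown above, and the helper names are made up for illustration.

```python
# Hypothetical stand-in for the post-fix tokenizer output shown above:
# each entry is a (surface, normalized_form) pair. Not produced by SudachiPy here.
morphemes = [("…", "."), ("", "."), ("", ".")]


def reconstruct(pairs):
    """Concatenate surfaces; zero-length morphemes contribute nothing."""
    return "".join(surface for surface, _ in pairs)


def drop_zero_length(pairs):
    """Keep only morphemes that cover at least one input character."""
    return [(surface, norm) for surface, norm in pairs if surface]


print(repr(reconstruct(morphemes)))
print(drop_zero_length(morphemes))
```

Concatenating the surfaces still reproduces the one-character input. A downstream consumer that cannot handle empty surfaces could filter them as sketched, at the cost of discarding the normalized forms carried by the dropped morphemes.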

polm commented 4 years ago

Huh, OK. I applied the fix and got the output above and assumed I had done something wrong.

> And the empty (zero-length) morphemes will still be there. This is not a bug, but by specification, the expected behavior of Sudachi.

Where can I see the specification?

I'm curious what the motivation for generating zero-length morphemes is.

sorami commented 4 years ago

Actually, there is no written specification, as far as I know ... The above is according to the main developer @kazuma-t , and I meant something like "that was what Sudachi was intended to do".

> I'm curious what the motivation for generating zero-length morphemes is.

This is because the number of tokens after normalization is bigger than the original ones. … is normalized to three morphemes . / . / . and there simply aren't enough tokens to align with. It is not that we want to have zero-length morphemes, but that is the only way we can think of.

polm commented 4 years ago

> Actually, there is no written specification, as far as I know ... The above is according to the main developer @kazuma-t , and I meant something like "that was what Sudachi was intended to do".

Ah OK then, good to know.

> This is because the number of tokens after normalization is bigger than the original ones. … is normalized to three morphemes . / . / . and there simply aren't enough tokens to align with. It is not that we want to have zero-length morphemes, but that is the only way we can think of.

Couldn't you just treat ... as a single morpheme? It's already an entry in small_lex.csv...

sorami commented 4 years ago

> Couldn't you just treat ... as a single morpheme? It's already an entry in small_lex.csv...

Right, it is in the lexicon. The analysis above simply happened to come out that way because of the scores.

We could handle this particular case as a single morpheme, but in general the zero-length morpheme case can still happen.

We let the users configure the character normalization, so it is possible to have cases where the output morphemes are longer than the input.

For example, if the input A produces the longer output B C D, we can think of three ways to handle such a case:

  1. Each output morpheme has the same repeated original form: A -> B(A) C(A) D(A) (If we concatenate the original forms there will be duplicates)
  2. Only the first one keeps the original form: A -> B(A) C() D() (The current behavior of Sudachi)
  3. Always make it a single morpheme: A -> BCD(A)

Approach 3 means giving up on the correct analysis. Between 1 and 2, we think 2 is probably the better solution, hence the current behavior.
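
The three options can be sketched in a few lines of plain Python. This is only an illustration of the trade-off, not Sudachi's actual implementation; the function name `align` and the strategy labels are hypothetical.

```python
def align(original, normalized_parts, strategy):
    """Pair each output morpheme with an original form.

    Returns a list of (normalized, original_form) tuples.
    The strategy names are hypothetical labels for the three approaches above.
    """
    if strategy == "repeat":        # approach 1: B(A) C(A) D(A)
        return [(part, original) for part in normalized_parts]
    if strategy == "first_only":    # approach 2: B(A) C() D()
        return [
            (part, original if i == 0 else "")
            for i, part in enumerate(normalized_parts)
        ]
    if strategy == "merge":         # approach 3: BCD(A)
        return [("".join(normalized_parts), original)]
    raise ValueError(f"unknown strategy: {strategy}")


def original_text(pairs):
    """Concatenate the original forms of all morphemes."""
    return "".join(orig for _, orig in pairs)


for name in ("repeat", "first_only", "merge"):
    pairs = align("A", ["B", "C", "D"], name)
    print(name, pairs, repr(original_text(pairs)))
```

Under approach 1 the concatenated original forms come out as "AAA", so the input can no longer be recovered by joining them. Approaches 2 and 3 both give back "A", but only approach 2 keeps B, C and D as separate morphemes.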

@kazuma-t Please correct me if I am explaining something wrong, or elaborate to make it clearer.