apertium / lttoolbox

Finite state compiler, processor and helper tools used by apertium
http://wiki.apertium.org/wiki/Lttoolbox
GNU General Public License v2.0

Support empty alphabet, for simple CJK word segmentation #75

Open unhammer opened 5 years ago

unhammer commented 5 years ago

Before https://github.com/apertium/lttoolbox/commit/944ed2556c38f058a5118ab5e481b3412aa3e3d8 / https://github.com/apertium/lttoolbox/pull/52
it was possible to use monodix files with an empty <alphabet> in order to segment into all known analyses (presumably symbols without analyses were output as blanks). But after the change, this is no longer possible.

See https://github.com/apertium/lttoolbox/commit/944ed2556c38f058a5118ab5e481b3412aa3e3d8#commitcomment-35679780 for test cases for Chinese/Japanese/Korean.

Maybe the iswalnum test could be turned off by a flag, e.g. lt-proc --no-implicit-alphabet?

TinoDidriksen commented 5 years ago

Surely this is as trivial as adding an alphabetic_chars.empty() check to the condition.
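To make the proposal concrete, here's a rough Python sketch of the guard being suggested. The name alphabetic_chars and the isalnum fallback mirror this thread's discussion; this is not the actual lt-proc C++ code.

```python
# Hypothetical sketch of the proposed guard, NOT the real lttoolbox source.
# `alphabetic_chars` stands in for the compiled <alphabet> contents.

def is_alphabetic(ch: str, alphabetic_chars: set) -> bool:
    """Decide whether `ch` counts as alphabetic for tokenization.

    Post-#52 behaviour: any alphanumeric character is implicitly
    alphabetic, which glues unanalyzed CJK runs into one big unknown.
    The proposed check skips that fallback when <alphabet> is empty,
    so per-entry segmentation works again.
    """
    if not alphabetic_chars:      # empty <alphabet>: no implicit alphabet
        return False
    return ch in alphabetic_chars or ch.isalnum()

# Normal alphabet: 'q' is alphabetic via the isalnum fallback.
print(is_alphabetic("q", set("abc")))   # True
# Empty alphabet: nothing is implicitly alphabetic, not even alnum chars.
print(is_alphabetic("我", set()))        # False
```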

unhammer commented 5 years ago

What if someone wants only some chars to be unknown-tokenizable?

TinoDidriksen commented 5 years ago

I guess. I say this should be an opt-out, then. Default should be to have as much as possible in the alphabet, and people can then opt-out with something like <alphabet verbatim="true">

unhammer commented 5 years ago

Definitely opt-out, which is why I suggested --no-implicit-alphabet, though an attribute would be great too. However, an attribute would require a change to the binary format, wouldn't it? (If the iswalnum check is in lt-proc, not lt-comp.)

TinoDidriksen commented 5 years ago

The last binary break prepared for this eventuality: https://github.com/apertium/lttoolbox/blob/master/lttoolbox/compression.h#L29 - we can add features without breaking existing files. But yeah, a cmdline flag for now would work.
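For readers unfamiliar with that trick, the general pattern is a feature bitfield in the file header: readers accept any file whose required features they all support, so new features can be added without breaking old files. A toy illustration (this is not the actual compression.h format, and the flag name is invented):

```python
# Toy illustration of feature-flag versioning, NOT the lttoolbox binary
# format. A 32-bit bitfield travels in the header; a reader rejects a
# file only if it requires a feature bit the reader doesn't know.
import struct

FEATURE_VERBATIM_ALPHABET = 1 << 0   # hypothetical <alphabet verbatim="true"> flag

def write_header(features: int) -> bytes:
    return struct.pack("<I", features)

def read_header(data: bytes, supported: int) -> int:
    (features,) = struct.unpack("<I", data[:4])
    unknown = features & ~supported
    if unknown:
        raise ValueError("file requires unsupported features: %#x" % unknown)
    return features

# An old reader (supported=0) still loads old files (features=0),
# while a new file with the flag set is refused cleanly.
hdr = write_header(FEATURE_VERBATIM_ALPHABET)
print(read_header(hdr, supported=FEATURE_VERBATIM_ALPHABET))  # 1
```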

ftyers commented 5 years ago

Regarding #52, isn't this what the inconditional section is for?

unhammer commented 5 years ago

oh yeah :) @Fred-Git-Hub ↑ would this cover your use case? With

<?xml version="1.0" encoding="UTF-8"?>
<dictionary>

   <alphabet>
   </alphabet>

   <sdefs>
      <sdef n="noun"/>
      <sdef n="verb"/>
   </sdefs>

   <section id="main" type="inconditional">
      <e><p><l>我</l><r>我<s n="noun"/></r></p></e>
      <e><p><l>爱</l><r>爱<s n="verb"/></r></p></e>
      <e><p><l>你</l><r>你<s n="noun"/></r></p></e>
   </section>

</dictionary>

I get

$ echo "我爱你" | lt-proc test.bin
^我/我<noun>$^爱/爱<verb>$^你/你<noun>$

(See http://wiki.apertium.org/wiki/Inconditional#inconditional for more info.)

unhammer commented 5 years ago

well, the problem is that anything without an analysis in inconditional would turn what follows into one big unknown:

$ echo "熊猫 爱你" |lt-proc test.bin   # space after the bear:
^熊猫/*熊猫$ ^爱/爱<verb>$^你/你<noun>$
$ echo "熊猫爱你" |lt-proc test.bin    # no space, big unknown:
^熊猫爱你/*熊猫爱你$

so then you'd have to make sure to put every symbol you might expect to appear before other symbols into inconditional, including foreign ones like a and b.
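The behaviour above can be modelled with a small sketch (a rough model for illustration, not lt-proc's actual matching code): text splits into maximal runs of "alphabetic" characters, and once no dictionary prefix matches inside a run, the rest of the run is swallowed into one unknown token. Whitespace breaks the run, which is why the version with a space after the bear segments fine.

```python
# Rough model of the observed behaviour, NOT lt-proc itself.
import re

LEXICON = {"我": "我<noun>", "爱": "爱<verb>", "你": "你<noun>"}

def analyze_run(run: str) -> str:
    out, i = "", 0
    while i < len(run):
        # longest dictionary prefix starting at i (LRLM)
        best = max((w for w in LEXICON if run.startswith(w, i)),
                   key=len, default=None)
        if best is None:
            rest = run[i:]                 # unknown swallows the rest of the run
            return out + "^%s/*%s$" % (rest, rest)
        out += "^%s/%s$" % (best, LEXICON[best])
        i += len(best)
    return out

def toy_proc(text: str) -> str:
    # whitespace breaks alphabetic runs, so unknowns can't cross a blank
    return "".join(p if p.isspace() else analyze_run(p)
                   for p in re.split(r"(\s+)", text) if p)

print(toy_proc("熊猫 爱你"))  # ^熊猫/*熊猫$ ^爱/爱<verb>$^你/你<noun>$
print(toy_proc("熊猫爱你"))   # ^熊猫爱你/*熊猫爱你$
```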

ftyers commented 5 years ago

Aha, got it @unhammer, that makes sense. In general I think that in order to deal with this properly we need (1) weights in the lexicon, and (2) a special function of lttoolbox that does segmentation... maybe something like the compounding functionality.

unhammer commented 5 years ago

Yeah, I do have the feeling plain LRLM should eventually hit something it can't handle, but I wonder how far you can get with what @Fred-Git-Hub had going (if the language was mostly single-character words, it should be possible without any new features).

Languages like Thai would need something more, but the current weights and compounding features don't look at context – wouldn't context be needed? Even the simple Norwegian case of ^3./3<adj><ord>/3<num>+.<sent>$ you can't solve without looking at words that are not part of the longest-match of any of the analyses.
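The limit of plain LRLM is easy to demonstrate with the classic Chinese example (the lexicon here is invented for illustration): greedy longest match takes the longer entry first and blocks the intended segmentation, which is exactly where weights or context would have to step in.

```python
# Hedged sketch of why plain LRLM (greedy longest match, left to right)
# can fail; the lexicon is invented for illustration.
LEXICON = {"研究", "研究生", "生命", "命", "起源"}

def lrlm(text: str) -> list:
    out, i = [], 0
    while i < len(text):
        best = max((w for w in LEXICON if text.startswith(w, i)),
                   key=len, default=text[i])   # fall back to a single char
        out.append(best)
        i += len(best)
    return out

# Greedy grabs 研究生 ("graduate student") first, stranding 命 ("life/fate"),
# instead of the intended 研究 / 生命 / 起源 ("research / life / origin"):
print(lrlm("研究生命起源"))  # ['研究生', '命', '起源']
```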

ftyers commented 5 years ago

Yeah, either you'd be stuck with a unigram model or you'd need to incorporate n-gram information somehow.
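For reference, the unigram option is the standard weighted-segmentation setup: pick the split that minimizes total cost (negative log probability of each word) with dynamic programming. The weights below are invented for illustration; in lttoolbox terms they would live on lexicon entries.

```python
# Sketch of unigram segmentation by dynamic programming; counts invented.
import math

WEIGHTS = {"研究": 100, "研究生": 30, "生命": 80, "命": 20, "起源": 60}
TOTAL = sum(WEIGHTS.values())
COST = {w: -math.log(c / TOTAL) for w, c in WEIGHTS.items()}

def unigram_segment(text: str) -> list:
    n = len(text)
    # best[i] = (cost, words) for the cheapest segmentation of text[:i]
    best = [(0.0, [])] + [(math.inf, [])] * n
    for i in range(n):
        if best[i][0] == math.inf:
            continue
        for w, c in COST.items():
            if text.startswith(w, i):
                cand = best[i][0] + c
                j = i + len(w)
                if cand < best[j][0]:
                    best[j] = (cand, best[i][1] + [w])
    return best[n][1]

# Unlike greedy longest match, the cheaper overall split wins:
print(unigram_segment("研究生命起源"))  # ['研究', '生命', '起源']
```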