apertium / apertium-yid

Apertium linguistic data for Yiddish
GNU General Public License v3.0

twol issue with composed characters #1

Closed: jonorthwash closed this issue 5 years ago

jonorthwash commented 5 years ago

The twol rule "Umlaut first vowel of words if {ь} occurs later in word" is supposed to, among other things, cause בּאַרג{ь} to become בּערג.

The relevant part in .deps/yid.LR.lexc.hfst: בּאַרג<n><m><pl>:בּאַרג{ь}

The relevant part in yid.autogen.hfst: בּאַרג<n><m><pl>:בּאַרג

What the relevant part in yid.autogen.hfst should look like: בּאַרג<n><m><pl>:בּערג

The odd thing is that applying the twol transducer on raw input works fine:

$ echo "בּ אַ ר ג {ь}" | hfst-strings2fst -S | hfst-compose-intersect -1 - -2 .deps/yid.twol.hfst | hfst-fst2strings
בּאַרג{ь}:בּערג

I believe the issue has something to do with tokenisation of the composed character אַ (U+05D0, U+05B7), especially since non-composed characters seem to work fine, e.g. הױז<n><nt><pl>:הײַזער.
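
To make the code point claim concrete, here is a minimal Python sketch (not part of the original report) that inspects אַ using only the standard `unicodedata` module. The variable names are illustrative, and the character is written with escapes to sidestep RtL rendering problems.

```python
import unicodedata

# alef followed by the combining point patah, as described in the issue
ALEF_PATAH = "\u05d0\u05b7"

# One user-perceived character, but two code points.
names = [unicodedata.name(ch) for ch in ALEF_PATAH]
for ch, name in zip(ALEF_PATAH, names):
    print(f"U+{ord(ch):04X} {name} (combining class {unicodedata.combining(ch)})")

# NFC normalisation does not merge the pair into a single code point,
# so normalising the input would not be expected to sidestep the problem.
nfc = unicodedata.normalize("NFC", ALEF_PATAH)
print(len(ALEF_PATAH), len(nfc))
```

Any tool that builds one arc per code point will therefore see the vowel as two symbols unless told otherwise.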

For the record, I've tried more direct versions of the twol rule in question, including just אַ:ע <=> .#. :Cons* _ [ :Cons | :Vowel ]* %{ь%}:0 ; with similar results.

@flammie, @ftyers, any thoughts about what might be going on, or how to get the desired behaviour here?

(Note: rendering of RtL in browsers may lead to confusion in reading this post. I recommend copying potentially confusing text to somewhere that will flatten everything to LtR, like a terminal or terminal-based editor.)

flammie commented 5 years ago

For the Xerox legacy tools, I think characters that use two or more codepoints need to be multichar_symbols, unless there have been significant changes in the works. With twol and regexes, though, everything is separated by spaces and new multichars are created automagically, so it should work there. If the interpretation for twol should be that the composed symbol is a single character, you might try adding the composed characters to a multichar_symbols section in lexc first. I think lexc by default compiles one arc per codepoint.
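
The difference between the two tokenisations can be illustrated with a toy greedy longest-match tokeniser in Python. This is only a model of the behaviour described above, not HFST's or lexc's actual implementation, and the function name `tokenize` is hypothetical.

```python
def tokenize(text, multichar_symbols=()):
    """Toy greedy longest-match tokeniser (a model, not HFST's real code).

    Declared multichar symbols are matched first, longest first;
    everything else falls back to one token per code point.
    """
    symbols = sorted((s for s in multichar_symbols if s), key=len, reverse=True)
    tokens = []
    i = 0
    while i < len(text):
        for sym in symbols:
            if text.startswith(sym, i):
                tokens.append(sym)
                i += len(sym)
                break
        else:
            # No declared symbol matches here: emit a single code point.
            tokens.append(text[i])
            i += 1
    return tokens


ALEF_PATAH = "\u05d0\u05b7"              # alef + combining patah
word = ALEF_PATAH + "\u05e8\u05d2"       # alef+patah, resh, gimel

per_code_point = tokenize(word)                 # alef and patah become separate tokens
with_multichar = tokenize(word, [ALEF_PATAH])   # alef+patah stays one token

print(len(per_code_point), len(with_multichar))
```

Under the first tokenisation, a twol rule written against the single symbol אַ has nothing to match, because the lexicon side only ever presents alef and patah as two separate symbols.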

jonorthwash commented 5 years ago

Yep, defining the composed characters as multicharacter symbols in lexc did it. Good tip. Thanks, @flammie!
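
For anyone hitting the same problem, one way to find the symbols to declare is to scan the lexicon text for base-plus-combining-mark sequences. The Python sketch below does that and prints them under a lexc `Multichar_Symbols` header. The helper names and the sample string are hypothetical, and the symbols actually declared in the fixing commit are not shown in this thread.

```python
import unicodedata


def composed_sequences(text):
    """Collect every base character followed by one or more combining marks."""
    found = set()
    current = ""
    for ch in text:
        if unicodedata.combining(ch) and current:
            # Combining mark: attach it to the sequence being built.
            current += ch
        else:
            if len(current) > 1:
                found.add(current)
            current = ch
    if len(current) > 1:
        found.add(current)
    return found


def multichar_section(text):
    """Format the sequences as a lexc Multichar_Symbols section."""
    symbols = sorted(composed_sequences(text))
    return "Multichar_Symbols\n" + " ".join(symbols) + "\n"


# Hypothetical sample input: alef+patah, resh, gimel, then alef+patah again.
sample = "\u05d0\u05b7\u05e8\u05d2 \u05d0\u05b7"
section = multichar_section(sample)
print(section)
```

The output is meant to be reviewed and merged by hand into the existing `Multichar_Symbols` section alongside the tags already declared there.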

jonorthwash commented 5 years ago

Closed with 4eceb803c8f4491b53de11a0606727ff576e3ae1.