apertium / apertium-yid

Apertium linguistic data for Yiddish
GNU General Public License v3.0

weird difference between hfst and lttoolbox transducers #3

Open jonorthwash opened 5 years ago

jonorthwash commented 5 years ago

The analysis of "איז" is different in the hfst and lttoolbox transducers:

$ echo "איז" | hfst-proc yid.automorf.hfst 
^איז/זײַן<v><pres><p3><sg>$

$ echo "איז" | lt-proc yid.automorf.bin
^איז/ז<>ן<v><pres><p3><sg>$

Something similar happens with "אַ":

$ echo "אַ" | hfst-proc yid.automorf.hfst
^אַ/אַ<det><sg>$

$ echo "אַ" | lt-proc yid.automorf.bin 
^א/<><det><sg>$ַ

Any thoughts on what might be going on, @ftyers or @flammie?

jonorthwash commented 5 years ago

Another thing that's different is the following:

$ echo "זיי" | hfst-proc yid.automorf.hfst 
!! Warning: Transducer contains one or more multi-character symbols made up of
ASCII characters which are also available as single-character symbols. The
input stream will always be tokenised using the longest symbols available.
Use the -t option to view the tokenisation. The problematic symbol(s):
וו יי
^זיי/זײ<prn><pers><p3><pl><acc>/זײ<prn><pers><p3><pl><dat>/זײ<prn><pers><p3><pl><nom>/זײַן<v><imp><sg>$

$ echo "זיי" | lt-proc yid.automorf.bin
^זיי/*זיי$
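
The hfst-proc warning above says the input is always tokenised with the longest symbols available. A minimal Python sketch of that greedy longest-match idea follows; it is not hfst-proc's actual code. `tokenise_longest` is a hypothetical helper, and the assumption that the two problematic symbols are plain doubled letters (U+05D5 U+05D5 and U+05D9 U+05D9) is mine, not something the warning states.

```python
def tokenise_longest(text, symbols):
    """Greedy left-to-right tokenisation that prefers the longest known symbol.

    `symbols` is a list of non-empty multi-character symbols; any character
    not covered by one of them becomes a single-character token.
    """
    by_length = sorted(symbols, key=len, reverse=True)
    tokens = []
    i = 0
    while i < len(text):
        for sym in by_length:
            if text.startswith(sym, i):
                tokens.append(sym)
                i += len(sym)
                break
        else:
            # No multi-character symbol matches here: emit one character.
            tokens.append(text[i])
            i += 1
    return tokens


# Assumed code points: vav-vav and yod-yod as two separate letters each.
multichar = ["\u05D5\u05D5", "\u05D9\u05D9"]
word = "\u05D6\u05D9\u05D9"  # zayin, yod, yod

tokens = tokenise_longest(word, multichar)
print(len(tokens))  # 2: zayin alone, then the two yods as one symbol
```

If lt-proc instead reads the same input one code point at a time, the two tools would be looking up different symbol sequences, which may be related to the difference above; that is a guess, not something confirmed in this thread.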

jonorthwash commented 5 years ago

@unhammer, regarding the extra <>: it seems to be related to spellrelax, since the <> replaces exactly the letters that spellrelax allows in those contexts, e.g. here:

^האט/ה<>בן<v><imp><pl>/ה<>בן<vaux><pres><p2><pl>/ה<>בן<vaux><pres><p3><sg>/ה<>בן<v><pres><p3><sg>/ה<>בן<v><pres><p2><pl>$

mr-martian commented 2 years ago

[17:27:56] <popcorndude> oh, this is so dumb
[17:28:03] <firespeaker> oh?
[17:28:23] <popcorndude> lt-comp stores tags without the <>
[17:28:30] <firespeaker> wat?
[17:28:37] <popcorndude> for space reasons
[17:28:59] <popcorndude> but that leads to the assumption that any multichar symbol in .bin is a tag
[17:29:12] <popcorndude> lt-comp on a .att will take off the initial and final <>
[17:29:19] <popcorndude> and lt-proc will put them back on
[17:30:11] <popcorndude> so output of <> means you had a 2-codepoint symbol: lt-comp took off the < and > without checking that they actually were < and >, leaving a symbol of length 0
[17:30:25] <popcorndude> then when outputting that symbol, lt-proc added back the < and >
[17:30:29] <popcorndude> leaving <>
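
The mechanism described in the log can be reproduced with a small Python sketch. This is a simulation of the described behaviour, not lttoolbox code: all function names are hypothetical, and the choice of alef plus patah (U+05D0 U+05B7) as the 2-codepoint symbol is an assumption based on the "אַ" example earlier in the thread.

```python
def strip_unchecked(symbol):
    """Compile-time step as described: drop the first and last code point
    of a multi-character symbol, assuming they are < and >."""
    return symbol[1:-1]


def restore_brackets(stored):
    """Output step as described: wrap a stored multi-character symbol in < >."""
    return "<" + stored + ">"


def round_trip(symbol):
    """Unchecked round trip: every multi-character symbol is treated as a tag."""
    if len(symbol) < 2:
        return symbol  # single code points are not treated as tags
    return restore_brackets(strip_unchecked(symbol))


def round_trip_checked(symbol):
    """One possible guard (an assumption, not the actual fix): only strip
    symbols that really start with < and end with >."""
    if len(symbol) >= 2 and symbol.startswith("<") and symbol.endswith(">"):
        return restore_brackets(symbol[1:-1])
    return symbol


tag = "<det>"
alef_patah = "\u05D0\u05B7"  # assumed 2-codepoint symbol: alef + patah

print(round_trip(tag))                # <det>
print(round_trip(alef_patah))         # <>
print(round_trip_checked(alef_patah) == alef_patah)  # True
```

A real tag survives the unchecked round trip, while a 2-codepoint letter symbol is reduced to an empty string and comes back as `<>`, matching the output seen from lt-proc.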