Open AMR-KELEG opened 5 years ago
I'm guessing the analyser didn't have " in its alphabet nor any analysis of "; in those cases, lt-proc will simply output the symbol as-is without wrapping it in ^"/"…$.
If you want to handle the apertium stream format, you should expect to see this kind of thing all the time. You could use http://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/apertium-cleanstream/ to get the relevant stuff out:
$ echo '^foo/bar<fie>$ " [hippopotamus] \["^ga/ga<ga>$'|apertium-cleanstream -n
^foo/bar<fie>$
^ga/ga<ga>$
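If you'd rather do the same filtering inside your own script, here is a minimal sketch in Python. It is not how apertium-cleanstream is implemented; it assumes only the basic stream conventions (lexical units run from ^ to $, superblanks sit in [ ], and a backslash escapes the next character). The function name lexical_units is made up for illustration.

```python
from typing import List


def lexical_units(stream: str) -> List[str]:
    """Return the ^...$ lexical units in `stream`, ignoring everything else.

    Hypothetical helper, not part of any Apertium tool. Bare symbols such as
    a lone double quote, superblanks ([...]) and escaped characters outside
    lexical units are skipped.
    """
    units = []
    start = None       # index of the '^' of the unit being read, if any
    in_blank = False   # True while inside a [superblank]
    i = 0
    while i < len(stream):
        ch = stream[i]
        if ch == "\\":
            # A backslash escapes the next character: skip both.
            i += 2
            continue
        if in_blank:
            if ch == "]":
                in_blank = False
        elif start is None:
            if ch == "[":
                in_blank = True
            elif ch == "^":
                start = i
        elif ch == "$":
            units.append(stream[start:i + 1])
            start = None
        i += 1
    return units


if __name__ == "__main__":
    line = '^foo/bar<fie>$ " [hippopotamus] \\["^ga/ga<ga>$'
    for unit in lexical_units(line):
        print(unit)
```

Run on the same input as the command above, this prints the same two lexical units. Anything outside ^...$ is dropped, so a line containing only " simply yields no units.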
(speaking of, we should probably get apertium-cleanstream into https://github.com/apertium/apertium/ )
I should open an issue there, shouldn't I?
That'd be nice :)
I am currently using the texts/eng.tagged file for testing the new weighting algorithms. While using the file, I noticed that it has some lines with just a single double quotation character! (Example: https://github.com/apertium/apertium-eng/blob/master/texts/eng.tagged#L823) Should these lines be fixed? I don't want to handle it in my script if it's a bug in the tagged corpus, and I believe fixing these lines is just a simple find-and-replace that any text editor can do easily.
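To see how many such lines there are before deciding anything, a small sketch like the following could list them. It only reports line numbers and changes nothing; the helper name bare_quote_lines is made up for illustration, and the path is the one mentioned above, relative to a checkout of apertium-eng.

```python
from typing import Iterable, List


def bare_quote_lines(lines: Iterable[str]) -> List[int]:
    """Return 1-based numbers of lines that contain only a double quote.

    Hypothetical helper for inspection; surrounding whitespace and the
    trailing newline are ignored.
    """
    return [n for n, line in enumerate(lines, start=1) if line.strip() == '"']


if __name__ == "__main__":
    # Assumes the script is run from the root of an apertium-eng checkout.
    with open("texts/eng.tagged", encoding="utf-8") as tagged:
        for number in bare_quote_lines(tagged):
            print(number)
```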