Strange lines in eng.tagged corpus

apertium / apertium-eng

Apertium linguistic data for English

GNU General Public License v3.0

Strange lines in eng.tagged corpus #20

Open AMR-KELEG opened 5 years ago

AMR-KELEG commented 5 years ago

I'm currently using the texts/eng.tagged file to test the new weighting algorithms. While doing so, I noticed that some lines contain nothing but a single double-quote character (example: https://github.com/apertium/apertium-eng/blob/master/texts/eng.tagged#L823):

^the/the<det><def><sp>$
"
^golden/golden<adj>$
^axe/axe<n><sg>$
"
^competition/competition<n><sg>$

Should these lines be fixed? I don't want to handle them in my script if this is a bug in the tagged corpus, and I believe fixing them is a simple find-and-replace that any text editor can do.

unhammer commented 5 years ago

I'm guessing the analyser had " neither in its alphabet nor in any analysis. In those cases, lt-proc simply outputs the symbol as-is, without wrapping it in ^"/"…$.

If you want to handle the apertium stream format, you should expect to see this kind of thing all the time. You could use http://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/apertium-cleanstream/ to get the relevant stuff out:

$ echo '^foo/bar<fie>$ " [hippopotamus] \["^ga/ga<ga>$'|apertium-cleanstream -n

^foo/bar<fie>$

^ga/ga<ga>$
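If apertium-cleanstream isn't at hand, roughly the same filtering could be done inside a script. The following is only a minimal Python sketch, not the tool's actual implementation: the function name lexical_units is made up for illustration, and it assumes the usual stream conventions (units wrapped in ^…$, superblanks in […], backslash as the escape character).

```python
def lexical_units(stream):
    """Return the ^...$ lexical units found in an Apertium stream string.

    Hypothetical helper for illustration; anything outside a unit
    (bare symbols such as ", whitespace, [superblanks]) is skipped.
    """
    units = []
    current = None    # characters of the unit being read, or None when outside a unit
    in_blank = False  # True while inside a [...] superblank
    escaped = False   # True when the previous character was a backslash
    for ch in stream:
        if escaped:
            # An escaped character never opens or closes anything.
            if current is not None:
                current.append(ch)
            escaped = False
        elif ch == "\\":
            if current is not None:
                current.append(ch)
            escaped = True
        elif current is not None:
            current.append(ch)
            if ch == "$":  # end of the lexical unit
                units.append("".join(current))
                current = None
        elif in_blank:
            if ch == "]":
                in_blank = False
        elif ch == "[":
            in_blank = True
        elif ch == "^":  # start of a lexical unit
            current = [ch]
    return units


# Same input as the shell example above.
print(lexical_units(r'^foo/bar<fie>$ " [hippopotamus] \["^ga/ga<ga>$'))
# prints ['^foo/bar<fie>$', '^ga/ga<ga>$']
```

Run over eng.tagged, a loop like this would simply skip the lone " lines rather than needing the corpus to be edited.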

unhammer commented 5 years ago

(speaking of, we should probably get apertium-cleanstream into https://github.com/apertium/apertium/ )

AMR-KELEG commented 5 years ago

I should open an issue there, shouldn't I?

unhammer commented 5 years ago

That'd be nice :)