Help needed in updating the eng.tagged corpus - Githubissues

apertium / apertium-eng

Apertium linguistic data for English

GNU General Public License v3.0

Help needed in updating the eng.tagged corpus #21

Open AMR-KELEG opened 5 years ago

AMR-KELEG commented 5 years ago

I have found that some tokens are marked as unknown (`*`) in the corpus even though the compiled dictionary can analyse them.

These cases can be discovered easily, but I need help with inspecting them manually.
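A minimal sketch of how such cases could be listed, assuming each line of `eng.tagged` holds a single `^surface/analysis$` unit (the trailing `$` is treated as optional in case a line is truncated). The function name `find_inconsistent` and the regular expression are illustrative and are not part of Apertium:

```python
import re
from collections import defaultdict

# Assumed line shape: ^surface/analysis$ with an optional trailing "$".
UNIT = re.compile(r"^\^(?P<surface>[^/]+)/(?P<analysis>.*?)\$?$")


def find_inconsistent(lines):
    """Return surface forms that are unknown on some lines but analysed on others.

    The result maps each surface form to a pair:
    (line numbers where it is unknown, sorted analyses seen elsewhere).
    """
    unknown = defaultdict(list)  # surface -> line numbers tagged "*"
    known = defaultdict(set)     # surface -> analyses seen in the corpus
    for lineno, line in enumerate(lines, start=1):
        match = UNIT.match(line.strip())
        if match is None:
            continue  # not a unit in the assumed format
        surface = match.group("surface")
        analysis = match.group("analysis")
        if analysis.startswith("*"):
            unknown[surface].append(lineno)
        else:
            known[surface].add(analysis)
    return {
        surface: (linenos, sorted(known[surface]))
        for surface, linenos in unknown.items()
        if surface in known
    }
```

Passing an open file handle for `texts/eng.tagged` as `lines` would give the candidates that still need manual inspection.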

Fixing the tags does not seem straightforward, though. For example, the token bloody appears on lines 11 and 11145 with a different analysis on each: https://github.com/apertium/apertium-eng/blob/master/texts/eng.tagged#L11 https://github.com/apertium/apertium-eng/blob/master/texts/eng.tagged#L11145

| line | analysis |
| --- | --- |
| 11 | `^bloody/*bloody` |
| 11145 | `^bloody/bloody<adj><sint>$` |

What do you think is the best way to fix such cases?

ftyers commented 5 years ago

For the weighted automata project, the best way is to just ignore these errors. Your code should just discard/skip invalidly encoded words.
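A minimal sketch of that skipping step, assuming one `^surface/analysis$` unit per line of the corpus. The generator name `valid_units` is hypothetical, and treating a missing `$` or an analysis starting with `*` as "invalid" is an assumption about what should be discarded:

```python
def valid_units(lines):
    """Yield (surface, analysis) pairs, skipping unknown or malformed units."""
    for line in lines:
        unit = line.strip()
        # A well-formed unit is assumed to be wrapped in "^" ... "$".
        if not (unit.startswith("^") and unit.endswith("$")):
            continue
        surface, sep, analysis = unit[1:-1].partition("/")
        # Skip units with no analysis or with an unknown-word marker "*".
        if not sep or not analysis or analysis.startswith("*"):
            continue
        yield surface, analysis
```

Downstream code would then iterate over `valid_units(open("texts/eng.tagged", encoding="utf-8"))` and never see the inconsistent entries.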