UniversalDependencies / UD_Irish-IDT

Irish data

Many feature corrections based on QA scripts #154

Closed kscanne closed 1 year ago

kscanne commented 1 year ago

A bit of back story... I've written some scripts with the aim of improving POS tagging for Irish, as part of work with Fiontar (@Gaois and @michealjohnny). These scripts arose out of the QA work I did on this treebank at the beginning of 2021. The good stuff is here:

https://github.com/kscanne/grammatach/blob/main/grammatach/ga.py

The code is a mess, but the basic idea is to combine the UD tagger with rule-based constraints inspired by @uidhonne's tagger. The advantage is that the constraints can be expressed richly in terms of the UD dependency relations. As a simple example, virtually all rules for initial mutation can be expressed in this formalism.
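To make the idea of a dependency-based constraint concrete, here is a minimal sketch. It is not the actual grammatach code: the function names, the sample sentence, and the deliberately simplified rule (a noun governing the possessive "mo" via `nmod:poss` should carry `Form=Len` if it starts with a lenitable consonant) are all assumptions for illustration only.

```python
# Hypothetical, simplified constraint check; NOT the grammatach implementation.
# Assumed rule: the head of possessive "mo" (deprel nmod:poss) must have
# Form=Len when its lemma starts with a lenitable consonant.
LENITABLE = set("bcdfgmpt")  # simplification: ignores 's' and its cluster exceptions


def parse_conllu(text):
    """Parse one CoNLL-U sentence into a list of token dicts."""
    tokens = []
    for line in text.strip().splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if not cols[0].isdigit():
            continue  # skip multiword tokens (1-2) and empty nodes (1.1)
        feats = {}
        if cols[5] != "_":
            feats = dict(f.split("=", 1) for f in cols[5].split("|"))
        tokens.append({
            "id": int(cols[0]), "form": cols[1], "lemma": cols[2],
            "upos": cols[3], "feats": feats,
            "head": int(cols[6]), "deprel": cols[7],
        })
    return tokens


def lenition_violations(tokens):
    """Return ids of tokens that break the assumed 'mo' lenition rule."""
    by_id = {t["id"]: t for t in tokens}
    violations = []
    for t in tokens:
        if t["lemma"] == "mo" and t["deprel"] == "nmod:poss":
            head = by_id.get(t["head"])
            if head is None:
                continue
            lenitable = head["lemma"][:1].lower() in LENITABLE
            if lenitable and head["feats"].get("Form") != "Len":
                violations.append(head["id"])
    return violations


# Made-up two-token example ("mo cat", which should be "mo chat").
SAMPLE = "\n".join("\t".join(row) for row in [
    ["1", "mo", "mo", "DET", "Det", "Number=Sing|Person=1|Poss=Yes",
     "2", "nmod:poss", "_", "_"],
    ["2", "cat", "cat", "NOUN", "Noun", "Case=Nom|Gender=Masc|Number=Sing",
     "0", "root", "_", "_"],
])

print(lenition_violations(parse_conllu(SAMPLE)))  # → [2]
```

The point of the sketch is that the rule looks at the dependency arc (possessor to head noun) rather than at linear adjacency, which is what lets such constraints be stated directly over the treebank annotation.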

As a side effect, I can take any existing treebank and display where its existing features violate the constraints. This PR is the first batch of corrections based on this. More to come. I'm hoping to generalize enough to handle pre-standard Irish as well, so that it can be applied to the treebank I started last year:

https://github.com/UniversalDependencies/UD_Irish-Cadhan/tree/dev

These changes should be uncontroversial, but I'd be grateful if anyone is willing to give things a quick sanity check.