First steps towards a CG-based UD parser; point to the lexicon-proofreading-effort in the docs; some corrections in puupankki

IlnarSelimcan commented 4 years ago

Note that adding the dependency labels / arcs to the output of kaz-tagger and kaz-disam (as is currently the case) breaks translators, which means that this PR shouldn't be merged right now. But there are also other reasons why it's not ready yet (as discussed above).

jonorthwash commented 4 years ago

Note that adding the dependency labels / arcs to the output of kaz-tagger and kaz-disam (as is currently the case) breaks translators

Is this something that would benefit from secondary tags?

IlnarSelimcan commented 4 years ago

Note that adding the dependency labels / arcs to the output of kaz-tagger and kaz-disam (as is currently the case) breaks translators

Is this something that would benefit from secondary tags?

Yep, I believe so (assuming that dependency labels will be declared as such).

jonorthwash commented 4 years ago

@khannatanmai, see above about secondary tags.

khannatanmai commented 4 years ago

Yep, secondary tags will be ignored by the pattern-matching FSTs, so adding them wouldn't break anything. I'm currently working on transfer, since making the tags pass through transfer is proving harder than I thought, but I'll let you know when I get to the tagger and the rest.

IlnarSelimcan commented 4 years ago

I have to admit that I'm still catching up with the discussion of secondary tags, so what I say here might not be in line with what is planned in the project at all, but still...

What occurred to me is that, in order not to break apertium-X-Y, marking tags as secondary in apertium-X would have to be based on knowledge of apertium-Y. This means that apertium-X and apertium-Y would essentially remain "coupled". Imho, ideally the developer of apertium-X wouldn't have to care about what apertium-Y expects, and wouldn't bother marking tags as 'main' or 'secondary'. The developer of apertium-X-Y (with knowledge of both apertium-X and apertium-Y) would say: apertium-Y expects this, so cherry-pick this, this and this tag from the stream.

That's how it probably would be done if using a CG for transfer. Whether it is possible with a finite-state transfer, I don't know.
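To make the cherry-picking idea concrete, here is a minimal sketch in Python, not an existing Apertium module. It assumes lexical units in the usual `^lemma<tag><tag>$` stream format and a hypothetical allowlist (`EXPECTED_BY_Y`) that the apertium-X-Y developer would maintain. The function name `cherry_pick`, the tag inventory, and the `@`-prefixed dependency labels are illustrative assumptions, not taken from apertium-kaz. Dependency arcs are not handled.

```python
import re

# One lexical unit in Apertium stream format: ^lemma<tag1><tag2>...$
LU_RE = re.compile(r"\^([^<$^]*)((?:<[^<>]*>)*)\$")
TAG_RE = re.compile(r"<([^<>]*)>")


def cherry_pick(stream, expected):
    """Keep only the tags listed in `expected`; drop every other tag."""
    def rewrite(match):
        lemma = match.group(1)
        tags = TAG_RE.findall(match.group(2))
        kept = "".join("<{}>".format(t) for t in tags if t in expected)
        return "^{}{}$".format(lemma, kept)
    return LU_RE.sub(rewrite, stream)


# Hypothetical allowlist, declared in the bilingual package (assumption).
EXPECTED_BY_Y = {"n", "v", "iv", "nom", "past", "p3", "sg"}

# Illustrative input: the @-labels stand in for dependency labels.
line = "^бала<n><nom><@nsubj>$ ^кел<v><iv><past><p3><sg><@root>$"
print(cherry_pick(line, EXPECTED_BY_Y))
# prints: ^бала<n><nom>$ ^кел<v><iv><past><p3><sg>$
```

With this arrangement the monolingual package emits whatever it likes, and only the pair decides what survives, which is the decoupling described above.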

UPDATE: Well, in short, the question is where to declare some of the tags as secondary: in the monolingual package or in the bilingual package? Adjusting the transfer module, that is, making it robust to secondary tags present in the stream, is required either way, so that should be a big step forward.

apertium / apertium-kaz

First steps towards a CG-based UD parser; point to the lexicon-proofreading-effort in the docs; some corrections in puupankki #16