Orange-OpenSource / conllueditor

ConllEditor is a tool to edit dependency syntax trees in CoNLL-U format.
BSD 3-Clause "New" or "Revised" License

Alignment between text and syntactic tree #33

Closed Stormur closed 1 year ago

Stormur commented 1 year ago

Hi!

There seems to be an issue when editing the SpaceAfter value in the MISC field: the text is not readjusted accordingly, and so an alignment error ensues. But this should be no different from the manipulations of MWT and similar things, right?

jheinecke commented 1 year ago

This is a change I made on purpose (a while ago) to help the annotator see errors when # text is not coherent with the Space(s)After keys in MISC. Imagine a sentence automatically "annotated" by a parser, which is then going to be corrected manually. If the parser missed more than one Space(s)After, the user corrects one of them. If ConlluEditor then updated the # text field, the second "bad" Space(s)After would go unnoticed (instead of the bad MISC field being corrected, CE would have changed # text). For sentsplit and split I agreed that the annotators (should) know what they are doing, so adapting # text automatically seems the best option. But here I'm in doubt. What do you think?

Stormur commented 1 year ago

I admit I do not understand the issue well... do you have an example? Anyway, when correcting SpaceAfter, even if the text is not automatically recompiled, shouldn't there be a way to do it inside conllueditor?

jheinecke commented 1 year ago

You can edit the text by clicking on edit metadata.

As an example of what I meant above, imagine a sentence to be validated:

# text = veni, vidi, vici
1       veni    venire  VERB    _       _       0       root    _       _
2       ,       ,       PUNCT   _       _       1       punct   _       _
3       vidi    videre  VERB    _       _       1       conj    _       _
4       ,       ,       PUNCT   _       _       3       punct   _       _
5       vici    vincere VERB    _       _       1       conj    _       _

The commas 2 and 4 are not preceded by a space, but tokens 1 and 3 lack SpaceAfter=No. With the current version, CE tells you that there is an incoherence at the first comma. If you add the missing SpaceAfter=No, CE will tell you that there is still an incoherence at the second comma. However, if CE adapted # text according to the Space(s)After of the tokens, then once you had added the first SpaceAfter=No to veni, CE would silently correct # text to # text = veni, vidi , vici and show that everything is fine. But CE would then have modified the original text to match token 3, instead of insisting that you check token 3 and either add a SpaceAfter=No or manually modify # text. Since in many treebanks the # text line seems to be taken from a corpus and therefore should not be modified, I do not want to break this by an automatic adaptation.
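To make the two behaviours concrete, here is a minimal Python sketch. It is not ConlluEditor's code (CE is a Java tool), and the function names `parse_sentence`, `first_incoherence` and `rebuild_text` are made up for this illustration. It assumes plain word lines only (multiword token ranges and empty nodes are skipped) and it reports the token whose spacing disagrees with # text, which may not be exactly where CE places its warning.

```python
SAMPLE = """
# text = veni, vidi, vici
1 veni venire VERB _ _ 0 root _ _
2 , , PUNCT _ _ 1 punct _ _
3 vidi videre VERB _ _ 1 conj _ _
4 , , PUNCT _ _ 3 punct _ _
5 vici vincere VERB _ _ 1 conj _ _
"""


def parse_sentence(block):
    """Return (text, tokens); each token is (id, form, space_after)."""
    text = ""
    tokens = []
    for line in block.strip().splitlines():
        line = line.strip()
        if line.startswith("# text ="):
            text = line.split("=", 1)[1].strip()
        elif line and not line.startswith("#"):
            # Real CoNLL-U is tab-separated; fall back to whitespace for pasted samples.
            cols = line.split("\t") if "\t" in line else line.split()
            if not cols[0].isdigit():
                continue  # simplification: skip ranges such as 1-2 and empty nodes such as 1.1
            misc = cols[9].split("|") if len(cols) > 9 else []
            tokens.append((int(cols[0]), cols[1], "SpaceAfter=No" not in misc))
    return text, tokens


def first_incoherence(text, tokens):
    """Id of the first token whose form or spacing disagrees with text, else None."""
    pos = 0
    for index, (tok_id, form, space_after) in enumerate(tokens):
        if not text.startswith(form, pos):
            return tok_id
        pos += len(form)
        if index == len(tokens) - 1:
            break  # spacing after the last token is not checked
        has_space = pos < len(text) and text[pos].isspace()
        if has_space != space_after:
            return tok_id
        while pos < len(text) and text[pos].isspace():
            pos += 1
    if tokens and text[pos:].strip():
        return tokens[-1][0]  # text continues beyond the last token
    return None


def rebuild_text(tokens):
    """The automatic alternative: derive the text from forms and SpaceAfter."""
    return "".join(form + (" " if space else "") for _, form, space in tokens).rstrip()


if __name__ == "__main__":
    text, tokens = parse_sentence(SAMPLE)
    print(first_incoherence(text, tokens))  # 1: veni lacks SpaceAfter=No
    tokens[0] = (1, "veni", False)          # the annotator fixes token 1 only
    print(first_incoherence(text, tokens))  # 3: the second problem is still reported
    print(rebuild_text(tokens))             # veni, vidi , vici
```

With the check, fixing veni still leaves token 3 flagged. With `rebuild_text`, the same single fix would overwrite the original text with "veni, vidi , vici" and hide the remaining error.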

Stormur commented 1 year ago

Ah, OK, now it is clear. It is indeed an issue. I had missed the possibility of editing the metadata, but it is probably good as it is now, since I am thinking of occasional corrections, not systematic mass editing.

Thanks for the explanations!