UniversalDependencies / UD_Ancient_Greek-PROIEL

Ancient Greek data from the PROIEL project.

To what extent do the annotations match those of Perseus? #1

Open AngledLuffa opened 9 months ago

AngledLuffa commented 9 months ago

Looking through the CoNLL-U files real quick, it's evident that the XPOS tags don't match. Are the annotation guidelines for the other columns, such as lemmas, UPOS, or dependencies, mostly the same between the two treebanks, though?
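For anyone who wants to check this kind of mismatch themselves, here is a minimal sketch that compares the tag inventories of two CoNLL-U texts column by column. The helper `column_counts` and the two sample lines are hypothetical and made up for illustration; they are not taken from either treebank, and real files would be read from disk instead.

```python
from collections import Counter

UPOS, XPOS = 3, 4  # 0-based column indices in the CoNLL-U format


def column_counts(conllu_text, column):
    """Count the values of one CoNLL-U column over ordinary word lines."""
    counts = Counter()
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        fields = line.split("\t")
        # Skip multiword-token ranges (e.g. 1-2) and empty nodes (e.g. 1.1).
        if len(fields) != 10 or not fields[0].isdigit():
            continue
        counts[fields[column]] += 1
    return counts


# Made-up one-token samples in two different XPOS styles (illustrative only).
sample_a = "1\tλόγος\tλόγος\tNOUN\tNb\t_\t0\troot\t_\t_\n"
sample_b = "1\tλόγος\tλόγος\tNOUN\tn-s---mn-\t_\t0\troot\t_\t_\n"

xpos_a = set(column_counts(sample_a, XPOS))
xpos_b = set(column_counts(sample_b, XPOS))
shared_xpos = xpos_a & xpos_b  # empty when the tagsets do not overlap

upos_a = set(column_counts(sample_a, UPOS))
upos_b = set(column_counts(sample_b, UPOS))
shared_upos = upos_a & upos_b
```

An empty `shared_xpos` alongside a non-empty `shared_upos` would suggest that only the language-specific tagset differs, although identical tag inventories say nothing about whether the guidelines assign those tags the same way.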

One of our Stanza users ran into an issue where the models trained on this data don't properly process "," or ".", likely because there are zero examples of either in this dataset. I'm not sure which would be the best solution: finding some way to compensate for the missing punctuation in this dataset, switching to Perseus (which has some of the same annotators), or mixing the two datasets together.
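As one possible reading of "compensating" on the input side, the sketch below drops punctuation-only tokens from already tokenized text before it reaches a model trained without punctuation. This is an assumption about what such a workaround could look like, not something Stanza does; `is_punctuation_only` and `drop_punctuation` are hypothetical helpers, and the token list is made up.

```python
import unicodedata


def is_punctuation_only(token):
    """True if every character of the token is in a Unicode punctuation category."""
    return bool(token) and all(
        unicodedata.category(ch).startswith("P") for ch in token
    )


def drop_punctuation(tokens):
    """Return the tokens with punctuation-only items removed."""
    return [tok for tok in tokens if not is_punctuation_only(tok)]


# Made-up pretokenized input containing the two problematic tokens.
tokens = ["ἐν", "ἀρχῇ", "ἦν", "ὁ", "λόγος", ",", "καὶ", "θεὸς", "ἦν", "ὁ", "λόγος", "."]
cleaned = drop_punctuation(tokens)
```

Only tokens made up entirely of punctuation are removed, so an elided form that ends in an apostrophe is left alone. The trade-off is that sentence-final punctuation is lost as a cue for sentence boundaries, so sentence splitting would have to happen before this step.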

https://github.com/stanfordnlp/stanza/issues/1311

Thanks in advance.

martinpopel commented 9 months ago

see https://aclanthology.org/2023.udw-1.2.pdf by @fjambe and @dan-zeman

AngledLuffa commented 9 months ago

That's excellent, thanks for the link. I will look into applying it to Latin when we have time. Do you know if any of the findings apply to the Ancient Greek family of models?

On Mon, Nov 27, 2023, 1:22 PM Martin Popel wrote:

see https://aclanthology.org/2023.udw-1.2.pdf by @fjambe and @dan-zeman

dan-zeman commented 9 months ago

As far as I know, none of the PROIEL treebanks in UD contain punctuation. It exists in the original texts but not in the tree structure, and it is not exported to UD.
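A quick way to confirm this for a given file is to count tokens tagged `PUNCT` and tokens whose form is "," or ".". The sketch below does that on a made-up two-token sample; `count_punct` is a hypothetical helper, and a real check would pass in the contents of the treebank's `.conllu` files.

```python
def count_punct(conllu_text, forms=(",", ".")):
    """Return (tokens with UPOS PUNCT, tokens whose FORM is in `forms`)."""
    by_upos = 0
    by_form = 0
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        fields = line.split("\t")
        # Skip multiword-token ranges and empty nodes.
        if len(fields) != 10 or not fields[0].isdigit():
            continue
        if fields[3] == "PUNCT":
            by_upos += 1
        if fields[1] in forms:
            by_form += 1
    return by_upos, by_form


# Made-up sample without punctuation (illustrative, not treebank data).
sample = (
    "1\tἐν\tἐν\tADP\t_\t_\t2\tcase\t_\t_\n"
    "2\tἀρχῇ\tἀρχή\tNOUN\t_\t_\t0\troot\t_\t_\n"
)
# The same sample with a hypothetical punctuation token appended.
sample_with_punct = sample + "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"

no_punct_counts = count_punct(sample)
with_punct_counts = count_punct(sample_with_punct)
```

A result of `(0, 0)` over every split of a treebank would be consistent with punctuation not being exported at all.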