UniversalDependencies / UD_Arabic-PUD

Parallel Universal Dependencies.

which Arabic is this, please? #2

Open bansp opened 4 days ago

bansp commented 4 days ago

Hi all, I'm too much of a beginner at Arabic to be able to tell, and "ar" is obviously not enough either: is this dataset in MSA (Modern Standard Arabic), please?

And, to cram an extra question on top: what does "original text" mean, please? I can spot a difference in diacritics here and there, both ways (mostly or only for the extra "n") -- is the "text" layer somehow normalised wrt an earlier rendering?

I did try to locate some documentation by following the links, but that only got me as far as https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-2184, which doesn't answer my questions. If I simply need to RTFM on this dataset, I'll gladly do that, if someone cares to point me there, please :-)

dan-zeman commented 4 days ago

My understanding is that it is MSA. But it has been translated from other languages (English, German, French, Spanish, Italian). The diacritics appear, presumably, where the translators decided to include them. I don't know about any normalization step, although I cannot completely exclude that it happened when the data was processed by Google.

The Lindat link is useful only for obtaining the data, but there won't be more documentation than there is in this GitHub folder (specifically, in the README file). I actually have a copy of the Google Arabic Syntax Annotation Guidelines – something we do not have for any other original PUD datasets that went through Google. I don't know, though, whether it contains the answers you are looking for.
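
For anyone who wants to check the diacritics question empirically, here is a minimal Python sketch, not an official UD tool. It reads the sentence-level comment lines of a CoNLL-U file and, for each sentence where two comment fields differ, reports whether the difference disappears once Unicode combining marks (which include the Arabic short vowels and tanwin) are removed. The comment keys `text` and `original_text` are assumptions based on the wording in this thread, and the function names are made up for illustration; adjust the keys to whatever the file actually uses.

```python
import unicodedata


def strip_marks(s):
    """Drop nonspacing combining marks (Unicode category Mn).

    This covers Arabic harakat and tanwin (U+064B..U+0652).
    No Unicode normalization is applied, so precomposed letters
    such as alef with madda are left untouched.
    """
    return "".join(ch for ch in s if unicodedata.category(ch) != "Mn")


def sentence_comments(lines):
    """Yield one dict of '# key = value' comments per sentence block."""
    meta = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line ends the current sentence block.
            if meta:
                yield meta
            meta = {}
        elif line.startswith("#") and "=" in line:
            key, _, value = line[1:].partition("=")
            meta[key.strip()] = value.strip()
    if meta:
        yield meta


def diacritic_only_diffs(lines, key_a="text", key_b="original_text"):
    """Yield (sent_id, differs_only_in_marks) for sentences where the
    two comment fields are present and not identical.

    The default keys are assumptions, not confirmed field names.
    """
    for meta in sentence_comments(lines):
        a, b = meta.get(key_a), meta.get(key_b)
        if a is None or b is None or a == b:
            continue
        yield meta.get("sent_id"), strip_marks(a) == strip_marks(b)
```

To use it, pass an open file (or any iterable of lines) to `diacritic_only_diffs`. Sentences flagged `False` differ in something other than combining marks, which would hint at a normalization step beyond diacritics.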