How to document script used for the data in treebank?

UniversalDependencies / docs

Universal Dependencies online documentation

http://universaldependencies.org/

Apache License 2.0


How to document script used for the data in treebank? #1032

Open Abhishek-P opened 5 months ago

Abhishek-P commented 5 months ago

This is a case I came across when using the UD Sanskrit (https://universaldependencies.org/sa/index.html) treebanks.

The two treebanks use two different scripts: UFAL uses Devanagari, while Vedic uses Latin.

I suspect this may be true for some other languages as well (I have yet to do an audit). I am also assuming that script mixing within a treebank is not a case we have to worry about here.

Currently, the treebank page does not provide any explicit information about the script, although it can be inferred from the examples in the morphology overview section.

I think it would be nice to surface that information more clearly on the treebank page, since it can be an important treebank characteristic to keep in mind for certain uses.

My preliminary proposal is to add it to the description, similar to Genre, with a hyperlink to a scholarly source on scripts (ScriptSource?):

License: CC BY-SA 4.0

Genre: fiction

Script: devanagari

dan-zeman commented 5 months ago

ISO 15924 provides codes suitable for such a metadata item. There are probably finer distinctions that could be made about the spelling rules in the treebank, but those would be difficult to capture systematically, and ISO codes of scripts would be an improvement over no info (current status).

Abhishek-P commented 5 months ago

I can make a change to the sa treebank pages using ISO 15924 codes for a start. Let me know if and where this needs to be discussed and documented (templatized) for future work.

Abhishek-P commented 5 months ago

I did a quick check of the treebank comparison pages (those linked from the home page); sa is the only case I caught of treebanks using different scripts.

dan-zeman commented 5 months ago

> I did a quick check of the treebank comparison pages (those linked from the home page); sa is the only case I caught of treebanks using different scripts.

This is true at the moment as far as I know, but there are other languages that could use multiple writing systems, so it is definitely a property of the treebank rather than the language.

dan-zeman commented 5 months ago

> I can make a change to the sa treebank pages using ISO 15924 codes for a start. Let me know if and where this needs to be discussed and documented (templatized) for future work.

I will raise this at a future meeting of the core group. Assuming there won't be objections, these are the next steps:

1. Document it next to other metadata in the Release checklist.
2. Add it to validation infrastructure (the line will be required and must contain valid values, e.g. Script: Latn for Latin-based alphabets).
3. Announce it to the mailing list and make sure that all treebanks have the new line in their READMEs. Also add it to the template used when creating new repositories.
4. Make sure that the script that generates treebank hub pages (at release time) copies this information to the pages.
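
As a rough illustration of the validation step above, here is a minimal Python sketch of a check for a `Script:` line in a README. It is not the actual UD validation code: the function name `check_script_line`, the error messages, the choice to accept several whitespace-separated codes, and the small code set are all assumptions made for this example. A real check would load the full ISO 15924 registry.

```python
import re

# Illustrative subset of ISO 15924 four-letter codes (assumption: a real
# validator would read the complete registry instead of this short list).
KNOWN_SCRIPT_CODES = {"Latn", "Deva", "Cyrl", "Arab", "Grek", "Hebr", "Egyp"}

# Match a "Script:" metadata line; only spaces and tabs are skipped so the
# pattern never runs over into the following line.
SCRIPT_LINE = re.compile(r"^Script:[ \t]*(.*?)[ \t]*$", re.MULTILINE)


def check_script_line(readme_text):
    """Return a list of error messages; an empty list means the line is valid."""
    match = SCRIPT_LINE.search(readme_text)
    if match is None:
        return ["missing 'Script:' line"]
    # Hypothetical convention: several codes may be given, separated by spaces.
    codes = match.group(1).split()
    if not codes:
        return ["empty 'Script:' line"]
    return [
        f"unknown script code: {code!r}"
        for code in codes
        if code not in KNOWN_SCRIPT_CODES
    ]
```

Under these assumptions, a README containing `Script: Deva` passes, while the free-text value `Script: devanagari` from the original proposal is reported as an unknown code.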

amir-zeldes commented 5 months ago

> This is true at the moment as far as I know, but there are other languages that could use multiple writing systems, so it is definitely a property of the treebank rather than the language.

Another candidate is UD_Egyptian, which uses Schenkel transcription rather than hieroglyphs or Gardiner codes, either of which would be conceivable for Egyptian.

robvanderg commented 5 months ago

If this is to be automated, it can be tricky to find code for it (the word "script" is ambiguous when searching for code). Here are some existing solutions (the last one is mine, optimized for speed rather than RAM):

https://github.com/cisnlp/GlotScript
https://robvanderg.github.io/scripts/scripts/
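
For a quick sanity check without extra dependencies, a crude heuristic can be sketched with Python's standard library. This is not how the tools linked above work: it simply takes the first word of each letter's Unicode character name as a stand-in for its script, so it returns names such as `DEVANAGARI` or `LATIN` rather than ISO 15924 codes, and the function name `dominant_script` is made up for this example.

```python
import unicodedata
from collections import Counter


def dominant_script(text):
    """Guess the most frequent script among the letters in `text`.

    Heuristic: the first word of a character's Unicode name (e.g.
    "DEVANAGARI LETTER RA" -> "DEVANAGARI") is used as a script label.
    Returns None when the text contains no letters.
    """
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, spaces, combining marks
        try:
            name = unicodedata.name(ch)
        except ValueError:
            continue  # character has no name in the Unicode database
        counts[name.split()[0]] += 1
    return counts.most_common(1)[0][0] if counts else None
```

Running such a function over the word forms of each treebank would be one way to spot cases like the two Sanskrit treebanks, although a dedicated library is the safer choice for anything beyond a rough audit.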