Open Abhishek-P opened 5 months ago
ISO 15924 provides codes suitable for such a metadata item. There are probably finer distinctions that could be made about the spelling rules in the treebank, but those would be difficult to capture systematically, and ISO codes of scripts would be an improvement over no info (current status).
I can make a change in the sa treebank pages using ISO 15924 for a start.
Let me know if and where this needs to be discussed and documented (templatized) for future work.
I did a quick check of the treebank comparison pages (those linked from the home page); sa is the only case I caught of treebanks having different scripts.
This is true at the moment as far as I know, but there are other languages that could use multiple writing systems, so it is definitely a property of the treebank rather than the language.
I will raise this at a future meeting of the core group. Assuming there won't be objections, these are the next steps:
Script: Latn (for Latin-based alphabets).
Another candidate is UD_Egyptian, which uses Schenkel transcription rather than hieroglyphs or Gardiner codes, either of which would be conceivable for Egyptian.
If this should be automated, it can be tricky to find code for it (the word "script" is ambiguous when searching for code). Here are some existing solutions (the last one is by me, optimized for speed, not RAM):
https://github.com/cisnlp/GlotScript
https://robvanderg.github.io/scripts/scripts/
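For a rough idea of what such automation could look like, here is a minimal sketch using only the Python standard library. It is not how either of the tools linked above works. It assumes that the first word of a character's Unicode name is a usable stand-in for its script, which is a heuristic and not the Unicode Script property. The mapping table and the function names (`forms_from_conllu`, `guess_script`) are made up for illustration, and the table covers only a handful of scripts.

```python
import unicodedata
from collections import Counter

# Hypothetical, deliberately incomplete mapping from the first word of a
# Unicode character name to an ISO 15924 code. Extend as needed.
NAME_PREFIX_TO_ISO15924 = {
    "LATIN": "Latn",
    "DEVANAGARI": "Deva",
    "CYRILLIC": "Cyrl",
    "GREEK": "Grek",
    "ARABIC": "Arab",
}


def forms_from_conllu(lines):
    """Yield the FORM column (second field) of each token line in CoNLL-U text."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        columns = line.split("\t")
        if len(columns) >= 2:
            yield columns[1]


def guess_script(text):
    """Return the most frequent ISO 15924 code among known characters, or None."""
    counts = Counter()
    for ch in text:
        # Characters without a name get "", which maps to nothing below.
        prefix = unicodedata.name(ch, "").split(" ", 1)[0]
        code = NAME_PREFIX_TO_ISO15924.get(prefix)
        if code is not None:
            counts[code] += 1
    if not counts:
        return None
    return counts.most_common(1)[0][0]


if __name__ == "__main__":
    sample = [
        "# text = रामः गच्छति",
        "1\tरामः\t_\t_\t_\t_\t_\t_\t_\t_",
        "2\tगच्छति\t_\t_\t_\t_\t_\t_\t_\t_",
    ]
    print(guess_script(" ".join(forms_from_conllu(sample))))
```

Taking a majority vote over all word forms in a treebank means that occasional foreign-script tokens would not change the result, but a treebank that genuinely mixes scripts would need a different report (for example, the full counts).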
This is a case I came across when using the UD Sanskrit treebanks (https://universaldependencies.org/sa/index.html).
The two treebanks use two different scripts: UFAL uses Devanagari while Vedic uses Latin.
I suspect this may be true for some other languages as well, though I have yet to do an audit. I am also assuming that script mixing within a single treebank is not a case we have to worry about here.
Currently, the treebank page does not provide any explicit information about the script, although it can be inferred from the examples in the morphology overview section.
I think it would be nice to have that information surfaced more clearly on the treebank page, since the script can be an important treebank characteristic to keep in mind for certain uses.
My preliminary proposal would be to add it as part of the description, similar to Genre, with a hyperlink to a scholarly source on scripts (ScriptSource?).