acoli-repo / acoli-dicts

3000+ machine-readable open source dictionaries distributed by the Applied Computational Linguistics lab at the University of Augsburg, Germany, and by the research group Linked Open Dictionaries (LiODi, funded 2015-2020 by BMBF at Goethe University Frankfurt, Germany). All data provided in OntoLex-Lemon and TIAD-TSV.
Apache License 2.0

Revise graph compilation #11

Open chiarcos opened 3 years ago

Update /stable/dicts-w-legend.gif and /dicts/dicts-w-legend.gif. Revise the scripts (stable/scripts/build-dict-graph.sh, stable/scripts/build-dict-graph-incl-exp.sh) so that they use only the statistics in the files langs.tsv and lang-pairs.tsv that each data set should provide (issue #10). The classification of language codes is currently hard-wired in both scripts. Instead, create a mapping file in scripts (/stable/scripts/langs.tsv) with the following tab-separated columns:

TAG NAME GROUP AFFILIATION

with
TAG: BCP-47 tag (primary language tag only, ignore everything after -)
NAME: name according to ISO 639-3
GROUP: major language group or geographic region
AFFILIATION: major language group (free text) or other comments

e.g.,

en English GERMANIC Germanic, Indo-European
zh Chinese EAST_ASIA Sino-Tibetan
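A minimal Python sketch of how a script could read such a mapping file and look up the GROUP for a language tag. The helper names (`load_groups`, `primary_subtag`, `group_for`), the `UNCLASSIFIED` fallback label, and the handling of an optional header row are assumptions made for illustration; they are not taken from the existing scripts.

```python
import csv
import io

# Hypothetical sample in the four-column, tab-separated layout described above.
SAMPLE = (
    "en\tEnglish\tGERMANIC\tGermanic, Indo-European\n"
    "zh\tChinese\tEAST_ASIA\tSino-Tibetan\n"
)


def load_groups(handle):
    """Map each TAG to its GROUP from a tab-separated stream."""
    groups = {}
    for row in csv.reader(handle, delimiter="\t"):
        # Skip short rows and an optional header row (assumption: the file
        # may or may not start with "TAG NAME GROUP AFFILIATION").
        if len(row) < 3 or row[0] == "TAG":
            continue
        groups[row[0].strip().lower()] = row[2].strip()
    return groups


def primary_subtag(bcp47_tag):
    """Keep only the primary language subtag, i.e. drop everything after '-'."""
    return bcp47_tag.split("-", 1)[0].lower()


def group_for(bcp47_tag, groups, default="UNCLASSIFIED"):
    """Look up the GROUP for a tag; the default label is a placeholder."""
    return groups.get(primary_subtag(bcp47_tag), default)


groups = load_groups(io.StringIO(SAMPLE))
print(group_for("en-GB", groups))
print(group_for("zh-Hant", groups))
print(group_for("eu", groups))
```

In practice the `io.StringIO(SAMPLE)` argument would be replaced by an open file handle on the mapping file.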

Current set of GROUPs:

Indo-European (different shades of grey):
GERMANIC
CELTIC
ROMANCE
ITALIC
SLAVIC
BALTIC
IRANIAN
INDIAN (incl. Romani)
OTHER_IE (Albanian, Greek, Armenian, Anatolian, Tokharian, etc.)

Note that pidgins and creoles are classified along with the language they derive from, e.g., English-based creoles are classified like English (but mark that under AFFILIATION). Note that artificial languages based on European languages, e.g., Esperanto, are not considered Indo-European.

Other languages (different colors):
AFROASIATIC (called SEMITIC in the script)
ALTAIC (Turkic, Mongolic, Tungusic, excluding Korean and Japanese)
URALIC
DRAVIDIAN
CAUCASIAN (NE, SW, NW Caucasian)
PACIFIC (native languages of Australia and Papua New Guinea, and Austronesian languages, incl. Malagasy)
SUBSAHARIC (native languages of Africa, excluding Afroasiatic and immigrant languages such as Malagasy)
EAST_ASIA (languages of Eastern and Southern Asia that are neither Indo-European, Dravidian, Austronesian nor Altaic)
AMERICA (native languages of North or South America)

In addition, some languages are unclassified (e.g., artificial languages, Basque, Sumerian, Elamite, etc.).
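Once the GROUP labels live in a mapping file, they can be checked against the set listed above, and the Indo-European groups can be given distinct shades of grey. The following Python sketch illustrates both. The function names, the convention that an empty GROUP means "unclassified", and the concrete grey values are assumptions for illustration only, not the colors used in the existing graphs.

```python
IE_GROUPS = [
    "GERMANIC", "CELTIC", "ROMANCE", "ITALIC", "SLAVIC",
    "BALTIC", "IRANIAN", "INDIAN", "OTHER_IE",
]
OTHER_GROUPS = [
    "AFROASIATIC", "ALTAIC", "URALIC", "DRAVIDIAN", "CAUCASIAN",
    "PACIFIC", "SUBSAHARIC", "EAST_ASIA", "AMERICA",
]
KNOWN_GROUPS = set(IE_GROUPS) | set(OTHER_GROUPS)


def unknown_groups(rows):
    """Return (tag, group) pairs whose GROUP is not in the current set.

    Assumption: an empty GROUP marks an unclassified language and is allowed.
    """
    return [
        (tag, group)
        for tag, _name, group, _affiliation in rows
        if group and group not in KNOWN_GROUPS
    ]


def grey_shades(groups, darkest=0x40, lightest=0xD0):
    """Assign evenly spaced grey levels (hex colors) to the given groups."""
    steps = max(len(groups) - 1, 1)
    shades = {}
    for i, group in enumerate(groups):
        level = darkest + i * (lightest - darkest) // steps
        shades[group] = "#{0:02x}{0:02x}{0:02x}".format(level)
    return shades


# Hypothetical rows; "SEMITIC" is the legacy label mentioned above.
rows = [
    ("en", "English", "GERMANIC", "Germanic, Indo-European"),
    ("eu", "Basque", "", "isolate"),
    ("ar", "Arabic", "SEMITIC", "Semitic, Afroasiatic"),
]
print(unknown_groups(rows))
print(grey_shades(IE_GROUPS))
```

A check like this would flag rows that still carry the legacy SEMITIC label instead of AFROASIATIC.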

Note that the current set of GROUPs is not meant to be linguistically adequate; its varying levels of granularity (coarse-grained geographic region, macro-family, language family) only reflect the composition of the dataset. Separating the mapping table from the code is a first step towards implementing a more linguistically adequate classification.