acoli-repo / acoli-dicts

3000+ machine-readable open source dictionaries distributed by the Applied Computational Linguistics lab at the University of Augsburg, Germany, and by the research group Linked Open Dictionaries (LiODi, funded 2015-2020 by BMBF at Goethe University Frankfurt, Germany). All data is provided in OntoLex-Lemon and TIAD-TSV.
Apache License 2.0

ACoLi Dicts

Large-scale, machine-readable bilingual dictionaries provided by the Chair for Applied Computational Linguistics (ACoLi) at the University of Augsburg, Germany. As a technical basis, we employ OntoLex-Lemon (https://www.w3.org/2016/05/ontolex/) for data modelling, OLiA (http://purl.org/olia) for representing grammatical information, lexvo (http://lexvo.org) for ISO 639 language identifiers and GlottoLog (http://glottolog.org) for identifiers of non-ISO-639 language varieties.
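
To make the modelling vocabulary more concrete, here is a minimal sketch of how one translation pair could be written as OntoLex-Lemon Turtle using only the Python standard library. The class and property names come from the W3C OntoLex-Lemon core and vartrans modules. The `http://example.org/dict/` namespace, the local identifiers, and the helper function names are hypothetical, and the actual ACoLi files may be modelled differently.

```python
# Illustrative only: emits Turtle text for one translation pair.
# The base namespace and all local identifiers are hypothetical.
PREFIXES = (
    "@prefix ontolex: <http://www.w3.org/ns/lemon/ontolex#> .\n"
    "@prefix vartrans: <http://www.w3.org/ns/lemon/vartrans#> .\n"
    "@prefix : <http://example.org/dict/> .\n"
)


def turtle_literal(text, lang):
    """Return a language-tagged Turtle string literal with basic escaping."""
    escaped = (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )
    return f'"{escaped}"@{lang}'


def entry_block(local_id, written_rep, lang):
    """Describe a lexical entry with one canonical form and one sense."""
    literal = turtle_literal(written_rep, lang)
    return (
        f":{local_id} a ontolex:LexicalEntry ;\n"
        f"    ontolex:canonicalForm [ ontolex:writtenRep {literal} ] ;\n"
        f"    ontolex:sense :{local_id}_sense .\n"
        f":{local_id}_sense a ontolex:LexicalSense .\n"
    )


def translation_pair_turtle(src_id, src_word, src_lang, tgt_id, tgt_word, tgt_lang):
    """Return a Turtle document linking the senses of two entries."""
    translation = (
        f":{src_id}_to_{tgt_id} a vartrans:Translation ;\n"
        f"    vartrans:source :{src_id}_sense ;\n"
        f"    vartrans:target :{tgt_id}_sense .\n"
    )
    return "\n".join(
        [
            PREFIXES,
            entry_block(src_id, src_word, src_lang),
            entry_block(tgt_id, tgt_word, tgt_lang),
            translation,
        ]
    )
```

Calling `translation_pair_turtle("haus_de", "Haus", "de", "house_en", "house", "en")` returns a small Turtle document with two lexical entries and one `vartrans:Translation` between their senses. The local identifiers are assumed to be valid Turtle local names; the sketch does not check this.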

At the moment, we provide OntoLex-Lemon and TIAD-TSV editions of open source dictionaries for more than 400 language varieties and more than 2500 language pairs (stable and experimental), with more than 3000 lexical data sets in total; see the statistics below. Note that these counts exclude most smaller data sets (those with fewer than 10,000 translation pairs). Additional data has been converted but is still awaiting copyright clearance.

The data is currently maintained by the Chair for Applied Computational Linguistics (ACoLi) at the University of Augsburg, Germany. The initial development of this data took place from 2015 to 2022 at the Applied Computational Linguistics lab of Goethe University Frankfurt, Germany, in the context of the BMBF-funded research group Linked Open Dictionaries (LiODi, 2015-2022) and in collaboration with the Institute for Empirical Linguistics at Goethe University Frankfurt. The LiODi project aimed to create Linked Open Data representations of dictionaries and to develop an infrastructure and methodologies for their practical application in language contact studies, mostly in Eurasia and the Caucasus area.

dictionary graph

Overview

| | languages | lexical data sets | license | OntoLex/RDF data | TIAD/TSV data | comments |
|---|---|---|---|---|---|---|
| Apertium | 46 | 55 | GPL | `apertium/apertium-rdf-2019-02-03` (`*.rdf.zip`) | `apertium/apertium-rdf-2019-02-03` (`trans*tsv.gz`) | modeling based on http://linguistic.linkeddata.es/apertium/, designed for machine translation |
| FreeDict | 45 | 145 | GPL | `freedict/freedict-rdf-2019-02-05` (`*/*.ttl.gz`) | `freedict/freedict-rdf-2019-02-05` (`*/*.tsv.gz`) | plain word lists, user-generated content |
| DBnary | 119* | 275* | CC-BY-SA 3.0 | external | `dbnary/dbnary-tiad-2019-02-16` | * counted only language pairs with 10,000+ entries, user-generated content |
| PanLex | 194* | 1651** | CC0 | `panlex/panlex-20191001-csv-rdf` | `panlex/biling-tsv` | * only language pairs with 10,000 entries; ** TIAD-TSV files |
| MUSE | 45 | 107 | CC-BY-NC 4.0 | `muse/muse-rdf-2020-06-12` | `muse-tsv-2020-06-12` | machine-generated, high-precision wordlist |
| Wikidata | * | * | CC0 | https://www.wikidata.org (external) | `wikidata/wikidata-tsv-2020-06-24` | * >400k translation pairs, >90k language pairs, but very sparse |
| OMW | 34 | 40* | open source | external | `omw/tsv` | * conservative estimate, restricted to combinations of OMW files with identical licenses |
| IDS | 234* | 792*,** | CC-BY 4.0 | `ids/ontolex` | `ids/tsv` | * counted only language pairs with >10k translations, ** TIAD-TSV files |
| XDXF | 51 | 107 | GPL | `experimental/xdxf/xdxf-rdf-2019-02-22` (`*/*.ttl.gz`) | `experimental/xdxf/xdxf-rdf-2019-02-22` (`*/*.tsv.gz`) | experimental |
| free-dict.de | 2 | 1 | "free" | `experimental/free-dict.de/free-dict-de-2020-01-02` (`*.ttl.gz`) | `experimental/free-dict.de/free-dict-de-2020-01-02` (`*.tsv.gz`) | experimental (partial) |
| StarDict | 32 | 130 | "open"/"free" | `experimental/stardict/stardict-2020-01-04` (`*/*.ttl.gz`) | `experimental/stardict/stardict-2020-01-04` (`*/*.tsv.gz`) | experimental (partial) |
| total | 430 | 3143 | | | | |
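
The TIAD-TSV editions listed above are distributed as gzip-compressed, tab-separated files. The following is a minimal sketch of how such a file might be read with the Python standard library and turned into a simple lookup table. The function names are hypothetical, and the column positions are deliberately left as parameters: this README does not specify the column layout, so inspect the first rows of a file before choosing them.

```python
import csv
import gzip
from collections import defaultdict


def read_tsv_rows(path):
    """Yield each non-empty row of a gzip-compressed TSV file as a list of strings."""
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        # QUOTE_NONE keeps quotation marks inside fields as literal characters.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if row:
                yield row


def build_lookup(rows, source_col, target_col):
    """Map each value in source_col to the set of values seen in target_col.

    The column indices are assumptions supplied by the caller; they are
    not taken from any specification in this repository.
    """
    lookup = defaultdict(set)
    for row in rows:
        lookup[row[source_col]].add(row[target_col])
    return lookup
```

For a file whose source and target forms turn out to sit in the first and second columns, `build_lookup(read_tsv_rows(path), source_col=0, target_col=1)` would return a dictionary from source forms to sets of translations. Rows are streamed, so only the lookup table itself is held in memory.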

subdirectories

acknowledgements, licensing and references

The ACoLi Dictionary Graph is maintained and continues to be developed at the Chair of Applied Computational Linguistics at the University of Augsburg, Germany. Before 2023, it was developed at the Applied Computational Linguistics Lab at Goethe Universität Frankfurt, Germany, starting in 2014, in the context of numerous research projects.

To cite the dataset as a whole in scientific publications, please cite Chiarcos et al. (2020):

@inproceedings{chiarcos2020acoli,
  title={The ACoLi Dictionary Graph},
  author={Chiarcos, Christian and F{\"a}th, Christian and Ionov, Maxim},
  booktitle={Proceedings of The 12th Language Resources and Evaluation Conference},
  pages={3281--3290},
  year={2020}
}

All datasets are published under open or non-commercial licenses. Our RDF and TIAD-TSV editions are released under the same license as the underlying source data. For detailed acknowledgements and licensing of individual datasets, see the respective subdirectories.