acoli-repo / acoli-dicts

3000+ machine-readable open source dictionaries distributed by the Applied Computational Linguistics lab at the University of Augsburg, Germany, and by the research group Linked Open Dictionaries (LiODi, funded 2015-2020 by BMBF at Goethe University Frankfurt, Germany). All data provided in OntoLex-Lemon and TIAD-TSV.
Apache License 2.0

Consolidate statistics #10

Open chiarcos opened 3 years ago

chiarcos commented 3 years ago

For every dataset (stable and experimental), provide a file langs.tsv and a file lang-pairs.tsv in the root directory of the dataset.

Use the following structure:

langs.tsv: TAG<TAB>FILE<TAB>ENTRIES<TAB>LICENSE

TAG: primary BCP47 language tag, omitting subtags, e.g., en for en-US.
FILE: OntoLex RDF file; it can be inside a zip (or other) archive. A file within an archive should be separated from the archive path with a colon (:).
ENTRIES: number of lexical entries, i.e., number of lexical entry URIs.
LICENSE: license acronym.

example:

en<TAB>ontolex/archive.zip:en/dict1.ttl<TAB>10000<TAB>CC-BY 4.0

Note that multiple dictionaries per language variety can exist.
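As a rough illustration of the langs.tsv layout described above, here is a minimal Python sketch that reads such rows and splits FILE into archive and member path. The names `LangRow`, `primary_tag` and `parse_langs_tsv` are hypothetical and not part of the repository. The sketch assumes that the first colon in FILE separates the archive from the file inside it, and that the file has no header line.

```python
import csv
import io
from typing import List, NamedTuple, Optional


class LangRow(NamedTuple):
    tag: str                # primary BCP47 language tag, e.g. "en"
    archive: Optional[str]  # archive path, or None if FILE is not in an archive
    path: str               # path of the OntoLex RDF file
    entries: int            # number of lexical entries
    license: str            # license acronym


def primary_tag(bcp47: str) -> str:
    """Reduce a BCP47 tag to its primary language subtag, e.g. en-US -> en."""
    return bcp47.split("-", 1)[0].lower()


def parse_langs_tsv(text: str) -> List[LangRow]:
    """Parse langs.tsv content: TAG<TAB>FILE<TAB>ENTRIES<TAB>LICENSE per row."""
    rows = []
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        if not fields:
            continue  # skip empty lines
        tag, file_, entries, license_ = fields
        # Assumption: the first ":" separates the archive from its member file.
        archive, sep, member = file_.partition(":")
        if sep:
            rows.append(LangRow(tag, archive, member, int(entries), license_))
        else:
            rows.append(LangRow(tag, None, file_, int(entries), license_))
    return rows


sample = "en\tontolex/archive.zip:en/dict1.ttl\t10000\tCC-BY 4.0\n"
rows = parse_langs_tsv(sample)
print(rows[0])
```

Because several dictionaries can exist per language variety, a consumer should treat TAG as a non-unique key and group rows by it instead of building a one-to-one mapping.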

lang-pairs.tsv: SRC<TAB>TGT<TAB>FILE<TAB>ROWS<TAB>SOURCES

SRC: source language tag (see TAG above).
TGT: target language tag (see TAG above).
FILE: TIAD-TSV file (see FILE above).
ROWS: number of rows in FILE, i.e., translation pairs. FILE must not contain duplicates.
SOURCES: one or multiple source files; these should correspond to FILE entries in langs.tsv so that the license can be recovered.
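The ROWS and SOURCES constraints could be checked along the following lines. This is only a sketch under stated assumptions: `count_rows` and `unknown_sources` are hypothetical helper names, duplicates are detected by comparing whole lines, the TIAD-TSV file is assumed to have no header line, and the sample rows are placeholders, not real TIAD-TSV content.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple


def count_rows(lines: Iterable[str]) -> Tuple[int, List[str]]:
    """Return (number of non-empty rows, sorted list of rows occurring more than once)."""
    counts = Counter(line.rstrip("\r\n") for line in lines if line.strip())
    duplicates = sorted(row for row, n in counts.items() if n > 1)
    return sum(counts.values()), duplicates


def unknown_sources(sources: Iterable[str], langs_files: Set[str]) -> List[str]:
    """Return SOURCES values that have no matching FILE entry in langs.tsv."""
    return [s for s in sources if s not in langs_files]


# Placeholder rows standing in for the lines of a TIAD-TSV file.
tiad_lines = ["a\tb\n", "c\td\n", "a\tb\n"]
n_rows, dups = count_rows(tiad_lines)
print(n_rows, dups)

# FILE values as they would be collected from langs.tsv.
langs_files = {"ontolex/archive.zip:en/dict1.ttl"}
missing = unknown_sources(["ontolex/archive.zip:en/dict1.ttl", "x.ttl"], langs_files)
print(missing)
```

A non-empty duplicate list means FILE violates the no-duplicates rule, and a non-empty `missing` list means the license of at least one source cannot be recovered from langs.tsv.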