digling / cddb

Chinese Dialect Database
GNU General Public License v3.0

CDDB: Chinese Dialect Database

This database aggregates many kinds of linguistic information on Chinese dialects, ranging from lexical datasets and lists of character readings in ancient and contemporary varieties to proposed classifications. The database is managed with the help of a Python library that tests whether the data is correctly encoded and checks its consistency. The library has several dependencies that are themselves still under development, including the concepticon api, parts of the lexibank-api, the cross-linguistic phonetic alphabet, lingpy, the not yet released CLICS api for network manipulations, and sinopy. The data will be edited with the edictor tool.
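To give an idea of what an encoding test might look like, here is a minimal sketch. It is not the CDDB library's own code: the function name `encoding_problems` is hypothetical, and the two checks (valid UTF-8, NFC normalization) are assumptions about what "correctly encoded" could mean for phonetic transcriptions and characters.

```python
import unicodedata
from typing import List


def encoding_problems(raw: bytes) -> List[str]:
    """Return human-readable problems found in the raw content of a data file.

    Hypothetical helper: the checks are illustrative assumptions,
    not the checks the CDDB library actually performs.
    """
    # Check 1: the content must decode as UTF-8.
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError as err:
        return [f"not valid UTF-8 at byte {err.start}"]

    # Check 2: every line must already be in Unicode NFC form, so that
    # visually identical segments compare equal as strings.
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        if unicodedata.normalize("NFC", line) != line:
            problems.append(f"line {number}: not NFC-normalized")
    return problems
```

An empty list means that no problem was found. Note that NFC leaves modifier letters and superscript tone digits untouched, so it only flags decomposed sequences such as a base letter followed by a combining diacritic where a precomposed character exists.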

Data Types in CDDB

Currently, I plan to include the following data types (for which example datasets already exist):

How Data will be Added

Data will be added in an ad hoc manner: when I learn that certain data is available, I will work out how to add it, either by typing it in myself (see dataset Liu2007) or by looking for sources where it was published in a digitally accessible form. The new source will be added to the references, and a folder containing the data will be created under datasets/ in the repository. The data may still change afterwards, as errors are corrected and new aspects (enhanced cognate judgments, etc.) are added from time to time. Since the data will be in flux to some degree, releases will freeze particular stages of the data, so that users can rely on one version that is archived and assigned a DOI at Zenodo.
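Since every dataset gets its own folder under datasets/, the datasets available in a local checkout can be listed with a few lines of Python. The following is a sketch only: the function name `list_datasets` is hypothetical, and it assumes nothing about the layout beyond one subfolder per dataset.

```python
from pathlib import Path
from typing import List


def list_datasets(repo_root: str) -> List[str]:
    """Return the sorted names of all dataset folders in a checkout.

    Hypothetical helper: assumes one subfolder per dataset under
    `datasets/`, as described above, and ignores loose files.
    """
    datasets_dir = Path(repo_root) / "datasets"
    if not datasets_dir.is_dir():
        return []
    return sorted(p.name for p in datasets_dir.iterdir() if p.is_dir())
```

Calling `list_datasets(".")` from the root of the repository would then return the folder names, among them `Liu2007` if that dataset is present in the checkout.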