cldf-clts / clts-legacy

Cross-Linguistic Transcription Systems
Apache License 2.0

Re-arranging the major data sets for CLTS #89

Closed by LinguList 6 years ago

LinguList commented 6 years ago

As of now, I propose a rather radical relaunch, which, however, does not break the tests. We now distinguish

Importantly, I also changed the way we produce a transcription dataset. We start by putting a file into sources, in which we have to specify at least two columns: BIPA and GRAPHEME. BIPA serves one important purpose: if we add a sound in BIPA, we explicitly say that the sound in the respective TD should be interpreted as such. E.g., if you check sources/ruhlen.tsv, you find:

>>> ruhlen = TranscriptionSystem('ruhlen')
>>> ruhlen['tʃ']
'č'
>>> ruhlen.data['tʃ']
[{'bipa_grapheme': 'tʃ',
  'count': '969',
  'features': '',
  'generated': '',
  'grapheme': 'č',
  'image': '',
  'latex': '',
  'note': '',
  'sound': '',
  'url': ''}]

That is because we explicitly linked the two sounds.
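To make the linking step concrete, here is a minimal sketch of how a source file with BIPA and GRAPHEME columns could be read into a lookup keyed by the BIPA sound. It is not the actual CLTS code: the helper name `load_links`, the COUNT column name, and the second (unlinked) sample row are assumptions made for illustration only.

```python
import csv
import io

# Hypothetical excerpt of a source file. Only the BIPA and GRAPHEME columns
# are named in this thread; the COUNT column and the second row are assumed.
SAMPLE_TSV = "BIPA\tGRAPHEME\tCOUNT\ntʃ\tč\t969\n\tx̌\t12\n"


def load_links(handle):
    """Map each explicitly linked BIPA sound to the source rows that carry it."""
    links = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        bipa = row["BIPA"].strip()
        if bipa:  # an empty BIPA cell means no explicit link was made
            links.setdefault(bipa, []).append(row)
    return links


links = load_links(io.StringIO(SAMPLE_TSV))
print(links["tʃ"][0]["GRAPHEME"])
```

Keeping a list of rows per sound mirrors the list returned by `ruhlen.data['tʃ']` above, so that more than one grapheme could be linked to the same BIPA sound.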

From now on, we can manually re-link data in sources, and different versions may have more sophisticated links. Just as we can already explicitly define sounds in the transcription systems, we can now also do so in the transcription data.

I also added a class "", which contains all the data that is not linked inside a given dataset, similar to Concepticon.
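The idea of keeping unlinked data apart can be sketched as follows. This is only an illustration under assumptions, not the class mentioned above: the function name `split_rows` and the sample rows are hypothetical, and a row counts as unlinked here simply when its BIPA cell is empty.

```python
def split_rows(rows):
    """Separate rows with an explicit BIPA link from rows without one."""
    linked, unlinked = [], []
    for row in rows:
        target = linked if row.get("BIPA", "").strip() else unlinked
        target.append(row)
    return linked, unlinked


# Hypothetical rows; the second one has no BIPA link yet.
rows = [
    {"BIPA": "tʃ", "GRAPHEME": "č"},
    {"BIPA": "", "GRAPHEME": "x̌"},
]
linked, unlinked = split_rows(rows)
print(len(linked), len(unlinked))
```

Keeping the unlinked rows in their own collection makes it easy to see what still needs manual linking in sources.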

An open question is how to indicate the differences:

I think we should distinguish these three levels, but I'm not yet sure how best to do this.

LinguList commented 6 years ago

Addendum: to create a TD from the sources, you have to type:

$ clts td

LinguList commented 6 years ago

I just adjusted the data accordingly, proposing a fix for #88.