Re-arranging the major data sets for CLTS

LinguList commented 6 years ago

I propose a rather radical relaunch, which nevertheless does not break the tests. We now distinguish:

TS, as a transcription system, as we know it
TD, as transcription data,
SC, as sound class systems (they are a mix between transcription-system and transcription-data, as they can generate unknown sounds)

Importantly, I also changed the way we produce a transcription dataset. We start by putting a file into sources, in which we have to specify at least two columns: BIPA and GRAPHEME. BIPA serves one important purpose: if we add a sound in BIPA, we explicitly state that the sound in the respective TD should be interpreted as such. E.g., if you check sources/ruhlen.tsv, you find:

>>> ruhlen = TranscriptionSystem('ruhlen')
>>> ruhlen['tʃ']
'č'
>>> ruhlen.data['tʃ']
[{'bipa_grapheme': 'tʃ',
  'count': '969',
  'features': '',
  'generated': '',
  'grapheme': 'č',
  'image': '',
  'latex': '',
  'note': '',
  'sound': '',
  'url': ''}]

That is because we explicitly linked the two sounds.
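To illustrate the idea of explicit links, here is a minimal sketch that reads a sources-style TSV and collects the manually linked sounds. It does not use the CLTS code itself: the function name `explicit_links` and the sample rows are hypothetical, and it assumes that an empty BIPA cell means no manual link was given. Only the column names BIPA and GRAPHEME come from the description above.

```python
import csv
import io

# Hypothetical excerpt of a sources file. Only the column names BIPA and
# GRAPHEME are taken from the description above; the rows are made up.
SAMPLE_TSV = "BIPA\tGRAPHEME\ntʃ\tč\n\tX\n"


def explicit_links(tsv_text):
    """Map each non-empty BIPA value to the source grapheme linked to it."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    links = {}
    for row in reader:
        bipa = (row.get("BIPA") or "").strip()
        # Assumption: an empty BIPA cell means the row has no manual link.
        if bipa:
            links[bipa] = row["GRAPHEME"].strip()
    return links


links = explicit_links(SAMPLE_TSV)
print(links)
```

Rows without a BIPA value are skipped here, so they would be left to whatever automatic mapping is applied later.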

From now on, we can manually re-link data in sources, and different versions may have more sophisticated links. Just as we can already define sounds explicitly in the transcription systems, we can now also do so in the transcription data.

I also added a class "", which contains all the data that is not linked inside a given dataset, similar to Concepticon.

An open question is how to indicate the differences between the following cases:

we have mapped a sound manually and explicitly (č vs. tS in Ruhlen)
we have automatically mapped a sound, and this sound occurs regularly in our definitions of BIPA
we have automatically mapped a sound, but this sound does not occur regularly in BIPA, so its "generated" attribute is set to "+"

I think we should distinguish these three levels, but I'm not yet sure how best to do this.
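One possible way to keep the three levels apart, as a rough sketch only: the `LinkType` names and the `explicit` key are hypothetical and not part of the current data, while the "generated" attribute with the value "+" is the one described above.

```python
from enum import Enum


class LinkType(Enum):
    EXPLICIT = "explicit"    # mapped by hand in the sources file
    REGULAR = "regular"      # mapped automatically, sound is defined in BIPA
    GENERATED = "generated"  # mapped automatically, sound had to be generated


def link_type(entry, explicit_key="explicit"):
    """Classify one data entry (a dict of string attributes).

    `explicit_key` is a hypothetical marker column; "generated" == "+"
    follows the convention mentioned in the text.
    """
    if entry.get(explicit_key) == "+":
        return LinkType.EXPLICIT
    if entry.get("generated") == "+":
        return LinkType.GENERATED
    return LinkType.REGULAR


print(link_type({"grapheme": "č", "explicit": "+", "generated": ""}))
```

Checking the manual marker first means that a hand-made link takes precedence even if the target sound is also a generated one; whether that order is right is part of the open question.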

LinguList commented 6 years ago

Addendum: to create a TD from the sources, you have to type:

$ clts td

LinguList commented 6 years ago

I just adjusted the data accordingly, proposing a fix for #88.

cldf-clts / clts-legacy

Re-arranging the major data sets for CLTS #89