cldf-clts / clts-legacy

Cross-Linguistic Transcription Systems
Apache License 2.0

adding metadata: what is the best way to do it? #16

Closed LinguList closed 7 years ago

LinguList commented 7 years ago

We can now use a feature bundle to identify a sound uniquely (or at least we should be able to). The definition of a sound, that is, its feature bundle, is however spread over four columns: three base columns plus the column with key:value pairs. This is useful, but if we want to add metadata, I am wondering where to add it:

  1. should we extract a list of all "names" of the sounds currently present and define metadata on a per-source basis (phoible, fonetikode, Ruhlen, pBase, some feature sets, also sound classes),
  2. or should we use the four columns to handle this, creating for each metadata collection a large CSV file with four + n columns, n being the number of columns needed for the metadata?
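To make the two options concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the column names (`base1`, `base2`, `base3`, `extras`), the source names, the sound names and the `id` values are placeholders, not the actual CLTS or source layouts.

```python
import csv
import io

# Option 1: metadata kept per source and keyed by the sound name.
# All names and values are hypothetical placeholders.
per_source = {
    "source_a": {"voiced bilabial plosive consonant": {"id": "a-1"}},
    "source_b": {
        "voiced bilabial plosive consonant": {"id": "b-7"},
        "voiceless alveolar plosive consonant": {"id": "b-9"},
    },
}


def metadata_for(name):
    """Collect the metadata of every source that knows this sound name."""
    return {src: table[name] for src, table in per_source.items() if name in table}


# Option 2: one CSV per metadata collection with four + n columns
# (here n = 1, the hypothetical "id" column).
option2_csv = """base1,base2,base3,extras,id
bilabial,plosive,voiced,,b-7
alveolar,plosive,voiceless,,b-9
"""


def read_collection(text, n_key_columns=4):
    """Index rows by the tuple of the first four (feature-bundle) columns."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    index = {}
    for row in reader:
        key = tuple(row[:n_key_columns])
        index[key] = dict(zip(header[n_key_columns:], row[n_key_columns:]))
    return index
```

With option 1 the lookup key is a single string, which is only as stable as the name itself; with option 2 the key is the four-column tuple, which has to be repeated in every metadata file.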

I'd prefer 1, but if we take the Sound().name as the identifier, it is a bit dangerous that we generate it in the first place; yet if we don't generate it, we may mess up the order when adding new sounds. Currently, a name is created by defining an order of features in an array, but we may well add new features in the future, and we may even change the order if people complain.
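The risk with generated names can be illustrated with a small sketch. The feature order, the feature names and the `make_name` and `make_key` helpers below are hypothetical and are not the code behind Sound().name; they only show how a name derived from an ordered array changes as soon as the order does.

```python
# Hypothetical feature order; the real order used for Sound().name is not shown here.
FEATURE_ORDER = ["phonation", "place", "manner"]


def make_name(features, order=FEATURE_ORDER, sound_type="consonant"):
    """Join the feature values in the given order and append the sound type."""
    values = [features[f] for f in order if f in features]
    return " ".join(values + [sound_type])


b = {"phonation": "voiced", "place": "bilabial", "manner": "plosive"}
name_v1 = make_name(b)
# Reordering the array changes the generated name for the very same sound.
name_v2 = make_name(b, order=["place", "manner", "phonation"])


def make_key(features):
    """A key that ignores ordering: a frozenset of (feature, value) pairs."""
    return frozenset(features.items())
```

A set-based key like `make_key` is one possible way to stay independent of the order, but it would still change if a new feature were added to an existing sound, so it does not remove the problem entirely.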

In any case: we need to set up an example for metadata, and metadata will exist along the following lines (judging from what is there in the literature):

  1. inventory datasets (phoible, pbase, fonetikode, Ruhlen)
  2. feature datasets (pbase has features for Chomsky and Halle and a few more, phoible also has features, which I'd like to have separate, so they can be easily accessed)
  3. sound classes (we'll need to add them, since only with sound classes can we convert from a higher-resolution alphabet to a lower-resolution one, but we can automate the first creation with lingpy and refine later)
  4. transcription systems (in the end, we can just treat this as a link, even if it is not handled that way yet)
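As a rough illustration of item 3, converting from a higher-resolution alphabet to a lower-resolution one amounts to a many-to-one mapping. The mapping below is a made-up toy example, not an actual lingpy sound-class model, and `to_classes` is a hypothetical helper.

```python
# Hypothetical mapping from sounds to coarse classes; not an actual lingpy model.
SOUND_CLASSES = {"p": "P", "b": "P", "t": "T", "d": "T", "a": "A", "e": "E"}


def to_classes(segments, classes=SOUND_CLASSES, missing="?"):
    """Replace each segment by its class; unknown segments get a placeholder."""
    return [classes.get(s, missing) for s in segments]
```

Because several sounds share one class, the conversion only works in this direction, which is why the classes would have to be stored as metadata on the sounds.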

LinguList commented 7 years ago

Maybe it is possible to keep the current CLTS machinery as a kind of explorer code that makes it easy to expand the data, while setting up explicit collections of metadata in which we spell out the feature bundles and link to the four base types? A lookup in clts would then only need the metadata, but if it fails, new sounds could be generated for lexibank, then evaluated and added to the base list of sounds, similar to the way we do it in concepticon.
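The lookup-then-generate idea could look roughly like the sketch below. The `CURATED` table, the `PENDING` list and both functions are hypothetical stand-ins and do not reflect the existing CLTS code; `generate_sound` merely takes the place of the current generative machinery.

```python
# Explicit, curated collection: feature bundle -> metadata. Hypothetical content.
CURATED = {
    frozenset({"voiced", "bilabial", "plosive"}): {
        "name": "voiced bilabial plosive consonant"
    },
}
PENDING = []  # generated sounds awaiting evaluation


def generate_sound(bundle):
    """Stand-in for the existing generative ('explorer') machinery."""
    return {"name": " ".join(sorted(bundle)) + " consonant", "generated": True}


def lookup(features):
    """Use the curated metadata first; fall back to generation if that fails."""
    bundle = frozenset(features)
    if bundle in CURATED:
        return CURATED[bundle]
    sound = generate_sound(bundle)
    PENDING.append(sound)  # to be reviewed and, if accepted, added to CURATED
    return sound
```

Sounds that end up in `PENDING` would be the candidates to evaluate and move into the base list, in the spirit of the concepticon workflow mentioned above.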

LinguList commented 7 years ago

I have this first proposal, by which:

In this sense, I think we are good, apart from the JSON metadata specifications for the different values and types of the metadata.