cldf-clts / clts-legacy

Cross-Linguistic Transcription Systems
Apache License 2.0

[Web Representation] Rendering CLTS in CLLD #85

Closed LinguList closed 6 years ago

LinguList commented 6 years ago

I spent some time today trying to understand how CLLD works, and I succeeded in rendering some basic aspects. But I realized that at some point it will simply be faster to learn directly from @xrotwang how to do it, and that we need to decide on the basic underlying model, as well as on the different views we want to have on the data.

Here's a really rough account of my current thinking:

  1. we want clts/parameters to correspond to our bipa[sound].name schema, so that people can link to the resource
  2. currently, BIPA lists some 700 sounds explicitly, but we can use it to generate the more than 3500 different sounds we encounter in TranscriptionData, so each release should provide a table of all aggregated sounds, interpreted correctly (according to our evaluation principles) with BIPA
  3. all additional transcription systems should also try to render the master set of sounds, where possible
  4. we currently list a rather limited set of metadata, which could be expanded in the future:
    • grapheme (almost all TD have it)
    • frequency (we could add this from phoible and other TD as well)
    • url (most TD have it, but not all)
    • sound (only one TD, that is: the wave/mp3 file for illustration)
    • image (again only one, but maybe useful for illustration, where we have it)
    • features (very useful, also one major asset for future expansion by adding feature sets of authorities or whomever)

The question is: how to proceed? This would also lend itself to csvw, since we could create csv files with json metadata describing the type of dataset, which would ease deployment in CLLD. The parameters view should be minimal, I guess, showing only:

I would call this view the "Sounds" or "Sound Segments".

Then we could list all "Graphemes" as values, where we show (this is licensed by the linguistic definition of grapheme, which allows a grapheme to contain multiple letters; see here):

Note that the mapping between "grapheme" and "name" per dataset is n to 1: each grapheme has one name, but the same name may have several graphemes in a given dataset.
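The n-to-1 constraint can be checked mechanically. What follows is only a minimal sketch, not part of clts: the function name `check_grapheme_names`, the dataset label `td1`, and the sample sound names are made up for illustration.

```python
from collections import defaultdict


def check_grapheme_names(rows):
    """Return all (dataset, grapheme) pairs that map to more than one name.

    `rows` is an iterable of (dataset, grapheme, name) triples.
    An empty result means the n-to-1 constraint holds.
    """
    names = defaultdict(set)
    for dataset, grapheme, name in rows:
        names[(dataset, grapheme)].add(name)
    return {key: sorted(vals) for key, vals in names.items() if len(vals) > 1}


# Illustrative sample data: two graphemes share one name (allowed),
# and no grapheme carries two names.
rows = [
    ("td1", "th", "aspirated voiceless alveolar stop consonant"),
    ("td1", "tʰ", "aspirated voiceless alveolar stop consonant"),
    ("td1", "a", "unrounded open front vowel"),
]
conflicts = check_grapheme_names(rows)
print(conflicts)
```

Several graphemes pointing to the same name pass the check; only a grapheme with two different names inside one dataset is reported.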

Then, we would have the contributions, similar to concepticon, where we list what sources helped to create a given TS or TD. Here, I can think of the following items:

The value for a given Grapheme in a given dataset would list the additional data, like frequency of representation, URL, features, etc.
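As a rough sketch of what such a value record could look like, assuming the metadata items listed earlier (frequency, url, sound, image, features) are all optional: the class name `Value` and its field names are hypothetical and do not reflect the actual CLLD model.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Value:
    """One grapheme in one dataset, plus whatever metadata that dataset offers."""
    dataset: str
    grapheme: str
    name: str                        # sound name the grapheme maps to
    frequency: Optional[int] = None
    url: Optional[str] = None
    sound: Optional[str] = None      # e.g. path to a wave/mp3 file
    image: Optional[str] = None
    features: dict = field(default_factory=dict)

    def metadata(self):
        """Return only the optional fields that are actually filled."""
        extra = {
            "frequency": self.frequency,
            "url": self.url,
            "sound": self.sound,
            "image": self.image,
            "features": self.features or None,
        }
        return {key: val for key, val in extra.items() if val is not None}


value = Value("td1", "a", "unrounded open front vowel", frequency=3)
print(value.metadata())
```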

LinguList commented 6 years ago

The easiest way to produce this data seems to me to be adding a dump() command to the clts commands, which would create the relevant data tables in csv format (ideally with proper csvw metadata). To keep links, we would further need to add identifiers among the values, so:

I think I can easily start drafting the dump() command so that it keeps these inter-relationships in the data and also tests for consistency.
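A dump along these lines could be sketched as follows. This is not the actual clts command, only an illustration under assumptions: the table names (`sounds.csv`, `graphemes.csv`), the column names, and the sample rows are invented, and the consistency test is limited to checking that every grapheme points to an existing sound identifier. The metadata uses the csvw `primaryKey` and `foreignKeys` properties to record the links.

```python
import csv
import json
import tempfile
from pathlib import Path

SOUND_COLS = ["ID", "NAME"]
GRAPHEME_COLS = ["ID", "DATASET", "GRAPHEME", "SOUND_ID"]


def dump(sounds, graphemes, outdir):
    """Write two linked csv tables plus csvw metadata into `outdir`."""
    outdir = Path(outdir)

    # Consistency test: every grapheme must reference a known sound.
    sound_ids = {sound["ID"] for sound in sounds}
    dangling = [g["ID"] for g in graphemes if g["SOUND_ID"] not in sound_ids]
    if dangling:
        raise ValueError("graphemes without a matching sound: %s" % dangling)

    tables = [
        ("sounds.csv", SOUND_COLS, sounds),
        ("graphemes.csv", GRAPHEME_COLS, graphemes),
    ]
    for fname, cols, rows in tables:
        with open(outdir / fname, "w", newline="", encoding="utf-8") as handle:
            writer = csv.DictWriter(handle, fieldnames=cols)
            writer.writeheader()
            writer.writerows(rows)

    # csvw table group description; SOUND_ID links graphemes to sounds.
    metadata = {
        "@context": "http://www.w3.org/ns/csvw",
        "tables": [
            {
                "url": "sounds.csv",
                "tableSchema": {
                    "columns": [{"name": col} for col in SOUND_COLS],
                    "primaryKey": "ID",
                },
            },
            {
                "url": "graphemes.csv",
                "tableSchema": {
                    "columns": [{"name": col} for col in GRAPHEME_COLS],
                    "primaryKey": "ID",
                    "foreignKeys": [
                        {
                            "columnReference": "SOUND_ID",
                            "reference": {
                                "resource": "sounds.csv",
                                "columnReference": "ID",
                            },
                        }
                    ],
                },
            },
        ],
    }
    (outdir / "metadata.json").write_text(
        json.dumps(metadata, indent=2), encoding="utf-8"
    )
    return metadata


# Illustrative sample rows, not real CLTS data.
sounds = [{"ID": "s1", "NAME": "unrounded open front vowel"}]
graphemes = [{"ID": "g1", "DATASET": "td1", "GRAPHEME": "a", "SOUND_ID": "s1"}]

with tempfile.TemporaryDirectory() as tmp:
    dump(sounds, graphemes, tmp)
    print(sorted(path.name for path in Path(tmp).iterdir()))
```

Running the consistency check before anything is written means a broken link aborts the dump instead of leaving half-written tables behind.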

LinguList commented 6 years ago

Okay, thanks to the experiments with clld, I think I see the model a bit more clearly now:

All in all, this seems to be the best way to proceed. There will be quite a lot of values/ entries, as we have some 30 000 unique values for all data types together, but this view is not the most important one.

The code needs to be adjusted accordingly (the __main__.py.dump-command currently separates the values into three classes).

LinguList commented 6 years ago

We're advancing on this, so we can close the issue for the time being.