[Web Representation] Rendering CLTS in CLLD

LinguList commented 6 years ago

I spent some time today trying to understand how CLLD works, and I succeeded at rendering some basic aspects. But I realized that at some point it will simply be faster to learn directly from @xrotwang how to do it, and that we need to decide on the basic underlying model, as well as on the different views we want to have on the data.

Here's a really rough account of my current thinking:

- we want clts/parameters to follow our bipa[sound].name schema, so that people can link to the resource
- currently, BIPA lists some 700 sounds explicitly, but we can use it to generate the more than 3500 different sounds we encounter in TranscriptionData, so each release should provide a table with all agglomerated sounds, interpreted correctly (according to our evaluation principles) with BIPA
- all additional transcription systems should also try to render the master set of sounds, where possible
- we currently list a rather limited set of metadata, which could be expanded in the future:
  - grapheme (almost all transcription data, TD, have it)
  - frequency (we could add this from phoible and other TD as well)
  - url (most TD have it, but not all)
  - sound (only one TD has it, that is, the wave/mp3 file for illustration)
  - image (again only one, but maybe useful for illustration, where we have it)
  - features (very useful, and also a major asset for future expansion by adding feature sets of authorities or whomever)
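
To make the metadata list above concrete, here is a minimal sketch of a per-grapheme record. The class name `GraphemeRecord`, its field names, and the `available` helper are assumptions for illustration only and are not part of the clts code; the example values are made up.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class GraphemeRecord:
    """Hypothetical record for one grapheme in one dataset."""
    grapheme: str
    name: str  # the sound name that serves as identifier
    frequency: Optional[int] = None  # only some datasets provide this
    url: Optional[str] = None
    sound: Optional[str] = None  # path to a wave/mp3 file
    image: Optional[str] = None
    features: dict = field(default_factory=dict)

    def available(self):
        """Return the names of the optional fields that are filled."""
        optional = ("frequency", "url", "sound", "image", "features")
        return [k for k in optional if getattr(self, k) not in (None, {})]


# Illustrative values only.
record = GraphemeRecord("t", "voiceless alveolar stop consonant", frequency=10)
print(record.available())
```

Keeping every field except grapheme and name optional reflects that most datasets only supply a subset of the metadata.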

The question is: how to proceed? This would also be amenable to csvw: we could create CSV files with JSON metadata describing the type of dataset, which would ease deployment in CLLD. The parameters view should be minimal, I guess, showing only:

- grapheme (in BIPA)
- name (which is the identifier)
- representation (in terms of the TS and TD which have the respective sound)
- aliases (for BIPA, this has to be generated as a comma-separated list; it could perhaps also be assembled across all datasets, in which case it would include the non-alias grapheme itself and link to all datasets which have this specific grapheme but might give it another interpretation; I don't know how easy this is to implement in CLLD)

I would call this view the "Sounds" or "Sound Segments".
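
As a rough illustration of the csvw idea for this view, the following sketch builds a CSVW-style metadata description for a sounds table. The file name `sounds.csv`, the function name `sounds_metadata`, and the column names are assumptions taken from the list above, not an agreed format.

```python
import json


def sounds_metadata(csv_name="sounds.csv"):
    """Build a CSVW-style metadata dict for a (hypothetical) sounds table."""
    return {
        "@context": "http://www.w3.org/ns/csvw",
        "url": csv_name,
        "tableSchema": {
            "columns": [
                {"name": "grapheme", "datatype": "string"},
                {"name": "name", "datatype": "string"},
                # list-valued cells: dataset identifiers, comma-separated
                {"name": "representation", "datatype": "string", "separator": ","},
                # list-valued cells: alias graphemes, comma-separated
                {"name": "aliases", "datatype": "string", "separator": ","},
            ],
            # the sound name is the identifier of each row
            "primaryKey": "name",
        },
    }


print(json.dumps(sounds_metadata(), indent=2))
```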

Then we could list all "Graphemes" as values (this is licensed by the linguistic definition of grapheme, which allows a grapheme to contain multiple letters, see here), where we show:

- the grapheme
- the name (i.e., the clts-identifier)
- the contribution (i.e., the dataset where it occurs)
- the dataset-type (i.e., TS or TD, which is a crucial distinction)
- whether it's an alias or not

Note that the mapping between "grapheme" and "name" per dataset is n-to-1: each grapheme has exactly one name, but the same name may have several graphemes in a given dataset.
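
The n-to-1 constraint can be checked mechanically. Below is a minimal sketch, assuming rows are given as (dataset, grapheme, name) triples; the function name `check_grapheme_names` and the sample rows are illustrative and not taken from the clts code or data.

```python
from collections import defaultdict


def check_grapheme_names(rows):
    """Return the (dataset, grapheme) pairs that map to more than one name."""
    names = defaultdict(set)
    for dataset, grapheme, name in rows:
        names[(dataset, grapheme)].add(name)
    return {key: sorted(vals) for key, vals in names.items() if len(vals) > 1}


# Illustrative rows: two graphemes sharing one name is allowed (n-to-1).
rows = [
    ("demo", "t", "voiceless alveolar stop consonant"),
    ("demo", "ts", "voiceless alveolar sibilant affricate consonant"),
    ("demo", "ʦ", "voiceless alveolar sibilant affricate consonant"),
]
print(check_grapheme_names(rows))
```

An empty result means the constraint holds; any entry in the result names a grapheme that was given two different interpretations within one dataset.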

Then we would have the contributions, similar to Concepticon, where we list which sources helped to create a given TS or TD. Here, I can think of the following items:

- identifier (i.e., "bipa", "phoible")
- description (as I started to give here)
- references (bibtex-keys)
- number of graphemes in the original data (including aliases)

The value for a given Grapheme in a given dataset would list the additional data, like frequency of representation, URL, features, etc.

LinguList commented 6 years ago

The easiest way to produce this data seems to me to be adding a dump() command to the commands of clts, which would create the relevant datatables in CSV format (ideally with proper csvw metadata). To keep the links, we would further need to add identifiers among the values, so:

- Sound Segments (or Sounds) links to the values with all reflexes for a given name
- Graphemes links to the Sounds via the name and to the datasets via the contribution
- Contributions links to the references

I think I can easily start to draft the dump() command so that it keeps these inter-relationships in the data and also tests for consistency.
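
To illustrate what such a command might do, here is a minimal sketch that checks the links between the three tables and then serializes them as CSV text. This is not the actual clts command: the function signature, the column names ("id", "name", "contribution"), and the sample rows are all assumptions.

```python
import csv
import io


def dump(sounds, graphemes, contributions):
    """Check links between the tables, then return them as CSV strings."""
    sound_names = {s["name"] for s in sounds}
    contribution_ids = {c["id"] for c in contributions}
    for g in graphemes:
        # every grapheme must point to a known sound and a known dataset
        if g["name"] not in sound_names:
            raise ValueError("unknown sound name: {0}".format(g["name"]))
        if g["contribution"] not in contribution_ids:
            raise ValueError("unknown contribution: {0}".format(g["contribution"]))
    tables = {}
    for label, rows in (
        ("sounds", sounds),
        ("graphemes", graphemes),
        ("contributions", contributions),
    ):
        buffer = io.StringIO()
        fieldnames = list(rows[0]) if rows else []
        writer = csv.DictWriter(buffer, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
        tables[label] = buffer.getvalue()
    return tables


# Illustrative rows only.
sounds = [{"name": "voiceless alveolar stop consonant", "grapheme": "t"}]
graphemes = [{
    "id": "demo-1",
    "grapheme": "t",
    "name": "voiceless alveolar stop consonant",
    "contribution": "demo",
}]
contributions = [{"id": "demo", "references": "hypothetical-bibtex-key"}]
print(dump(sounds, graphemes, contributions)["graphemes"])
```

Raising on a dangling link before anything is written keeps an inconsistent release from being produced in the first place.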

LinguList commented 6 years ago

Okay, thanks to the experiments with clld, I think I see the model a bit more clearly now:

- parameters are our sound segments
- values are sound classes, transcription data, and transcription systems; an additional "data type" column in the datatable representation shows which kind of data each value belongs to
  - sound classes have only one grapheme as "answer" to the parameter
  - transcription data potentially has more; by default, we show only the frequency, and maybe a link to the dataset
  - transcription systems also have only one value
- contributions are the different datasets; a contribution has a type (sound class system, transcription data, transcription system) and n>1 sources
- sources are the sources
- contributors should be renamed to "authors" or "editors"

All in all, this seems to be the best way to proceed. The values/ view will contain quite a few entries, as we have some 30 000 unique values for all data types together, but this view is not the most important one.

The code needs to be adjusted accordingly (the dump command in __main__.py currently separates the values into three classes).
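
The adjustment could look roughly like the following sketch, which merges three separate value lists into a single table with a data type column. The function name `merge_values`, the `data_type` key, and the sample rows are assumptions and do not reflect the current code in __main__.py.

```python
def merge_values(sound_classes, transcription_data, transcription_systems):
    """Merge three value lists into one, tagging each row with its data type."""
    merged = []
    for data_type, rows in (
        ("sound class system", sound_classes),
        ("transcription data", transcription_data),
        ("transcription system", transcription_systems),
    ):
        for row in rows:
            # copy the row so the input lists are left untouched
            merged.append({**row, "data_type": data_type})
    return merged


# Illustrative rows only.
values = merge_values(
    [{"grapheme": "T", "name": "voiceless alveolar stop consonant"}],
    [{"grapheme": "t", "name": "voiceless alveolar stop consonant", "frequency": 10}],
    [{"grapheme": "t", "name": "voiceless alveolar stop consonant"}],
)
print([v["data_type"] for v in values])
```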

LinguList commented 6 years ago

We're advancing on this, so we can close the issue for the time being.

cldf-clts / clts-legacy

[Web Representation] Rendering CLTS in CLLD #85