cldf-clts / clts

Cross-Linguistic Transcription Systems
https://clts.clld.org
Add ID column to sounds.tsv #101

Closed xrotwang closed 3 years ago

xrotwang commented 3 years ago

Even though it is basically redundant, we should add an ID column to sounds.tsv, using the NAME with whitespace replaced by underscores. That's the ID we use for sounds in the web app, e.g. mid-long_unrounded_open_back_with_rising_tone_vowel, and that's what we'll ask people to use as CLTS_ID in CLDF datasets, see https://github.com/cldf/cldf/issues/92#issuecomment-819499466. So there shouldn't be any ambiguity here.
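The derivation proposed above could be sketched as follows. This is only an illustration: the helper name `name_to_id` is hypothetical and not part of pyclts, and collapsing runs of whitespace into a single underscore is an assumption, not something decided in this thread.

```python
def name_to_id(name: str) -> str:
    """Hypothetical helper: derive an ID from a sound NAME.

    Assumption: leading/trailing whitespace is dropped and every run of
    whitespace becomes exactly one underscore.
    """
    return "_".join(name.split())


if __name__ == "__main__":
    # Illustrative sound name written with spaces.
    print(name_to_id("voiced bilabial plosive consonant"))
```

Because the mapping only touches whitespace, hyphens and underscores already present in a name pass through unchanged.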

Maybe we should even turn https://github.com/cldf-clts/clts/tree/master/data into a CLDF dataset, which would make the relation of the data files clearer, e.g. integrating the references, allow for simpler consistency checks, and even serve to document the data using the CLDF metadata -> markdown functionality.

xrotwang commented 3 years ago

@LinguList What do you think? And if yes, release this still as 2.0.1 or rather 2.1.0?

LinguList commented 3 years ago

2.1.0 seems better then, right? It would probably also require an update to pyclts. And yes: an ID derived from the characters seems fine. I think we do NOT use _ so far in any of our features? So _ would be equivalent to whitespace. I think we modified this for CLTS 2.0; before that, we had an occasional _ in our feature values. But this may also mean that we should add a test to make sure no new feature values contain an underscore? Or could that be handled via CSVW/CLDF?

xrotwang commented 3 years ago

Hm:

$ csvcut -c VALUE -t data/features.tsv | grep "_"
with_downstep
with_extra-high_tone
with_extra-low_tone
with_falling_tone
with_global_fall
with_global_rise
with_high_tone
with_low_tone
with_mid_tone
with_rising_tone
with_upstep
LinguList commented 3 years ago

Okay, we NORMALIZED the usage of underscore and dash, we did not abandon it. But this shows that we need to discuss how to make the identifiers in CLTS online then, right?

xrotwang commented 3 years ago

The naming of feature values seems somewhat inconsistent. There's with- as well as with_. Considering that we already regard sound names as IDs, I'd say we leave them as is - and just make sure feature values are prefix-free.
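A check for the prefix-free property could look roughly like this. It is a sketch under assumptions: the function name `prefix_conflicts` is made up for illustration, and the sample values in the usage lines are hypothetical rather than taken from the CLTS data.

```python
def prefix_conflicts(values):
    """Return (shorter, longer) pairs where one value is a proper prefix of another."""
    unique = sorted(set(values))
    return [(a, b) for a in unique for b in unique if a != b and b.startswith(a)]


if __name__ == "__main__":
    # Hypothetical values: "nasal" is a proper prefix of "nasalized".
    print(prefix_conflicts(["nasal", "nasalized", "long"]))
    # Values quoted in this thread do not conflict with each other.
    print(prefix_conflicts(["with-high_tone", "with-low_tone", "with-upstep"]))
```

An empty result means no feature value is a proper prefix of another one.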

LinguList commented 3 years ago

Wait, which version do you use? This looks like CLTS 1.x, as we have made this more consistent in 2.0! All with-features are with- now.

xrotwang commented 3 years ago

Hm. I'm looking at data/features.tsv in 2.0

LinguList commented 3 years ago

see here:

 "tone": [
   "with-downstep",
   "with-extra-high_tone",
   "with-extra-low_tone",
   "with-falling_tone",
   "with-global_fall",
   "with-global_rise",
   "with-high_tone",
   "with-low_tone",
   "with-mid_tone",
   "with-rising_tone",
   "with-upstep"
 ],
xrotwang commented 3 years ago

where's this from?

LinguList commented 3 years ago

clts/pkg/transcriptionsystems/features.json

xrotwang commented 3 years ago

hm. ok. Then it seems that the derived formats (pkg, data) are out of sync.
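One way to spot this kind of drift is to compare the values in both files. The sketch below works on in-memory text so it stays self-contained; the function names are hypothetical, the layout of features.json is only assumed to be nested dicts and lists of strings, and the VALUE column name is the one used in the csvcut call earlier in this thread.

```python
import csv
import io
import json


def iter_strings(obj):
    """Yield every string value nested anywhere inside dicts and lists."""
    if isinstance(obj, dict):
        for value in obj.values():
            yield from iter_strings(value)
    elif isinstance(obj, list):
        for value in obj:
            yield from iter_strings(value)
    elif isinstance(obj, str):
        yield obj


def out_of_sync(json_text, tsv_text, column="VALUE"):
    """Return (values only in the JSON, values only in the TSV), both sorted."""
    json_values = set(iter_strings(json.loads(json_text)))
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    tsv_values = {row[column] for row in reader}
    return sorted(json_values - tsv_values), sorted(tsv_values - json_values)


if __name__ == "__main__":
    sample_json = '{"tone": ["with-downstep", "with-upstep"]}'
    sample_tsv = "VALUE\nwith_downstep\nwith_upstep\n"
    print(out_of_sync(sample_json, sample_tsv))
```

Two empty lists would indicate that both files list the same set of values.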

LinguList commented 3 years ago

Isn't data/features.tsv created from pyclts? If so, we did not update it, or is there some break in pyclts that no longer updates it? Essential for Python usage is the features.json.

xrotwang commented 3 years ago

Hm. It was never updated - thus should have been removed, I guess.

$ git log data/features.tsv
commit ce2b79192fc34b2be683f266dbabfb2c79a11366 (tag: v1.3)
Author: xrotwang <xrotwang@googlemail.com>
Date:   Sun Oct 20 09:53:28 2019 +0200

    seed with data from cldf/clts
xrotwang commented 3 years ago

Same goes for data/clts.json.

xrotwang commented 3 years ago

So pkg/transcriptionsystems/features.json is the place where this data is edited, right? The authoritative copy. I still always have to wrap my head around the workflow in clts. I think it's weird to have all these directories where some content is generated and some is not. E.g. data/references.bib vs. data/graphemes.tsv ...

xrotwang commented 3 years ago

So the clts make_dataset command seems obsolete, too. Superseded by clts make_pkg.

LinguList commented 3 years ago

Not really: clts make_dataset is important for building a single dataset, e.g. when you add a new one, whereas make_pkg rebuilds them all. For active work, this is useful.

The workflow is still not optimal, I agree, but we have made progress in consistency, I'd say. See specifically our clearer instructions here: https://github.com/cldf-clts/clts/blob/master/CONTRIBUTING.md

xrotwang commented 3 years ago

Ok, I see. I'll try to flesh out RELEASING.md to make it fully clear what is generated from what.

When doing this, I'd like to flesh out clts dump as well, to put a CSV version of features.json into data so that this can be turned into a CLDF dataset, amenable to cldf validate.
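Such a CSV version might be produced along these lines. This is a sketch, not the actual clts dump implementation: it assumes features.json maps sound type to feature to a list of values, the function name `features_to_tsv` is hypothetical, and the way the ID is assembled is an illustrative choice. The column names ID, TYPE, FEATURE and VALUE follow the data description posted later in this thread.

```python
import csv
import io


def features_to_tsv(features):
    """Flatten {sound type: {feature: [values]}} into TSV text with one row per value."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(["ID", "TYPE", "FEATURE", "VALUE"])
    for sound_type in sorted(features):
        for feature in sorted(features[sound_type]):
            for value in features[sound_type][feature]:
                # Assumption: an ID built from all three parts is unique.
                row_id = f"{sound_type}_{feature}_{value}"
                writer.writerow([row_id, sound_type, feature, value])
    return buf.getvalue()


if __name__ == "__main__":
    # Hypothetical excerpt; the real file may be structured differently.
    sample = {"vowel": {"tone": ["with-high_tone", "with-low_tone"]}}
    print(features_to_tsv(sample), end="")
```

Sorting the keys keeps the output stable between runs, which makes diffs of the generated file easier to read.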

xrotwang commented 3 years ago

Maybe each directory should have a README.md, which explains all files within the directory?

LinguList commented 3 years ago

Yes to the extension of dump! And regarding the README, I can provide these, but would only have time from Tuesday on, I am afraid.

xrotwang commented 3 years ago

I'll figure it out.

LinguList commented 3 years ago

Okay, I'll definitely have time to review.

xrotwang commented 3 years ago

Sometimes I really like CLDF: Just add "dc:conformsTo": "http://cldf.clld.org/v1.0/terms.rdf#Generic" to your CSVW metadata, and "dc:source" and give a column a propertyUrl of ...#source and voilà, get consistency checking of your bibtex keys in a CSV column.
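What such a consistency check amounts to can be sketched in plain Python. This is not how pycldf implements it; the function names are hypothetical, the regular expression only covers simple BibTeX entries, and the NAME and REFS column names are borrowed from the sources table described further down in this thread.

```python
import csv
import io
import re


def bibtex_keys(bibtex_text):
    """Extract citation keys from entries such as '@book{key,'."""
    return set(re.findall(r"@\w+\s*\{\s*([^,\s]+)\s*,", bibtex_text))


def dangling_refs(tsv_text, bibtex_text, column="REFS", separator=","):
    """Return (NAME, key) pairs whose key has no entry in the BibTeX text."""
    known = bibtex_keys(bibtex_text)
    missing = []
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        for key in (k.strip() for k in row[column].split(separator)):
            if key and key not in known:
                missing.append((row["NAME"], key))
    return missing


if __name__ == "__main__":
    # Made-up keys and rows, for illustration only.
    bib = "@misc{alpha2020,\n  title = {A}\n}\n"
    tsv = "NAME\tREFS\nsrc1\talpha2020\nsrc2\talpha2020,beta2021\n"
    print(dangling_refs(tsv, bib))
```

An empty list means every key referenced in the column has a matching BibTeX entry.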

xrotwang commented 3 years ago

Also nice: With the new metadata->markdown functionality, we could add a data description like this:

CLDF Metadata: cldf-metadata.json

Sources: data/references.bib

property value
dc:conformsTo CLDF Generic

Table sources/index.tsv

CLTS is compiled from information about transcriptions and how these relate to sounds from many sources, such as phoneme inventory databases like PHOIBLE or relevant typological surveys.

property value

Columns

| Name/Property | Datatype | Description |
| --- | --- | --- |
| NAME | string | Primary key |
| DESCRIPTION | string | |
| REFS | list of string (separated by ,) | References data/references.bib::BibTeX-key |
| TYPE | string | CLTS groups transcription information into three categories: Transcription systems (ts), transcription data (td) and soundclass systems (sc). |
| URITEMPLATE | string | Several CLTS sources provide an online catalog of the graphemes they describe. If this is the case, the URI template specified in this column can be expanded with relevant rows from data/graphemes.tsv to form full URIs linking to the source catalog. |

Table data/features.tsv

The feature system employed by CLTS describes sounds by assigning values for certain features (constrained by sound type). The permissible values per (feature, sound type) are listed in this table.

property value

Columns

| Name/Property | Datatype | Description |
| --- | --- | --- |
| ID | string | Primary key |
| TYPE | string | |
| FEATURE | string | |
| VALUE | string | |

...

LinguList commented 3 years ago

Yes, would be perfect, if we could make these data descriptions more transparent!

xrotwang commented 3 years ago

@LinguList could you review this https://github.com/cldf-clts/clts/blob/4206ec752d421bd8d3f71cf6900d7e36be298137/README.md ? Some things could be streamlined, I guess. E.g. since we have the URI template for graphemes easily accessible in sources/index.tsv, it would be sufficient to pass the necessary properties of graphemes through from source to pkg to still be able to create the URL later. As far as I can see, that's only used for PHOIBLE anyway, but here, the ID column present in sources is dropped in pkg.
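Creating the URL later from a template and a grapheme row might look like the following. This is a sketch: the function name `expand` is hypothetical, it only handles simple `{NAME}` variables in the style of RFC 6570 level 1, and the template and row in the usage lines are placeholders, not the actual PHOIBLE template.

```python
import re
from urllib.parse import quote


def expand(template, row):
    """Replace each '{NAME}' in the template with the percent-encoded row value."""
    def substitute(match):
        return quote(str(row[match.group(1)]), safe="")
    return re.sub(r"\{(\w+)\}", substitute, template)


if __name__ == "__main__":
    # Placeholder template and row; real templates live in sources/index.tsv.
    print(expand("https://example.org/sounds/{ID}", {"ID": "abc 1"}))
```

A variable missing from the row raises a KeyError, which makes it obvious when a property such as ID was dropped on the way from source to pkg.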

xrotwang commented 3 years ago

Last state: https://github.com/cldf-clts/clts/blob/34bab4cf92a6d9e4c6712c8c59952045434ddd85/README.md#cldf-dataset. I think that's super useful, in particular because we can also transparently add citation info. This info should be scraped from the data repos, though - e.g. from a metadata.json (or .zenodo.json - but that doesn't have the full citation?) - rather than being hard-coded in pyclts.

LinguList commented 3 years ago

Yes, I agree, this should be placed into some metadata.json. I left individual comments on the columns, explaining what they do. And I agree that this can be streamlined. It also always takes me a while to get back into the workflow, but knowing that it has improved a lot means we can improve it even further, I think.