cldf / cldf

CLDF: Cross-Linguistic Data Formats - the specification
https://cldf.clld.org
Apache License 2.0

Specify ContributionTable component #102

Closed xrotwang closed 3 years ago

xrotwang commented 3 years ago

Many datasets are aggregations of multiple contributions, e.g. WALS aggregates "1-feature-StructureDatasets" contributed by the feature authors. Since CLDF also makes it easy to merge datasets, the "aggregation" use case gets more common.

To transparently model such a data structure - and to ground further assumptions about the data (see https://github.com/cldf-datasets/phoible/issues/32) - a ContributionTable and a corresponding reference property to be used in ValueTable should be specified.
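A minimal sketch of the proposed structure, using only the Python standard library. The column names (in particular `Contribution_ID`) and all sample rows are assumptions made for illustration; they are not taken from the specification or from any real dataset.

```python
import csv
import io
from collections import defaultdict

# Hypothetical ContributionTable: one row per contribution.
CONTRIBUTIONS_CSV = """ID,Name,Contributor
c1,Feature One,Author A
c2,Feature Two,Author B
"""

# Hypothetical ValueTable: the assumed Contribution_ID column is the
# reference property pointing at a row of the ContributionTable.
VALUES_CSV = """ID,Language_ID,Parameter_ID,Value,Contribution_ID
v1,lang1,p1,large,c1
v2,lang2,p1,small,c1
v3,lang1,p2,average,c2
"""


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


contributions = {row["ID"]: row for row in read_rows(CONTRIBUTIONS_CSV)}

# Group value IDs by the contribution they reference, rejecting any
# reference that does not resolve to a row in the ContributionTable.
values_by_contribution = defaultdict(list)
for value in read_rows(VALUES_CSV):
    contribution_id = value["Contribution_ID"]
    if contribution_id not in contributions:
        raise ValueError(f"dangling contribution reference: {contribution_id}")
    values_by_contribution[contribution_id].append(value["ID"])

for contribution_id, value_ids in values_by_contribution.items():
    print(contributions[contribution_id]["Name"], value_ids)
```

With such a reference in place, the aggregation is explicit in the data: every value can be traced back to exactly one contribution.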

xrotwang commented 3 years ago

Maybe it even makes sense to have "typed" contributions - e.g. "phoneme-inventory"?

LinguList commented 3 years ago

In fact, this use case also comes in handy where a lexical dataset is initially compiled from many resources which are available in CLDF form and then combined to form a derived dataset that is further analyzed and should be published as an aggregated dataset, e.g., because data were manually edited (cognate sets and the like). So far, aggregation is usually done before converting data to CLDF, but this would allow us to be more clear if we aggregate from CLDF datasets and want to publish them as a whole.

LinguList commented 3 years ago

I was thinking quite a lot about this: what the contribution table does is that it shifts the idea of "a language has an inventory" to "a source has an inventory" or "a scholar makes an inventory". The idea is that one person provides some set of measurements about what they think is one specific language. This is like the doculect / language, or language variety vs. language, right?

And -- sorry, but this was not entirely clear to me before -- in WALS the contribution is often one single data point, done by one scholar for many languages, so not quite what would qualify as a "contribution" in the sense of a phoible inventory but rather, as you say, a one-feature dataset.

It also holds for lexical data and wordlists that one cannot easily combine two contributions into one (we have tried, but deliberately do not do this in CLICS etc., since it is risky to mix alphabets). So it seems the contribution case should also be modeled for lexical data, at least when it comes to aggregators like CLICS and IDS.

For the cases I know of, we then essentially have two situations: datasets with a contribution table, where a language variety/doculect is defined as one contribution that "reflects" a language and "has" a set of values or forms, and datasets without one (like all lexibank datasets coded so far). It would be good to be able to distinguish these cases, since I'd also have to access the two kinds of dataset differently in pycldf: dataset.objects("ContributionTable") vs. dataset.objects("LanguageTable"). Is this more or less correct?

This means, however, that we can also have lexical datasets with this model, and I ask myself how, or whether, it would be useful.

xrotwang commented 3 years ago

I think CLDF cannot do what you want - i.e. encode the "doculect" level in a general way. That rather seems to be already the first level of analysis. E.g. in the case of WALS there are many datapoints within one Contribution (i.e. for the same feature) based on multiple Sources - and it is often disputed whether these different sources actually refer to the same language/doculect. So whatever we do, determining the "doculect" level must be a conscious decision by the researcher (and probably reflected in per-dataset code) - and not something inferred from the data model (in particular from a set of tables where none is called "Doculect").

LinguList commented 3 years ago

I do not really want to define doculects; the analogy is merely there to help understand it better. What I want, analogies aside, is that wordlists from different sources for potentially identical varieties are distinguished and not aggregated, since aggregating them makes computational comparison impossible (in my experience). So would you not agree that the contribution is important for this?

xrotwang commented 3 years ago

Yes, of course, the contribution context can be an important factor for taking the decision about what constitutes the doculect level. I just wanted to highlight that this decision cannot be taken automatically.
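The point of the last two comments can be sketched in plain Python: whether forms from different contributions end up in one wordlist depends on a grouping key that the researcher chooses explicitly. All rows, identifiers, and the `Contribution_ID` column below are made-up assumptions for illustration, and `wordlists` is a hypothetical helper, not a pycldf function.

```python
from collections import defaultdict

# Hypothetical forms for one language, collected from two contributions
# that happen to use different transcription conventions.
forms = [
    {"ID": "f1", "Language_ID": "lang1", "Contribution_ID": "srcA", "Form": "hant"},
    {"ID": "f2", "Language_ID": "lang1", "Contribution_ID": "srcA", "Form": "fut"},
    {"ID": "f3", "Language_ID": "lang1", "Contribution_ID": "srcB", "Form": "hand"},
]


def wordlists(rows, key_columns):
    """Group forms into wordlists by an explicitly chosen tuple of columns."""
    grouped = defaultdict(list)
    for row in rows:
        key = tuple(row[column] for column in key_columns)
        grouped[key].append(row["Form"])
    return dict(grouped)


# Grouping by language alone merges both sources into a single wordlist.
by_language = wordlists(forms, ["Language_ID"])

# Adding the contribution to the key keeps the two sources apart.
by_contribution = wordlists(forms, ["Language_ID", "Contribution_ID"])

print(by_language)
print(by_contribution)
```

The data model only supplies the columns; which of them make up the unit of comparison is passed in as `key_columns`, so it stays a conscious per-dataset decision.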

xrotwang commented 3 years ago

@LinguList @bambooforest @SimonGreenhill @chrzyki could you review this: https://github.com/cldf/cldf/pull/105/commits/4764664672b3f085b2efd8a5d55f0cd68442fdb5