cldf / pycldf

python package to read and write CLDF datasets
https://cldf.clld.org
Apache License 2.0

improper warning about multivalued codeReference column #164

Closed nataliacp closed 1 year ago

nataliacp commented 1 year ago

We have declared a separator (;) for the codeReference column in values.csv, but we are getting a warning when running cldf validate. The warning is:

WARNING http://cldf.clld.org/v1.0/terms.rdf#ValueTable http://cldf.clld.org/v1.0/terms.rdf#codeReference must be singlevalued

Is this expected behavior? Thanks!

xrotwang commented 1 year ago

It's expected by pycldf: https://github.com/cldf/pycldf/blob/afc2cb0887fdb9f31b1c5c32c2fa7ad7b28ddab4/src/pycldf/components/ValueTable-metadata.json#L40-L46

However, I realize that this constraint is not specified in the ontology. So at this point, I'd say it's an omission in the spec.

Conceptually, though, what would it mean for one datapoint to fall into multiple bins? If the idea is that different constructs of one language fall into multiple bins, I'd say multiple ValueTable rows would convey this more clearly.

xrotwang commented 1 year ago

Btw. "multiple Values for the same (language, parameter) pair" is the solution chosen by APiCS and it's supported by tools like cldfviz map
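
A minimal sketch of what "multiple Values for the same (language, parameter) pair" could look like when starting from a list-valued column, using only the Python standard library. The column names (ID, Language_ID, Parameter_ID, Code_ID), the sample codes, and the "-1", "-2" suffix scheme for new row IDs are assumptions for illustration; none of this is part of pycldf.

import csv
import io


def split_multivalued(rows, column="Code_ID", separator=";"):
    """Yield one row per code for rows whose `column` holds a separated list."""
    for row in rows:
        codes = [c.strip() for c in row[column].split(separator) if c.strip()]
        if len(codes) <= 1:
            # Single-valued (or empty) rows pass through unchanged.
            yield dict(row)
            continue
        for i, code in enumerate(codes, start=1):
            new_row = dict(row)
            # Row IDs must stay unique, so append a running suffix (assumed scheme).
            new_row["ID"] = f"{row['ID']}-{i}"
            new_row[column] = code
            yield new_row


# Hypothetical excerpt of a values.csv with a ';'-separated Code_ID column.
sample = """ID,Language_ID,Parameter_ID,Code_ID
v1,lang1,plural,plural-noun;plural-adj
v2,lang2,plural,plural-noun
"""

split_rows = list(split_multivalued(csv.DictReader(io.StringIO(sample))))
for row in split_rows:
    print(row["ID"], row["Language_ID"], row["Parameter_ID"], row["Code_ID"])

After a split like this, the separator declaration on the codeReference column would no longer be needed.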

nataliacp commented 1 year ago

Thanks for the quick reply. These are variables that have a list of states, such as which parts of speech receive a particular plural suffix. We had adopted the separator solution, as per our previous discussion (https://github.com/cldf/cldf/issues/109#issuecomment-831903639). But I understand the logic of your other proposal too. I will review all such cases in our dataset and report back if anything turns out to be a conceptually different case.

xrotwang commented 1 year ago

Overall, I think list-valued foreign keys are often problematic, because they make it impossible to attach more data to the relation. E.g. in the APiCS case, values for a particular code also carry a Frequency property. In the list-valued codeReference case there would be no place for this - other than adding another list-valued property, and the implicit assumption that order in both lists is significant and relates Frequency and Code ...
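
To illustrate the fragility of that implicit ordering, here is a small sketch. The column names and the codes ("often", "rarely", "never") are hypothetical and only serve the illustration.

# Hypothetical row with two list-valued columns that must stay aligned.
list_row = {"Code_ID": "often;rarely;never", "Frequency": "70;30"}

codes = list_row["Code_ID"].split(";")
freqs = [float(f) for f in list_row["Frequency"].split(";")]

# zip() pairs items by position and silently stops at the shorter list,
# so the third code ends up without a frequency and no error is raised.
paired = list(zip(codes, freqs))

# With one row per value, each frequency sits next to the code it describes.
value_rows = [
    {"Code_ID": "often", "Frequency": 70.0},
    {"Code_ID": "rarely", "Frequency": 30.0},
]

With separate rows there is nothing to keep in sync, and further properties can be added as ordinary columns.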

nataliacp commented 1 year ago

I understand. I had a look at the APiCS maps and they look great for our purposes too! So we are going to split the multi-values into different rows. And now I have a follow-up question: is the frequency for the displayed pies calculated automatically by clld (when one wants an equal split), or is it specified in values.csv? E.g. in https://apics-online.info/parameters/43#2/30.3/10.0 there are equally split pies as well as pies that reflect real frequencies.

xrotwang commented 1 year ago

Pie slices are computed from frequency values. See https://clldutils.readthedocs.io/en/latest/svg.html#clldutils.svg.pie e.g.:

>>> from clldutils import svg
>>> print(svg.pie([20, 80], ['#aa0000', '#00aa00']))
<svg  xmlns="http://www.w3.org/2000/svg"
      xmlns:xlink="http://www.w3.org/1999/xlink" height="34" width="34">
  <path d="M17.0,17.0 L1.0,17.0 A16.0,16.0 0 0,1 12.1 1.8 L17.0,17.0" style="fill:#AA0000;stroke:none;" transform="rotate(90 17.0 17.0)"></path><path d="M17.0,17.0 L12.1,1.8 A16.0,16.0 0 1,1 1.0 17.0 L17.0,17.0" style="fill:#00AA00;stroke:none;" transform="rotate(90 17.0 17.0)"></path>
</svg>

xrotwang commented 1 year ago

If you wanted to use number of values per code as weight, you'd just use 1 as frequency for each value.
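
As a rough sketch of the arithmetic (not of how clld implements it): summing the frequencies per code and normalizing gives the share of each slice. The function name slice_shares and the (code, frequency) input shape are assumptions made for this example.

from collections import Counter


def slice_shares(values):
    """Map each code to its share of the pie in percent.

    `values` is an iterable of (code, frequency) pairs, one per value row.
    """
    totals = Counter()
    for code, freq in values:
        totals[code] += freq
    whole = sum(totals.values())
    if whole == 0:
        raise ValueError("frequencies must not sum to zero")
    return {code: 100 * weight / whole for code, weight in totals.items()}


# Frequency 1 for every value: two codes share the pie equally.
equal = slice_shares([("a", 1), ("b", 1)])

# Explicit frequencies: the slices reflect the given weights.
weighted = slice_shares([("a", 20), ("b", 80)])

print(equal)
print(weighted)

With a frequency of 1 on every row, a code that appears in two value rows would get twice the weight of a code that appears once.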

nataliacp commented 1 year ago

Thank you very much, Robert! This is very helpful. We will try to make it work like APiCS.