cldf / cltoolkit

Toolkit for Processing Cross-Linguistic Data in CLDF
MIT License
3 stars 0 forks source link

Checking arguments of Features which make use of Concepticon and CLTS #21

Closed LinguList closed 3 years ago

LinguList commented 3 years ago

A danger in the current functions in cltoolkit.features.phonology and cltoolkit.features.lexicon is that we do not really check if the terms used there (I mean Concepticon Glosses or Feature descriptions) really match the actual data of an actual version in the Concepticon.

In order to make sure that new versions of the Concepticon do not render the respective cltoolkit features useless, it seems important to test the functions also in this regard.

What should be relatively easy to do is to check for the arguments in the function of a feature in a feature collection. Thus, FeatureCollection["ArmAndHand"].function.keywords["alist"] has Concepticon glosses which should all occur in Concepticon. Since all lexical features are currently constructed in this way, testing them, also one-by-one is not difficult.

For phonological features, it is probably best to make sure in individual test that the descriptions of the sounds used are not changing so far.

LinguList commented 3 years ago

This check works as far as I can tell, I already found a bad gloss.

xrotwang commented 3 years ago

I think the way to go about it would be to maintain a list of compatible catalog versions. Ideally, the tool would check whether each dataset was compiled against a compatible version before using it. While this seems a bit complicated, it's exactly the kind of functionality we wanted to make possible with the extended metadata in lexibank datasets.

LinguList commented 3 years ago

I would not know how to implement, but am definitely not against this.

LinguList commented 3 years ago

Two things I realized when just adding this simple test to tests/test_features_collection.py:

def test_concepticon(concepticon):
    valid_glosses = set([c.gloss for c in concepticon.conceptsets.values()])
    c = FeatureCollection.from_metadata(pkg_path / 'features' / 'features.json')
    for feature in c.features:
        if feature.module.endswith("lexicon"):
            for key, value in feature.function.keywords.items():
                if key in ["alist", "blist", "ablist"]:
                    for concept in value:
                        if concept not in valid_glosses:
                            raise ValueError("gloss {0} not in concepticon".format(concept))

First, it is important to have these things automated already during development, as I use Concepticon Glosses in alist, blist, and ablist in the partial functions which are best checked automatically against the current version of Concepticon for which the current cltoolkit version should apply. So this test is already useful.

Second, since I had to add a dummy concepticon to the test suite, and we have a dummy CLTS there as well, it means (and this does not only related to this package), that any update of CLTS and Concepticon would require that the respective repos-dummy is also updated, right? Obviously, we can write this into the RELEASING instructions, but if what you mean by checking against compatible catalog versions makes this more principled, this would be very useful.

xrotwang commented 3 years ago

Yes, the dummy catalogs in the test suite are basically the specification of the interface expected by the package.

xrotwang commented 3 years ago

Regarding implementation options: I'd lean towards (slightly ab-)using python extra requirements: Here's what these look like from within the package:

>>> import pkg_resources
>>> pkg = pkg_resources.get_distribution('cltoolkit')
>>> for r in pkg.requires(['dev']):
...     print(r.name, r.specs)
... 
attrs [('>=', '18.2')]
cldfbench [('>=', '1.2.3')]
cldfcatalog [('>=', '1.3')]
clldutils [('>=', '3.5')]
csvw [('>=', '1.6')]
lingpy [('>=', '2.6.5')]
pycldf []
pyclts [('>=', '3.1')]
pylexibank []
uritemplate []
flake8 []
twine []
wheel []

We could specify:

extras_require = {
    ...
    'cldf-catalogs': [
        'concepticon==2.5.0, ==2.6.0',
    ]
}

(Although I'm not sure we'd want to support all possible requirement specifier formats.)

LinguList commented 3 years ago

If we specify this with a given distribution, could we in that case also switch from explicit passing of a transcription system to fetching this transcription system from CLTS?

LinguList commented 3 years ago

But on the other hand: forcing users to one transcription system or one concepticon version is probably also too restrictive. After all, where code FAILS is in the features, not anywhere else. So the features rely on stable versions that are covered by a certain release. But loading a wordlist and doing something with it does not require this. Features could even be defined by users themselves.

xrotwang commented 3 years ago

I wouldn't check these things at package installation - as may be implied by putting this in setup.py. Instead, checking for compatible data could be done per feature. We wouldn't want to have features which rely on different catalog versions in the same package, right?

LinguList commented 3 years ago

No. Definitely not. And I just realized: "extras_require" is of course also permissive of using cltoolkit without extras.

xrotwang commented 3 years ago

I only wanted to reuse the semantics of setup's extra requirements to specify this info. Where and how this info is used is a different question.

LinguList commented 3 years ago

Ah okay, sorry for reading this literally.

xrotwang commented 3 years ago

Oh yes, that's the "abuse" part: This would be an extra that can't be installed. So the potential downside is that users trying pip install cltoolkit[cldf-catalogs] will run into trouble.

LinguList commented 3 years ago

My plan now would be:

  1. make the separation in the two wordlists #19, so we have some basis here
  2. make tests for features (I added some new features for sound symbolism, but we are close to be able to close this chapter then)

I'd put this in a PR, along with basic tests that make sure that CLTS and Concepticon glosses do not have obvious problems.

If you have time, should I then hand over to you after my PR, to look into what we just discussed?

xrotwang commented 3 years ago

ok