cldf / pycldf

python package to read and write CLDF datasets
https://cldf.clld.org
Apache License 2.0

Check command #109

Closed xrotwang closed 4 years ago

xrotwang commented 4 years ago

Basic CLDF checks, e.g. whether languoids in Glottolog's Bookkeeping family are linked to from a dataset, should be available as a (pluggable?) cldf check command.

See also https://github.com/lexibank/pylexibank/issues/182
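A "pluggable" check command could be sketched with Python entry points. This is only an illustration of the mechanism, not the implementation chosen for pycldf: the entry-point group name "cldf.checks" and the check signature (a callable yielding warning strings for a dataset) are assumptions.

```python
from importlib.metadata import entry_points


def iter_checks(group="cldf.checks"):
    """Yield (name, callable) pairs for all registered check plugins.

    The group name is hypothetical; third-party packages would register
    their checks under it in their package metadata.
    """
    try:
        eps = entry_points(group=group)  # Python 3.10+
    except TypeError:
        eps = entry_points().get(group, [])  # Python 3.8/3.9 fallback
    for ep in eps:
        yield ep.name, ep.load()


def run_checks(dataset):
    """Run every discovered check and print its warnings."""
    for name, check in iter_checks():
        for warning in check(dataset):
            print("{}: {}".format(name, warning))
```

With this layout, pylexibank could register its lexibank-specific checks under the same group, and a single cldf check invocation would pick them up when the package is installed.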

SimonGreenhill commented 4 years ago

Maybe we should collapse the various “check_x” functions into just one pluggable “check” command

xrotwang commented 4 years ago

I'm trying to get a clearer idea of what should go where right now. The main principles I see so far:

xrotwang commented 4 years ago

So rather than check_languages and check_concepts in pylexibank, we might want to have (and run in buildbot):

xrotwang commented 4 years ago

I'm not sure what to do with the "column without values" check. While this would be checkable for any CLDF dataset, it may only make sense for lexibank, where such columns would indicate a buggy makecldf implementation. For arbitrary CLDF datasets, such columns may rather indicate use of default component metadata, with only the available fields filled in.
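The check itself is simple either way. A minimal sketch, operating on a single CSV table with the standard library rather than going through pycldf:

```python
import csv


def empty_columns(path):
    """Return the names of columns that contain no values at all."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        candidates = set(reader.fieldnames or [])
        for row in reader:
            # Drop every column that has a value in this row.
            candidates -= {k for k, v in row.items() if v not in (None, "")}
            if not candidates:
                break  # every column has at least one value
    return sorted(candidates)
```

Whether the result is reported as an error, a warning, or not at all could then be decided per level (cldf vs. lexibank).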

xrotwang commented 4 years ago

It seems like a good idea to consolidate checks now! I just discovered that the Bookkeeping check already happens in cldfbench check (which also runs cldf validate and pytest ...). So to avoid burning cycles on the buildbot unnecessarily, we must consolidate!

xrotwang commented 4 years ago

So here's my proposal:

SimonGreenhill commented 4 years ago

Yes, I think consolidating and moving as many to the 'lower' levels as we can makes sense (cldf <- cldfbench <- lexibank).

And some of these might make more sense in the validate() command too? (e.g. surely citation and title should not be empty in valid CLDF?)

Can we make it discoverable, so that whichever check command is used will run all the checks? (i.e. so we don't need to run all three check commands, but lexibank check runs cldf and cldfbench too)

As for 'empty columns', I think that can go or be used to check specific fields only? I can see lots of valid reasons to have empty columns and warnings here would just add noise.

Finally, do we want to do internal checks (e.g. outdated pylexibank? outdated glottolog?) or just rely on buildbot updating everything? (what happens if setup.py pins pylexibank at v0.9 or something? how do we guard against that?)

SimonGreenhill commented 4 years ago

Do we need to construct a 'broken' cldf for testing purposes?
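One way to get a "broken" dataset for tests, without depending on any helper API, is to write a small table by hand in a test fixture. The sketch below produces a LanguageTable-like CSV with an out-of-range latitude and an entirely empty Macroarea column; the file and column names are illustrative, not a complete, spec-conformant CLDF dataset:

```python
import csv
import pathlib


def make_broken_dataset(directory):
    """Write a deliberately broken languages table for exercising checks.

    Contains out-of-range latitudes and an entirely empty Macroarea column.
    """
    d = pathlib.Path(directory)
    rows = [
        {"ID": "l1", "Name": "Lang One", "Latitude": "100.0", "Macroarea": ""},
        {"ID": "l2", "Name": "Lang Two", "Latitude": "-91.5", "Macroarea": ""},
    ]
    path = d / "languages.csv"
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(
            f, fieldnames=["ID", "Name", "Latitude", "Macroarea"])
        writer.writeheader()
        writer.writerows(rows)
    return path
```

In a pytest suite this would live in a fixture writing to tmp_path, and each check would be asserted to flag exactly the defects planted here.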

xrotwang commented 4 years ago

I don't really like the idea of lexibank check running all the lower-level checks as well. This seems to go against "explicit is better than implicit". Imagining buildbot running on all sorts of datasets, I think the following is most transparent:

addStep('cldf check')
if self.cldfbench_curated:
    addStep('cldfbench check')
if self.org == 'lexibank':
    addStep('cldfbench lexibank.check')

xrotwang commented 4 years ago

Regarding checking metadata in cldf validate: While I agree that citation and title should be part of the metadata of a dataset, the CLDF spec does not mandate it - and validate should only check spec conformance, I think. So maybe it should be part of cldf check. But then, whether metadata should be part of the dataset - rather than distributed together with the dataset (think of the metadata provided by Zenodo for archived datasets) - is not entirely clear to me. (It certainly is a chicken-and-egg problem to include the DOI assigned to a released dataset in the payload of the release ...)

So, I'd say metadata checks should be limited to cldfbench-curated datasets, where we know that there's a metadata.json in the repository.
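Such a metadata check could look like the sketch below. The field names "title" and "citation" in a cldfbench-curated metadata.json are assumptions here, not a guarantee of the actual schema:

```python
import json
import pathlib

# Assumed field names; the real metadata.json layout may differ.
REQUIRED_KEYS = ["title", "citation"]


def check_metadata(path):
    """Return a list of warnings for missing or empty metadata fields."""
    md = json.loads(pathlib.Path(path).read_text(encoding="utf-8"))
    return [
        "metadata field '{}' is missing or empty".format(key)
        for key in REQUIRED_KEYS
        if not md.get(key)
    ]
```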

SimonGreenhill commented 4 years ago

Ok to both! 👍

xrotwang commented 4 years ago

Regarding internal checks and "buildbot updating everything": I think we're not yet at the point where buildbot would actually update stuff, i.e. push results of makecldf to GitHub - and possibly even trigger a release. So right now, releases are still a manual thing - and buildbot only makes sure that we know in advance that the release will work (also on a different machine) and be valid. As long as this is the case, the guard against outdated pylexibank etc. is the manual process in a maintainer-local environment.

xrotwang commented 4 years ago

I fleshed out cldf check to the extent possible (I think):

$ cldf check -h
usage: cldf check [-h] [--glottolog GLOTTOLOG]
                  [--glottolog-version GLOTTOLOG_VERSION]
                  [--concepticon CONCEPTICON]
                  [--concepticon-version CONCEPTICON_VERSION]
                  [--iso-codes ISO_CODES]
                  DATASET

Check the content of a valid CLDF dataset
- Is any of the tables declared by the dataset empty?

Column checks:

- LanguageTable
  - iso639P3code: Is the ISO code valid? (requires "--iso-codes")
  - latitude: Is the latitude between -90 and 90?
  - glottocode: Is the Glottocode valid - is it in Bookkeeping? (requires "--glottolog")
  - longitude: Is the longitude between -180 and 180?
  - macroarea: Is the macroarea valid according to Glottolog? (requires "--glottolog")

- ParameterTable
  - concepticonReference: Is the concept set ID valid? (requires "--concepticon")

...

  --iso-codes ISO_CODES
                        Check ISO codes against an ISO 639-3 code table. Pass
                        a file name pointing to a download of
                        https://iso639-3.sil.org/sites/iso639-3/files/downloads/iso-639-3.tab
                        or "y" to download the code table on the fly.
                        (default: n)
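The column checks listed in the help text are mostly range and membership tests. A minimal sketch of the coordinate and ISO-code checks, assuming the ISO 639-3 code sits in the first tab-separated column of iso-639-3.tab (which matches the published format):

```python
import re


def check_latitude(value):
    """Latitude must fall within [-90, 90]."""
    return -90 <= float(value) <= 90


def check_longitude(value):
    """Longitude must fall within [-180, 180]."""
    return -180 <= float(value) <= 180


def read_iso_codes(lines):
    """Collect codes from the first column of the tab-separated
    ISO 639-3 table; the first line is the header."""
    return {line.split("\t")[0] for line in lines[1:] if line.strip()}


def check_iso_code(code, valid_codes):
    """An ISO 639-3 code is three lowercase letters and must be
    present in the downloaded code table."""
    return bool(re.fullmatch(r"[a-z]{3}", code)) and code in valid_codes
```

The Glottocode and Concepticon checks follow the same membership pattern, but against the respective reference catalogs passed via --glottolog and --concepticon.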