MUST/SHOULD in CLDF specs = ERROR/WARNING in validate?

Anaphory commented 2 years ago

Currently, cldf validate tests the ‘MUST’ elements of the CLDF specs as well as we manage, right? And it gives a warning and exits with code 1 if there are any complaints.

There are ‘SHOULD’ items in the specs which are not tested. The specific one that brought me here now is the shape of identifiers. Programming with IDs, primary keys, and foreign key targets, I have seen other cases that are not explicitly phrased like that, but might become recommendations in the future.

Do you think it would make sense to add those checks to cldf validate, and report them at a lower logging level than actual non-conformance? I would suggest implementing this by turning non-conformance into ERROR and using WARNING for non-recommended behaviour, but that may break people's assumptions (including my own).

xrotwang commented 2 years ago

pycldf.Dataset.validate accepts a validators argument, specifying additional validators in the format (table, col, callable); see https://github.com/cldf/pycldf/blob/f150f88d3c303a50698cb029d926ba9c1cdc0f53/src/pycldf/validators.py#L31-L48. Maybe you can use this framework for specific scenarios - where a SHOULD is a MUST?
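
A minimal sketch of what such a validator could look like for the identifier-shape recommendation. It assumes the callable is invoked as (dataset, table, column, row) and signals a problem by raising ValueError, which is how the linked validators appear to work; check the linked file before relying on this. The function name, the table and column names in the tuple, and the stand-in column object are illustrative and not part of pycldf.

```python
import re
from types import SimpleNamespace

# Identifier shape recommended (SHOULD) by the CLDF spec:
# alphanumeric characters, underscore and hyphen only.
ID_PATTERN = re.compile(r"[a-zA-Z0-9_\-]+")


def recommended_id_shape(dataset, table, column, row):
    """Raise ValueError if the value in `column` lacks the recommended shape.

    Assumption: the validator is called as (dataset, table, column, row).
    """
    value = row[column.name]
    if value is not None and not ID_PATTERN.fullmatch(str(value)):
        raise ValueError("non-recommended ID shape: {!r}".format(value))


# Stand-in for a column object; only its `name` attribute is used here.
id_column = SimpleNamespace(name="ID")

# Hypothetical entry in the (table, col, callable) format described above.
extra_validators = [("LanguageTable", "ID", recommended_id_shape)]
```

With a real dataset, the list would be handed to Dataset.validate through its validators argument.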

Anaphory commented 2 years ago

That's what I'm doing (or planning to do) as much as possible. A few of my constraints I don't see how to implement as validators, because they need to check different columns in different tables.

I was just wondering whether it might make sense to report violated RECOMMENDATIONS also in cldf validate.

xrotwang commented 2 years ago

Could you give a specific example of a recommendation you'd want to see checked - and how the calling code should be made aware of failures?

Anaphory commented 2 years ago

The things I struggle with most are the complexities of primary and foreign keys, references, and ID columns, which I have brought up here and there from time to time recently.

I have some other suggestions which are not talked about in the CLDF standard yet, but which I think might make good recommendations in the long run (“The elements of the alignment SHOULD match the segments of the form or be '-' – modulo morpheme annotations in segment slices” and such things.)
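
The alignment recommendation quoted above could be checked roughly as follows. This is a sketch under simplifying assumptions: alignment and segments are already lists of strings, the only gap symbol is '-', and segment slices and morpheme annotations are ignored entirely. The function name is hypothetical.

```python
def alignment_matches_segments(alignment, segments, gap="-"):
    """Return True if the alignment, with gap symbols removed, equals the segments."""
    return [element for element in alignment if element != gap] == list(segments)


# Example rows: the first alignment only adds a gap, the second changes a segment.
ok = alignment_matches_segments(["t", "-", "a"], ["t", "a"])
bad = alignment_matches_segments(["t", "o"], ["t", "a"])
```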

I'm not sure how to address it on a caller level, beyond log messages, which is how I mostly use cldf validate right now. I can think of a few options, but I don't know their advantages or disadvantages.

1. Instead of PASS/FAIL, a validator could return the set of constraints violated, and an empty return value would be a “pass”. This would work best if the constraints had persistent URLs, so that the list of violations would just be a list of such URLs. I would look at flake8 and other linters to see whether they have good techniques for this.
2. Like a logger, a validator could have a level it is called at, for example MUST, SHOULD, and implied_should. A validator returns a FAIL if the level it is called at is lower than its own level and the data does not fulfil its constraints; otherwise it returns a PASS, whether or not the data fulfils the constraint.

I think the second approach feels very rigid.
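
To make the first option concrete, here is a minimal sketch of a check that returns the set of violated constraints instead of PASS/FAIL. The constraint identifiers are hypothetical placeholders for persistent URLs, and which constraint counts as MUST or SHOULD is chosen here for illustration only. Rows are assumed to be plain dicts with an "ID" key.

```python
import re

# Hypothetical constraint identifiers; persistent URLs could take their place.
ID_UNIQUE = "MUST:id-unique"
ID_SHAPE = "SHOULD:id-shape"

ID_PATTERN = re.compile(r"[a-zA-Z0-9_\-]+")


def check_ids(rows):
    """Return the set of violated constraint identifiers; an empty set is a pass."""
    violated = set()
    seen = set()
    for row in rows:
        id_ = row["ID"]
        if id_ in seen:
            violated.add(ID_UNIQUE)
        seen.add(id_)
        if not ID_PATTERN.fullmatch(id_):
            violated.add(ID_SHAPE)
    return violated


def has_errors(violations):
    """A caller decides what matters: here only MUST violations count as errors."""
    return any(v.startswith("MUST:") for v in violations)
```

The caller can then map SHOULD violations to warnings and MUST violations to a non-zero exit code, much as linters do with error codes.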

xrotwang commented 2 years ago

Hm. We are not really close to a PR here, I guess :) To add another perspective: Since the referential consistency checks may be quite time consuming, I thought about outsourcing these to a relational database. I.e. replace these checks with

loading the dataset into SQLite
passing through SQLite referential integrity errors
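
The two steps can be sketched with Python's built-in sqlite3 module. The schema and rows below are made up for illustration and do not use pycldf's own SQLite loading; the point is only that PRAGMA foreign_key_check lets SQLite report dangling references after a bulk load.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Keep enforcement off so the load succeeds and violations are reported afterwards.
    PRAGMA foreign_keys = OFF;
    CREATE TABLE LanguageTable (ID TEXT PRIMARY KEY);
    CREATE TABLE FormTable (
        ID TEXT PRIMARY KEY,
        Language_ID TEXT REFERENCES LanguageTable(ID)
    );
""")

# Step 1: load the (toy) dataset; the second form points at a missing language.
conn.executemany("INSERT INTO LanguageTable VALUES (?)", [("lang1",)])
conn.executemany(
    "INSERT INTO FormTable VALUES (?, ?)",
    [("f1", "lang1"), ("f2", "missing")],
)

# Step 2: pass through SQLite's referential integrity errors.
# Each result row is (child table, rowid, parent table, foreign key index).
violations = conn.execute("PRAGMA foreign_key_check").fetchall()
for child, rowid, parent, fk_index in violations:
    print("{} row {} has no match in {}".format(child, rowid, parent))
```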

So while "more" validation might be desirable, "less" (or preferably quicker) validation is another requirement.

Given these somewhat conflicting ideas, I'd suggest leaving Dataset.validate as is, and thinking about additional sets of validators - and maybe a way to pass these into Dataset.validate.

Anaphory commented 2 years ago

Ah, speed is a very good point.

Some of my check ideas might indeed also benefit from a relational database approach, as opposed to caching in dictionaries, which is what I resort to now.

If I find any nice ideas while implementing my extended tests, I'll come back here with something more concrete.

xrotwang commented 1 year ago

Can be reopened if a corresponding PR is under review.

cldf / pycldf

MUST/SHOULD in CLDF specs = ERROR/WARNING in validate? #146