TEIC / TEI

The Text Encoding Initiative Guidelines
https://www.tei-c.org

Requirements for New Sanity Checker #1991

Open hcayless opened 4 years ago

hcayless commented 4 years ago

Roma used to have a feature that would allow you to check the "sanity" of a constructed schema. This is a desirable thing and we want to figure out how to re-implement it rather than attempt to resurrect the ancient and long-broken PHP implementation that used to do this. This ticket is intended to capture requirements for a new sanity checker.

The tool should at a minimum check that included features in a schema are reachable (e.g. no elements that are not referenced in any content models).
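The reachability requirement above can be read as a graph search. The following is only a minimal sketch, assuming the ODD has already been resolved into a plain mapping from each element name to the element names its content model references (with classes and macros expanded). That mapping, the function name `unreachable_elements`, and the toy schema are all hypothetical and not part of Roma or the Stylesheets.

```python
from collections import deque


def unreachable_elements(content_models, start):
    """Return the elements that cannot be reached from any start element.

    content_models: dict mapping an element name to the set of element
    names its content model references (assumed to be pre-resolved).
    start: iterable of element names allowed as the document root.
    """
    # Seed the search with the start elements that actually exist.
    queue = deque(name for name in start if name in content_models)
    seen = set(queue)
    while queue:
        current = queue.popleft()
        for child in content_models[current]:
            # Only follow references to elements defined in the schema.
            if child in content_models and child not in seen:
                seen.add(child)
                queue.append(child)
    return set(content_models) - seen


# Toy schema: "orphan" is defined but no reachable content model refers to it.
toy_schema = {
    "TEI": {"teiHeader", "text"},
    "teiHeader": {"fileDesc"},
    "fileDesc": set(),
    "text": {"body"},
    "body": {"p"},
    "p": set(),
    "orphan": {"p"},
}

print(sorted(unreachable_elements(toy_schema, ["TEI"])))  # ['orphan']
```

A breadth-first search from the start elements is used rather than a simple "is it referenced anywhere" test, so that a cluster of elements that only reference each other is still reported as isolated.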

jamescummings commented 4 years ago

In addition to object isolation as @hcayless mentions above, the new system might:

That's what occurs to me off the top of my head.

ebeshero commented 4 years ago

Noting from our F2F discussion that this will likely involve some Schematron and XSLT work, and we might want to organize a group of us working on this optimally before the next Roma release, though this work is separate from Roma itself.

sydb commented 4 years ago

I overall like James’ list, with a few concerns.

  • warn if not well formed and valid ODD

There is no point in checking well-formedness: Roma, the tool we are discussing, OxGarage, the Stylesheets … nothing will work if the ODD is not well-formed, and anything you do will tell you if it is not. Checking validity … sure, but against what?

  • warn where Schematron rules are no longer applicable but still included

How in BLEEP are we supposed to check whether a rule is applicable or not? (We can test if it is outdated or not by looking at @validUntil, if it has one. But applicable? What does that even mean?)
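The "outdated" test mentioned here is mechanical enough to sketch. The example below assumes that `@validUntil` holds a plain ISO date (`yyyy-mm-dd`) and that only `constraintSpec` elements are of interest. The function name `expired_constraints` and the sample ODD fragment are made up for illustration.

```python
import xml.etree.ElementTree as ET
from datetime import date

TEI_NS = "http://www.tei-c.org/ns/1.0"


def expired_constraints(odd_xml, today):
    """List the @ident of every constraintSpec whose @validUntil is before today."""
    root = ET.fromstring(odd_xml)
    expired = []
    for spec in root.iter(f"{{{TEI_NS}}}constraintSpec"):
        valid_until = spec.get("validUntil")
        # Constraints without @validUntil are skipped: nothing can be said about them.
        if valid_until and date.fromisoformat(valid_until) < today:
            expired.append(spec.get("ident"))
    return expired


# Hypothetical fragment; the idents and dates are invented for this example.
SAMPLE = """<schemaSpec xmlns="http://www.tei-c.org/ns/1.0" ident="demo">
  <constraintSpec ident="old-rule" scheme="schematron" validUntil="2019-01-01"/>
  <constraintSpec ident="current-rule" scheme="schematron" validUntil="2999-01-01"/>
  <constraintSpec ident="undated-rule" scheme="schematron"/>
</schemaSpec>"""

print(expired_constraints(SAMPLE, date(2020, 6, 1)))  # ['old-rule']
```

This says nothing about whether a rule is "applicable"; it only flags rules whose stated date has passed.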

And add:

Note that quite a few of the items listed are checked by the tei_customization schemas.

hcayless commented 4 years ago

A further note on James's list:

  • warn if required TEI elements aren't available like titleStmt

Will be difficult absent some intrinsic attribute of "requiredness" on TEI elements. That is, it would require that we add some property to the formal definition of the TEI or that the sanity checker know things about TEI that TEI does not know about itself.

This starts to get into notions of "clean" (or not) customizations which are problematic.

sydb commented 4 years ago

Well, yes in theory, but in practice you can just list the absolute requirements. (That is what tei_customization does.) They are <TEI>, <teiHeader>, <fileDesc>, <titleStmt>, <title>, <publicationStmt>, and <sourceDesc>. That list is not likely to change much in any rapid way at all.
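As a rough illustration of "just list the absolute requirements", the check could be as small as the sketch below. The seven names come from the list above; the function name `missing_required` and the idea that the set of available element names has already been extracted from the compiled ODD are assumptions, and this is not how tei_customization itself is implemented.

```python
# The absolute requirements named in this comment.
REQUIRED_ELEMENTS = (
    "TEI",
    "teiHeader",
    "fileDesc",
    "titleStmt",
    "title",
    "publicationStmt",
    "sourceDesc",
)


def missing_required(available):
    """Return the required element names that are absent from the schema."""
    return [name for name in REQUIRED_ELEMENTS if name not in available]


# Hypothetical customization that forgot the module defining <TEI>.
available_elements = {
    "teiHeader", "fileDesc", "titleStmt", "title",
    "publicationStmt", "sourceDesc", "p",
}

print(missing_required(available_elements))  # ['TEI']
```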

That leaves out conditional requirements, of course. E.g. that you have something that can go inside <sourceDesc> . Or e.g. that if you have a <text> in your schema (which you might not — you might only be interested in <sourceDoc> or <standOff>), then you must have a <body>. (The opposite, BTW, is not true: you can have a <body> without a <text>, as <body> can go inside <floatingText> which might occur in a <note> somewhere inside <sourceDoc> or <standOff>.)
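Conditional requirements like these could be expressed as "if X is available, at least one of Y must be too". The sketch below is one possible encoding, not an agreed design. The rule table is illustrative only, and the children listed for `sourceDesc` are a small sample rather than its full content model.

```python
# Each rule: (trigger element, set of elements of which at least one is needed).
CONDITIONAL_RULES = [
    ("text", {"body"}),
    # Illustrative subset of what may appear inside <sourceDesc>.
    ("sourceDesc", {"p", "bibl", "biblStruct", "msDesc"}),
]


def conditional_violations(available):
    """Return (trigger, needed) pairs for every rule the schema breaks."""
    problems = []
    for trigger, any_of in CONDITIONAL_RULES:
        # A rule only applies when its trigger element is in the schema.
        if trigger in available and not (any_of & available):
            problems.append((trigger, sorted(any_of)))
    return problems


# <text> without <body> is flagged; <sourceDesc> is satisfied by <bibl>.
print(conditional_violations({"text", "sourceDesc", "bibl"}))
# [('text', ['body'])]

# <body> without <text> is fine, since the rule only runs in one direction.
print(conditional_violations({"body", "floatingText", "sourceDoc"}))
# []
```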

ebeshero commented 4 years ago

This is reminding me of a case in point: one of my former students was writing an ODD customization and inadvertently left out the textstructure module, so she had no TEI root element. It took us a while to realize how she'd done that. I think this sanity checker should be useful to anyone writing ODDs (whether using Roma or not).

sydb commented 4 years ago

Agreed, @ebeshero , but to be fair, tei_customization would have caught that.

ebeshero commented 4 years ago

@sydb This was years ago...not sure if tei_customization was around at that point, but anyway she wasn't using it.

ebeshero commented 4 years ago

More to the point, do we expect people generating an ODD with Roma not to have access to tei_customization?

sydb commented 4 years ago

Sorry, @ebeshero , my head got spun by the double negated expectation. But we expect everyone in the entire world to have access to tei_customization, at least in RelaxNG.

ebeshero commented 4 years ago

@sydb Okay, let's turn that into a positive question: Should the first stage of sanity checking simply be to associate the tei_customization RNG, so then any further sanity checking we design should be complementary to this? I see that there's a handy WWP blog post about how TEI members can access this, as well as your article from 2019 (and I think I remember your presentation), but I'm not at all sure it's widely known as yet. Can we make it more prominent as part of this effort?

sydb commented 4 years ago

@ebeshero you be taking words out of my mouth! Seriously, I would not be surprised if there were a few constraints in tei_customization that we did not want, but something similar to it is probably right on target.

laurentromary commented 4 years ago

I guess you are keeping in mind that the sanity checker should not be too aggressive with specifications that aim to define non-TEI vocabularies or, even worse, those that reuse only some specific TEI crystals in other vocabularies. Thus, checking for the presence of specific TEI header element definitions should only occur when these elements are actually used.

jamescummings commented 4 years ago

I agree with @laurentromary: where I've said above to check this or that TEI-specific thing, that should only happen in files in the TEI namespace.

sydb commented 4 years ago

Actually, no, @laurentromary. I did not imagine performing a sanity check on anything other than a TEI customization file. Thus I think it should be quite aggressive. If you are writing your own language in ODD, you would need your own sanity checking, no?

laurentromary commented 4 years ago

If you guarantee that the sanity checker will not fire in a place where I would use a non-TEI-intended ODD, then I don't care. Probably a roadmap with a test implementation would allow us to see whether we open cans of worms anywhere.

raffazizzi commented 4 years ago

I think there are two levels of checking here:

  1. Is the internal organization of an ODD consistent? E.g. plucking a few from @jamescummings 's list above:

    • warn if not well formed and valid ODD
    • warn of any unused classes
    • warn if any specGrpRef point to things which don't exist
    • warn if schemaSpec start attribute contains an element that is not available in the ODD
    • warn if explicit content models include elements or classes that are not available
    • etc.
  2. ONLY for TEI customizations:

    • warn of any teidata.enumerated attributes for which valLists have not been supplied (the TEI recommends projects provide these)
    • warn if required TEI elements aren't available like titleStmt
    • etc.

For the latter case, we could even just rely on using OxGarage to validate against tei_customization and show a report.
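For the first level (internal consistency), one of the listed checks, "warn if any specGrpRef point to things which don't exist", might look like the sketch below. It assumes that `specGrpRef/@target` is a local pointer of the form `#id` matching a `specGrp/@xml:id` in the same file. Targets in other files are ignored, and the function name `dangling_spec_grp_refs` is hypothetical.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def dangling_spec_grp_refs(odd_xml):
    """Return local specGrpRef targets that match no specGrp/@xml:id."""
    root = ET.fromstring(odd_xml)
    known_ids = {grp.get(XML_ID) for grp in root.iter(TEI + "specGrp")}
    dangling = []
    for ref in root.iter(TEI + "specGrpRef"):
        target = ref.get("target", "")
        # Only same-document pointers ("#some-id") are checked here.
        if target.startswith("#") and target[1:] not in known_ids:
            dangling.append(target)
    return dangling


# Hypothetical ODD fragment with one good and one broken reference.
SAMPLE_ODD = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <specGrp xml:id="header-specs"/>
  <schemaSpec ident="demo" start="TEI">
    <specGrpRef target="#header-specs"/>
    <specGrpRef target="#missing-specs"/>
  </schemaSpec>
</TEI>"""

print(dangling_spec_grp_refs(SAMPLE_ODD))  # ['#missing-specs']
```

Because this check looks only at ODD structure and not at TEI-specific element names, it would apply equally to ODDs that define non-TEI vocabularies.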