Schematron / schematron

Schematron "skeleton" - XSLT implementation
MIT License

Optimization request: skip validating branch under invalid element #18

Open rjelliffe opened 7 years ago

rjelliffe commented 7 years ago

Request from Patrik Stellmann on schematron-love-in mail list. I think this would be useful as a selectable command-line optimization, but not the default.

Hi,

here's another feature I already implemented and successfully used for my real application.

Background: Currently, when a rule is matched and the corresponding reports and asserts are handled, the follow-up action is to process the content (@follow-up = 'process-content'). However, there are two alternatives you could think of: check the next matching rule (@follow-up = 'next-match') or abort the validation of this item (@follow-up = 'skip-content').

In addition to the attribute "follow-up" for the element "rule", I added an attribute "rule-follow-up" to the element "pattern", which is used as the default for all rules it contains.
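
As a rough illustration of the two attributes: the sketch below assumes the proposed follow-up and rule-follow-up attributes, which are an extension and not part of ISO Schematron. The pattern id, the element names in the contexts, and the messages are made up for the example.

```xml
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron">
  <!-- rule-follow-up (proposed, non-standard): default follow-up for all rules in this pattern -->
  <sch:pattern id="structure" rule-follow-up="next-match">
    <!-- After this rule fires, the next rule matching the same node is tried as well -->
    <sch:rule context="section">
      <sch:assert test="title">A section needs a title.</sch:assert>
    </sch:rule>
    <!-- In standard Schematron this rule would never fire, because every
         section is already claimed by the rule above.
         follow-up (proposed, non-standard) overrides the pattern default here. -->
    <sch:rule context="section[@draft = 'yes']" follow-up="process-content">
      <sch:report test="true()">Draft section.</sch:report>
    </sch:rule>
  </sch:pattern>
</sch:schema>
```

An engine that does not know the attributes would presumably ignore or reject them, so the same schema may behave differently outside an implementation that supports this extension.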

The 'next-match' option has almost the same effect as putting each rule in its own pattern. However, there are still some differences:

  1. Using the skeleton implementation of schematron, each pattern equals an apply-templates to the complete input while the alternative results only in a single apply-templates - with more templates to check, though. Still I could reduce my validation time (2.5MB input xml file, 107 patterns) from 4.8 to 2.3 seconds.
  2. With single patterns the messages are ordered first by the rule and second by the position in the input. The alternative sorts first by the position in the input and second by the rule.

When collecting rules (not patterns) from an xml schema file this would also be the configuration of the default-pattern.

My use case for 'skip-content' is the following: In our large XML document, split into several files, there are some files with currently poor quality and, thus, several errors reported by Schematron. Now I've added the ability to mark single topics as 'in revision'. When the topic itself is being validated, all the errors it contains are reported. But when the complete book is being validated, for each such topic there is only a single message like 'Warning: Document xy is in revision - no more errors are reported'. The rule matching such a topic (context = "/topic[@status = 'inRevision']") has @follow-up set to 'skip-content' to avoid the validation of the content. Another use case would be to suppress a complete pattern by defining a rule like "context = '/[]'" depending on some condition when the validation makes no sense.
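
A minimal sketch of how the 'in revision' rule could look, again assuming the proposed (non-standard) follow-up attribute. The context and the message text are the ones described above; the pattern id, the role value, the @id attribute and the second rule are hypothetical.

```xml
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron">
  <sch:pattern id="topic-checks">
    <!-- follow-up="skip-content" (proposed, non-standard): report once,
         then do not validate anything below this topic -->
    <sch:rule context="/topic[@status = 'inRevision']" follow-up="skip-content">
      <sch:report test="true()" role="warning">Warning: Document <sch:value-of select="@id"/> is in revision - no more errors are reported</sch:report>
    </sch:rule>
    <!-- Hypothetical content rule; it is only reached for topics not in revision -->
    <sch:rule context="p">
      <sch:assert test="normalize-space(.) != ''">A paragraph must not be empty.</sch:assert>
    </sch:rule>
  </sch:pattern>
</sch:schema>
```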

Regards, Patrik

rjelliffe commented 7 years ago

Seems reasonable option.

PStellmann commented 7 years ago

This option might be difficult to implement for Schematron engines not based on XSLT. So I provided possible alternatives in the two new issues #37 and #38 that would either not affect any other implementation at all (#38) or should be easy to realize in any other implementation (#37), since they only require basic XPath evaluation, which is already an essential part of Schematron.

rjelliffe commented 7 years ago

I don't see that there is any constraint on an XSLT implementation to only provide optimisations that other non-XSLT implementations can provide, is there?

rjelliffe commented 7 years ago

I have added comments to #37 and #38.

Regards Rick

PStellmann commented 7 years ago

Since @follow-up requires a modification of the schema as well, I was hoping this might become part of the spec, not just the skeleton implementation. And after thinking more about this feature, I think it should be fairly simple to integrate into any other implementation as well.

Nevertheless, having this feature in the skeleton implementation and, thus, in oXygen (at least soon) is actually enough for my personal requirements.

rjelliffe commented 7 years ago

If I implement it, it would either be

  • a different namespace, so that there is no schema change and it is clear it is an extension, or

  • a command line option

Regards Rick

rjelliffe commented 7 years ago

Oh, to be clearer, I am thinking that there are four or so related requirements for skipping rules, and it would be good to see if they can be amalgamated in some way, with a common approach.

FYI, I think that features relating to pattern chaining should build on the phase mechanism, features relating to document branch trimming should build on pattern/@document, and features relating to fail-fast and optimisation should not be part of the schema at all and should be done on the command line. I don't know that this new thing belongs in any of those buckets, though...

Regards Rick

PStellmann commented 7 years ago

I very much like the idea of separating the requirements more clearly. And I think that I have mixed these things quite a lot in my different issues.

But I think it is sometimes difficult to split (performance) optimization and schema. Or at least the optimization could be much easier to implement if the Schematron rules provided more information on how to optimize them. For instance, any requirement of skipping rules can be handled by adding conditions to the XPath (something like [my:should-be-skipped(ancestor::*)]). But this is neither easy to optimize (by aborting the processing) nor comfortable to write or read in the Schematron rules.

On the other hand, I think it should be possible to detect rules containing boolean variables (like /*[$isMyTopic]//*) and create XSLT that checks the variable $isMyTopic only once before the recursive apply-templates.
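
To illustrate the idea, here is a hand-written sketch of what such generated XSLT might look like. It is not actual skeleton output: the mode name, the definition of $isMyTopic and the template layout are all assumptions.

```xml
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Hypothetical boolean flag; a real schema would define it in its own way -->
  <xsl:variable name="isMyTopic" select="/*/@class = 'myTopic'"/>

  <!-- Entry point for the pattern: test the flag once, then recurse -->
  <xsl:template match="/" mode="my-pattern">
    <xsl:if test="$isMyTopic">
      <xsl:apply-templates select="*" mode="my-pattern"/>
    </xsl:if>
  </xsl:template>

  <!-- Rule template for the context /*[$isMyTopic]//* with the
       $isMyTopic predicate factored out of the match pattern -->
  <xsl:template match="/*//*" mode="my-pattern">
    <!-- asserts and reports of the rule would be generated here -->
    <xsl:apply-templates select="*" mode="my-pattern"/>
  </xsl:template>

  <!-- Fallback for elements not matched by any rule (here: the root element) -->
  <xsl:template match="*" mode="my-pattern" priority="-1">
    <xsl:apply-templates select="*" mode="my-pattern"/>
  </xsl:template>
</xsl:stylesheet>
```

When the flag is false, the whole tree is skipped after a single test instead of evaluating the predicate for every candidate node.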

Nevertheless, I agree that we should try to find a common approach for skipping rules (which might make my previous idea obsolete). So as a (new) starting point the list of requirements I'm aware of is:

  1. Abort the validation on a node specified by XPath. - This might be applied to a phase, pattern or rule.
  2. Avoid the validation of a document. - Could be seen as a special case of 1.
  3. Abort the validation when a previous phase failed (#22). - This might be specified as an XPath expression on a phase with a special function or variable.

Maybe you could add the requirements I've missed.

BTW: It is difficult for me to check if these requirements are related to phases or pattern/@document since I have no experience with these features. Is there some documentation beyond the spec for pattern/@document you could recommend?