Fix annotation and namespace issue

puetzk commented 4 years ago

Fixes #16, and a variety of other issues hit in processing my existing grammars to extract documentation

Support documentation (or any other annotations) on groups and references to a named define
use a: as the xmlns prefix for http://relaxng.org/ns/compatibility/annotations/1.0, but don't assume that the user has already defined it, and make sure to qualify the case where they defined namespace a=... to be something else too.
Follow the spec for whitespace handling in a:documentation
Fix combining |=, &= productions in the presence of annotations.
Fix handling of datatag and literals with possibly-namespaced type references.

puetzk commented 4 years ago

For now using the rule that the first datatype reference becomes the default datatypeLibrary for the whole <grammar>, and is therefore not referenced explicitly. trang uses this same rule. That minimizes changes to the output, but it does cause a few headaches (and at least one failing test, hence the draft status).

https://github.com/djc/rnc2rng/blob/2b2c3c7c7f6a59055e68aea53769028609b469e9/rnc2rng/serializer.py#L244 requires that the root and included file have chosen the same default datatypeLibrary, and with this rule that is not the case.

it's easy to fix, but I'm submitting the draft for an opinion on which way you'd prefer:

In a lot of cases, your output has already done many (though not all) of the RELAX NG simplification steps. e.g. you use <element><name ns="...">a</name></element> instead of <element name="prefix:a"/>, you write the redundant <group> tags, etc. So brevity doesn't seem to be a huge goal. serializer.py could just write the datatypeLibrary on every <data> and <value> element, and not set a <grammar>-level default. But this changes quite a lot of test outputs (in the trivial sense of adding datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes" in a lot of places.
the root could force any already-chosen default down for any included file, and/or adopt any default that the included file picked if it hadn't yet picked itself (i.e. need for types prior to the include). This ensures consistency at the expense of having the included file alter how the parent serializes.
include could (IMO should) be translated from RNC into the RNG <include> element, instead of following the simplification to inline it all away. But then you have to write multiple output files, which doesn't really work with the current shape of dumping it all to stdout.
Solution 1, but only for elements in an include (anything from an include is written explicitly). Or, they are allowed to use a default from the root, but not to set a new one if it hasn't been chosen yet

Probably still more variations possible... all are basically equivalent in terms of what you end up with after simplification, of course.

puetzk commented 4 years ago

(Note the committer and author are different on your patches. I'm fine with that, but usually when I end up in that situation it's because I tried to change it after the fact and partially failed...)

The names are the same, but the dates differ, because I did a bunch of rebase/squashing to put things into logical chunks. I can reset that before I I move it out of draft.

puetzk commented 4 years ago

On the issue of multiple type libraries, I wonder if it would be hard to use a hybrid approach: use your approach 1 but only if the type library is not the default/first. What do you think?

It isn't really about multiple type libraries (though I did test that), as it is about handling both xsd:type and the built-in token/string types, and getting the namespaces right. RELAX has no concept of a default datatypeLibrary like it does for default namespaces (in fact RELAX NG doesn't even have prefixes, Only the compact syntax introduces that concept of prefixes, the XML scheme always just uses the full URLs, but might copy the datatypeLibrary attribute from an ancestor element during simplification if it's missing.

I suppose since xsd is magically predefined: https://relaxng.org/compact-20021121.html

In the initial environment used for the start symbol [...] and xsd is bound as a datatype prefix to http://www.w3.org/2001/XMLSchema-datatypes;

And cannot even be changed:

It is an error if the value of datatypePrefix is xsd and the the value of the literal is not http://www.w3.org/2001/XMLSchema-datatypes.

We could just run with that bit of magic; a grammar-level datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes" gets printed if xsd:* was ever referenced, and otherwise nothing is set for on <grammar datatypeLibrary="...">. Any other datatypes URI (that is not http://www.w3.org/2001/XMLSchema-datatypes) is always written explicitly on <data> or <value>, and thus doesn't care what's in its ancestors.

That avoid most of the cosmetic changes to changed output, and I think it would match what previous version (which just ignored the datatypePrefix and had the needs['types'] thing) did except where the previous one was subtly incorrect (e.g. using string, and having it get turned into xsd:string.

And then that assert just gets removed, because they can only disagree about whether it needs to get printed (if either wanted to print it, it does), not about what value it can have (which might not be reconcilable).

I like that solution, will try to make it so.

djc commented 4 years ago

Ah, maybe the names are the same, but the emails are different? GitHub shows me two different identities. I never really look at the dates, so you don't have to fix them up on my account.

The type library solution you came up with sounds great!

djc commented 4 years ago

Hi @puetzk, are you still interested in finishing this up?

djc commented 2 years ago

Going to close this for now. Feel free to reopen (or open a different PR) if this is of interest.

djc / rnc2rng

Fix annotation and namespace issue #19