expath / xpath-ng

Wishlist for XPath Syntax Extensions
Creative Commons Attribution 4.0 International
12 stars 4 forks source link

Proposal for Restricted Sequences #11

Open rhdunn opened 6 years ago

rhdunn commented 6 years ago

This adds the tuple sequence type concept to SequenceType.

rhdunn commented 6 years ago

I think it is useful to have both a named tuple type based on maps like you are suggesting and have implemented in Saxon, and a sequence based tuple type.

Regarding the examples I have mentioned in this proposal, some of them may be better represented as named tuples (e.g. the point and complex/rational numbers). It depends on the library implementor on what design choices they make, like the choice of sequence or array (both of which have different advantages and disadvantages).

In addition to being able to better type-check existing code that makes use of tuple sequences (like the MarkLogic math:modf example), a vendor may choose to optimise these in assembly -- either specific instructions (like the sincos example) or help vectorise operations it knows can be packed into SSE and related types (such as a matrix library).

Having to specify these functions as named/map tuples would make it harder to convert to efficient assembly instructions.

Regarding the disadvantages you mention:

(a) I don't see being limited to single items is a disadvantage here -- the examples I gave in this proposal do not have variable numbers of items;

(b) Again, I don't see only being able to refer to the items by position instead of name a disadvantage here, especially when you allow the possibility of optimising the code the query runs;

(c) Again, I don't see being able to replace items as a disadvantage -- they are an immutable value, so the modified version will be a new value returned from the expression/function. For example:

declare function swap(
    $pair as (xs:anyAtomicValue, xs:anyAtomicValue)
) as (xs:anyAtomicValue, xs:anyAtomicValue) {
    $pair[2], $pair[1]
};
michaelhkay commented 6 years ago

One other disadvantage of representing complex numbers (say) as sequences is that you then can't have a sequence of complex numbers. I agree that for very simple tuples such as complex numbers, positional access to components works just as well as named access. But (a) named access is more scaleable to more complex problems, and (b) if you're going to use positional access, then surely using arrays rather than sequences gives much more flexibility.

rhdunn commented 6 years ago

These are all design decisions that a developer writing XPath/XQuery code will need to make regardless of whether or not this proposal is accepted. This proposal does not change the behaviour of sequences with regards to flattening. It is designed to provide better static checking in two cases:

  1. When the number of items in a sequence differs from what the function is providing/accepting, and what the user is expecting/providing;
  2. When the types of the items in the sequence differ (like in the math:modf example).

In the current XPath/XQuery syntax, a user can already specify the common type of each item in a sequence and the cardinality (zero or one, zero or more, only one, one or more). This proposal is just expanding on that so an XPath/XQuery processor can provide better static checking.

michaelhkay commented 6 years ago

I think that with the XQuery 1.0 / XPath 2.0 data model this would be a useful enhancement, but I think the use cases for it are obsoleted by maps and arrays.

There's an ambiguity in the grammar that needs to be sorted out. In 3.1, (xs:integer) is a legal instance of ItemType, and under this proposal it becomes a legal instance of SequenceType. You could argue that isn't a problem because it has the same semantics whichever way you parse it, but we shouldn't have this kind of ambiguity in the EBNF. One solution is to require a ParenthesizedSequenceType to contain at least two ItemTypes, but that still leaves you needing indefinite lookahead to resolve the ambiguity (indefinite, because an ItemType can be arbitrarily long).

adamretter commented 6 years ago

I am not opposed to this, but I wonder if we need to so far as adding explicit types - https://github.com/expath/xpath-ng/pull/8#issuecomment-432582158

rhdunn commented 6 years ago

@michaelhkay I'm open to having a different syntax that does not conflict. Maybe something like:

let $c as xs:double+ sequence of (xs:double, xs:double) := (1.0, 2.0)

I'm thinking also in terms of supporting another proposal for union item types:

let $json-doc as ( function(*) union of (map(*), array(*)) )? := load-json("test.json")

This would have the advantage of allowing sequence types in the union of syntax, should be easy to parse without infinite lookahead, and not have the ItemType restriction from ParenthesizedItemType. I'll leave the discussion of the union of syntax to a separate proposal, or is it worth including here?

It may also be possible to drop the initial sequence type:

let $c as sequence of (xs:double, xs:double) := (1.0, 2.0)
let $json-doc as union of (map(*), array(*))? := load-json("test.json")

Using your predicate type syntax, these would be something like:

declare function math:modf($v as xs:double) as xs:anyAtomicValue+[
    count(.) = 2 and
    .[1] instance of xs:integer and
    .[2] instance of xs:double
] external;

let $c as xs:double+[ count(.) = 2 ] := (1.0, 2.0)

let $json-doc as function(*)[ . instance of map(*) or . instance of array(*) ]? :=
    load-json("test.json")
michaelhkay commented 6 years ago

For a union of sequence types it would be neat to reuse the SequenceTypeUnion syntax from typeswitch -- specifically

SequenceTypeUnion ::= SequenceType ("|"  SequenceType)*

Because a SequenceType can appear in "treat as", it has to follow some syntactic constraints and this is best achieved by sticking to the "keyword(details)" convention. I think the term "union" is strongly associated with XSD union types, so I would go for the syntax "anyOf(" SequenceTypeUnion ")"

rhdunn commented 6 years ago

I like the idea of the anyOf (and correspondingly for this, a sequenceOf) syntax. One consideration is that currently there are no keyword tokens in any of the XPath or XQuery specifications (core, full text, updating, scripting) that separate words using capital letters -- instead, they use - like in boundary-space, attribute-or-self, and empty-sequence. As such, I would propose using any-of and sequence-of, i.e.:

"any-of" "(" SequenceTypeUnion ")"
rhdunn commented 6 years ago

I've updated this proposal based on feedback. The text for this version is at https://github.com/expath/xpath-ng/blob/4cac38e54646fe6486b85b2635168d681a5fba5a/restricted-sequences.md.

I'll add a separate proposal for the any-of construct once I have updated my other proposals.

michaelhkay commented 6 years ago

Thanks for this. Sorry if the comments are fairly lengthy, but it tends to be the case that the closer you get to agreement in principle, the more reviewers start noticing the edge cases.

Comments on latest version. Substantive comments where further technical work (rather than editorial work) is needed highlighted in bold.

  1. With the new sequence-of syntax, there is no need to restrict the number of coomponents to one-or-more. sequence-of(X) with a single component is equivalent to X on its own, but one should not prohibit a particular case just because it's not useful. (The orthogonality argument would also extend to allowing sequence-of() as a synonym of empty-sequence()). And in fact these are potentially useful, e.g. if you're generating types from another type model that itself allows sequences of length 0 or 1, then your code doesn't have to treat these cases specially. (Note: you have defined the grammar to allow one or more components. I think that's probably fine.)
  2. Your grammar allows an optional "?" after the list of component item types, but gives no clue what this means.
  3. It's not technically correct to say that this "does not define a new type". It does define a new family of types, but the value space of these types is entirely subsumed by that of existing types. So it doesn't extend the value space of XDM instances, but it does allow you to carve out a subspace that can't be described using 3.1 types.
  4. "If a restricted sequence is assigned a flattened sequence that contains a different number of items than is specified by the restricted sequence type, a type error is raised." That's the wrong kind of language to use: this spec isn't defining what particular operations like "assignment" do (in fact, there's no such operation); rather it is defining the rules for the value space of the type. Something like: A sequence S is an instance of a restricted sequence type T if and only if (a) the number of items in S is equal to the number of item-type components in T, and (b) for every $N in 1 to count(S), the item S[$N] is an instance of the $Nth item-type component in T. These rules should extend the existing text in XQ 2.5.5.1.
  5. "If the types of two restricted sequence types are not compatible (see the judgement subtype(A, B) definition below) a type error is raised." Again, this is out of place. Defining the subtype relation is adequate; the rules for what happens when two types satisfy or don't satisfy the subtype relation appear elsewhere in the spec.
  6. "An XQuery processor may infer a restricted sequence type from an untyped sequence during the static analysis phase." I'm afraid I don't even know what this is trying to say (what is an "untyped sequence"?), but I strongly suspect that it's not necessary to say it.
  7. "Use cases". Your section would be better entitled "Justification" or "Rationale", and could do with expansion. A light expansion might be: "The proposal allows the type of a sequence to be further constrained. This (a) makes code more understandable by documenting the interface between functions; (b) enables improved diagnostics as a result of more precise type checking of function arguments and function results; (c) enables processors (if they wish) to perform more precise static type checking and to use the extra information for optimization.
  8. I think there's a need to specify the impact on the function conversion rules. For example, if the expected type is sequence-of(xs:integer, xs:integer) and an sequence of two xs:untypedAtomic values is supplied, are they converted? (I think they should be; similarly atomization and type promotion from integer to double).
  9. As regards the subtype(A,B) relation:
    • Please note bug https://www.w3.org/Bugs/Public/show_bug.cgi?id=30302 which I found on reading the existing specification text.
    • There's a structural problem in that the table no longer enumerates all the possible kinds of SequenceType, and extending it to do so becomes increasingly difficult. However, I'm inclined at this stage to extend it anyway, by adding a new row and column. But presentation issues aside, let's look at the substance of the new rules.
    • Given that the existing table has six rows and six columns, we need to cover 13 new cases to create a 7x7 table; specifically, (i) subtype(A,B) where A is a restricted sequence type and B is any of the other six categories; (ii) subtype(A, B) where B is a restricted sequence type and A is any of the other six categories; and (iii) subtype(A, B) where A and B are both restricted sequence types.
    • For case (i), the six subcases are (1) B is empty-sequence(): true if and only if A has zero component item types; (2) B is xs:error - always false; (3) B is an item-type b with occurrence indicator: true if (I) every item type a in A satisfies itemType-subtype(a, b), and (II) the number of components in A is acceptable to the occurrence indicator of B, where N is acceptable to I if (3a) I is absent and N is 1, (3b) I is "?" and N is zero or one, (3c) I is "+" and N is one or more, (3d) I is "* and N is anything.
    • For case (ii), subtype(A, B) is true if and only if subtype(B, A) is false
    • For case (iii), subtype (A, B) is true if A and B have the same number of component item types, and for each pair of corresponding item types Ai, Bi, itemType-subtype(Ai, Bi) is true.
    • In the above, I just realised that I failed to consider the case where one of the item types in the restricted sequence type is xs:error. Any such sequence type is equivalent to xs:error, in that it has no instances, and we could define the subsumption rules by reference to the rules for xs:error.
michaelhkay commented 6 years ago

Perhaps the subtype(A,B) rules can be refactored as follows.

A sequence type constrains:

(a) the permitted cardinality of the sequence, as a (possibly infinite) set of non-negative integers. For empty-sequence(), the permitted cardinality is {0}, for xs:error the permitted cardinality is {}, for any other item type it depends on the occurrence indicator: {1}, {0..infinity}, {1..infinity}, or {0,1}; for a restricted sequence type it is the number of component item types.

(b) the required item type of the Nth item in the sequence. For empty-sequence() and xs:error this is xs:error (for all N), for a restricted sequence type it is the Nth component item type, for an item type with occurrence indicator, it is the item type (for all N).

subtype(A,B) is true if and only if both of the following are true:

(a) the permitted cardinality of A is a subset of the permitted cardinality of B

(b) Let Ai be the required item type of item i in A and let Bi be the required item type of i in B; for all i in the permitted cardinality of A, itemtype-subtype(Ai, Bi) is true.

Of course, because this involves infinite sets, this doesn't provide a computable algorithm for determining subtype(A, B), but I believe it provides an adequate mathematical specification.

rhdunn commented 6 years ago

Thanks for your detailed response. Regarding point 6, that was intended for things like:

let $x := (1, 2.0)

here $x has no explicit type. An XQuery processor may infer the type of $x as either item()* (if it does not support inferring types), xs:decimal+ or similar (if it only supports general cardinality/type inherence), or sequence-of(xs:integer, xs:decimal) (if it supports inference of restricted sequences).

Would it make sense to write up something like below as a separate proposal (so the new sequence type and subtype judgement semantics are defined in a single place), or document the parts separately in the different proposals?

subtype judgement

I wonder if it makes sense to revise how these rules are structured to account for all the new types. Maybe adapting and extending something like https://github.com/rhdunn/xquery-intellij-plugin/blob/master/docs/XQuery%20IntelliJ%20Plugin%20Data%20Model.md#214-part-4-sequences, which I have been documenting to define how I am intending on approaching static analysis in my plugin. This would have the advantage of removing the subtype table and making the rules more like the subtype-itemtype rules.

Those rules define Sequences as Part 4 of the type system (section 2.7.4 of the XQuery and XPath Data Model) as a lower bound, upper bound, and item type, where the upper and lower bounds form the cardinality. I then define 3 type operations:

  1. item type union -- This handles both schema union types (your union proposal), and generalised item type union (an any-of proposal). NOTE: I am using union type in that text to refer to the XML Schema union type.
  2. sequence type union -- This is intended for static analysis in my plugin for computing the resulting type from multiple possible sequence types (e.g. the then and else branches of an if expression)
  3. sequence type addition -- This is intended for static analysis in my plugin for computing the type of a flattened sequence.

This proposal could then be implemented such that the upper bound of the resulting sequence type is the number of ItemTypes specified in the subtype, and the ItemType of that sequence is the item type union of the specified ItemTypes. The lower bound would be 0 if ? was used after sequence-of(...), otherwise it would be 0.

michaelhkay commented 6 years ago

Yes, I understand what you're saying about inferring types, but it doesn't need to be said. It's no different from inferring a type for let $x := 3, and the specification doesn't say anything about that; it doesn't need to.

Yes, in reviewing your other proposal I also came to the conclusion that it's worth refactoring the section on subtype judgements. Looking at your IntelliJ page, you've defined the cardinality constraint in terms of a range of integers, whereas I did it as a set of integers; the advantage of using a set of integers is that it allows the empty set, which is what xs:error gives you. With "restricted sequence types" we have the new concept that the required item type becomes dependent on the position of the item in the sequence, and I think the best way to handle that is to make it this dependency a general property of sequence types, and then make the dependency trivial for existing sequence types.

michaelhkay commented 6 years ago

Note that once we establish the idea that a SequenceType determines a required item type for the Nth item in a matching sequence, we can use this concept in the function conversion rules.

Actually this isn't as easy as I first thought. Suppose the required type is sequence-of(xs:integer, xs:double) and the supplied value is @readings and the type annotation of @readings is list(xs:positiveInteger). To make this work, we first have to atomize the node @readings, giving a list of two xs:positiveInteger values, then we have to promote one of these to xs:double.

But what are the conditions under which we atomize, given that the required type could also be sequence-of(xs:integer, attribute())? One approach is to say that we only atomize if all the component item types are (generalized) atomic. Messy, but probably workable.

rhdunn commented 6 years ago

@michaelhkay I have updated the proposal to address your suggestions. See https://github.com/expath/xpath-ng/blob/b9800b9882ded23d812641ef82209cc59861d6cc/restricted-sequences.md for the non-diff version of the latest changes (version 3 of the proposal).

I have updated the subtype(A,B) judgement to use our combined terminology as we were thinking more or less along the same lines, and use your more precise cardinality notation using sets. The main difference with my XQuery IntelliJ plugin text is to make the upper/lower bounds optional to support xs:error (i.e. representing the empty set), and to allow any value for the upper bound (allowing the restricted sequences in this proposal).

I haven't looked at your last comment regarding attributes and atomization yet.