expath / xpath-ng

Wishlist for XPath Syntax Extensions
Creative Commons Attribution 4.0 International
12 stars 4 forks source link

Slicing (sequences and arrays) #22

Open michaelhkay opened 3 years ago

michaelhkay commented 3 years ago

I'm proposing an extension to range expressions so you can do

0 by 2 to 10 => 0,2,4,6,8,10
5 by -1 to 1 => 5,4,3,2,1

And then a slice() function so that

slice((A,B,C,D,E), 3 to 5) => C,D, E
slice((A,B,C,D,E), -1) => E
slice((A,B,C,D,E), 1 by 2 to 5) => A, C, E
slice((A,B,C,D,E), 5 by -1 to 2) => E, D, C, B
slice((A,B,C,D,E), -1 to 4) => E, A, B, C, D
slice((A,B,C,D,E), (1,5,4)) => A, E, D
michaelhkay commented 3 years ago

And then a corresponding function array:slice().

For definitional purposes I'm wondering about defining operations like slice, subsequence, remove, reverse, insert-before in terms of corresponding operations on arrays, so fn:slice($seq, $range) is defined as array:slice(array{$seq}, $range)?*.s

A difficulty with this is the different conventions for handling "index out of bounds" conditions.

benibela commented 3 years ago

I'm proposing an extension to range expressions so you can do

0 by 2 to 10 => 0,2,4,6,8,10

That by could be defined as a filtering operator

  0 to 10 by 2 => (0 to 10) by 2 => 0,2,4,6,8,10

As such it could work with all sequences

  ("a", "b", "c", "d", "e") by 1 => ("a", "b", "c", "d", "e")
  ("a", "b", "c", "d", "e") by 2 => ("a", "c", "e")
  ("a", "b", "c", "d", "e") by 3 => ("a", "d")
  ("a", "b", "c", "d", "e") by 4 => ("a", "e")
  ("a", "b", "c", "d", "e") by 5 => ("a")

And then a slice() function so that

Or in the filter expression

 (A,B,C,D,E)[3, 4, 5] => C,D, E
 (A,B,C,D,E)[3 to 5] => C,D, E
 (A,B,C,D,E)[1 to 5 by 2 ] => A, C, E

the different conventions for handling "index out of bounds" conditions.

have been an extraordinary bad idea from the beginning

michaelhkay commented 3 years ago

On 21 Nov 2020, at 22:48, Benito van der Zander notifications@github.com wrote:

I'm proposing an extension to range expressions so you can do

0 by 2 to 10 => 0,2,4,6,8,10 That by could be defined as a filtering operator

0 to 10 by 2 => (0 to 10) by 2 => 0,2,4,6,8,10 As such it could work with all sequences

("a", "b", "c", "d", "e") by 1 => ("a", "b", "c", "d", "e") ("a", "b", "c", "d", "e") by 2 => ("a", "c", "e")

Yes, I did think about that, and I'm not sure now why I didn't go for it. Probably decided it was too "clever". I think it parses unambiguously, despite "by" already being used as a language keyword (in "group by" and "order by"). In fact, keeping "to" and "by" as regular binary operators makes parsing easier, at least if you're got a precedence-based parser.

Or in the filter expression (A,B,C,D,E)[3, 4, 5] => C,D, E (A,B,C,D,E)[3 to 5] => C,D, E (A,B,C,D,E)[1 to 5 by 2 ] => A, C, E That gets quite messy in edge cases especially where the expression used in the predicate is context-dependent, and can yield either a boolean or a number. It would be a natural if we didn't already have the difficult overloading of A[B] to do both filtering and subscripting. You have to ask what expressions like //empl[1 to xs:integer(@height)] are supposed to mean. Or for that matter, //empl[1 to position()]. It might be doable, but the rules would certainly be complicated.

One could consider a new operator that only does numeric indexing, e.g. //emp[| 1 to 10 |] but then you might as well use a function.

The real issue here is that with numeric indexing, you really don't want the expression that computes the subscript to be focus-dependent. Perhaps we could formalise a definition of "focus-dependent expression" (that's statically determinable) and then use it in the semantics: "if the predicate is focus-independent and evaluates to a sequence of integers, then...". (It doesn't help that non-integer numerics also need to be considered.)

There's also the question about whether you retain the order of the sequence or the order of the integers. With the slice() function, I consciously made slice((A,B,C), (2,1)) return (B,A) rather than (A,B), but the other option is also viable, and more consistent if you think of [] as a filtering operation.

the different conventions for handling "index out of bounds" conditions.

have been an extraordinary bad idea from the beginning

Yes. Classic committee problem: there are arguments in favour of making it an error, and arguments in favour of making it select nothing, but there are no good arguments for doing it one way for sequences and a different way for arrays.

Michael Kay Saxonica

michaelhkay commented 3 years ago

That by could be defined as a filtering operator

0 to 10 by 2 => (0 to 10) by 2 => 0,2,4,6,8,10 As such it could work with all sequences

("a", "b", "c", "d", "e") by 1 => ("a", "b", "c", "d", "e") ("a", "b", "c", "d", "e") by 2 => ("a", "c", "e")

Yes, I did think about that, and I'm not sure now why I didn't go for it.

I've now remembered why I didn't do it this way: I wanted to allow descending sequences like (10 to 1 by -1) or (-1 to -5 by -2).

(10 to 1) is an empty sequence, so ((10 to 1) by -1) would still be an empty sequence if "by" is a binary operator.

Michael Kay Saxonica

rhdunn commented 3 years ago

I personally find 1 to 10 by 2 easier to read than 1 by 2 to 10, that is -- define the range, then the step. The former is also the way other languages like Scala define the ordering.

Other than that, I'm in favour of this extension.

benibela commented 3 years ago

That gets quite messy in edge cases especially where the expression used in the predicate is context-dependent, and can yield either a boolean or a number. It would be a natural if we didn't already have the difficult overloading of A[B] to do both filtering and subscripting.

I have argued for these things before

There I wrote A[B] could mean A[let $b := B return if (count($b) le 1 or head($b) instance of node()) then $b else position() = $b ]

Or perhaps A[ position() = B[. instance of xs:numeric] or B[not(. instance of xs:numeric)] ] would be more intuitive (but a bigger change, since it would treat [(1, <x/>)] as [true()])

You have to ask what expressions like //empl[1 to xs:integer(@height)] are supposed to mean. Or for that matter, //empl[1 to position()]. It might be doable, but the rules would certainly be complicated.

//empl[1 to position()] simply would return everything

//empl[1 to xs:integer(@height)] would probably be //empl[position() <= xs:integer(@height)]

benibela commented 3 years ago

(10 to 1) is an empty sequence, so ((10 to 1) by -1) would still be an empty sequence if "by" is a binary operator.

Perhaps by with negative number could be defined to reverse the sequence

Then 1 to 10 by -1 would be descending