SASDigitalHumanitiesTraining / TextEncoding

Text Encoding for Ancient and Modern Literature, Languages and History

Question about xpath // #26

Open bitparity opened 2 years ago

bitparity commented 2 years ago

So at 11:42 of "Advanced Digital Editing: Introduction to XPath II", it says ./descendant::head is the same as //head, which I do find to be the case. But the book XQuery for Humanists (p. 62) says that a double slash (//) stands for /descendant-or-self::node()/.

I know from debugging a problematic query that /descendant-or-self::node()/head is not the same as /descendant-or-self::head (particularly when it comes to looking for attributes within /head, which I think are technically siblings, not descendants), but I don't know why, especially since functionally it seems to just make // equivalent to ./descendant::head, as mentioned in the video.

Can you possibly explain the difference between the two definitions (yours and XQH's) for // ?

gabrielbodard commented 2 years ago

Two differences:

  1. descendant-or-self:: is correct, because // can also find the root node, not only descendants of it;
  2. node() is technically correct, but irrelevant in practice—among other things it allows the XPath to match nodes other than elements, but I don't think attributes, text nodes, processing instructions or comments will ever have child nodes—at least not in the kind of XML we're likely to need to work with.

I think that the two definitions are functionally equivalent though. Can you find an example of an XPath match for /descendant-or-self::node()/element that gives different results or counts from /descendant::element ?

bitparity commented 2 years ago

So I've managed to draw up a test example illustrating the issue.

Sample xml:

<body>
    <p lang="la" id="p-1">
        <s id="s-1">sent 1</s>
        <s id="s-2">sent 2</s>
    </p>
    <p lang="en" id="p-2">para 2</p>
</body>

The goal is to find all elements that have an @id attribute and that either carry @lang = "la" themselves or sit inside an element that does.

The two XPath expressions below are equivalent according to the definitions of // in the XQH book and the workshop YouTube video. However, neither of them returns the <p> element that has both @lang and @id attributes:

.//*[./@lang = "la"]/descendant-or-self::node()/*[./@id != ""]
.//*[./@lang = "la"]//*[./@id != ""]

returns

<s id="s-1">sent 1</s>
<s id="s-2">sent 2</s>

The XPath expression below DOES return the <p> element with both @lang and @id attributes, which shows that it is not equivalent to the two above.

.//*[./@lang = "la"]/descendant-or-self::*[./@id != ""]

returns

<p lang="la" id="p-1">
   <s id="s-1">sent 1</s>
   <s id="s-2">sent 2</s>
</p>
<s id="s-1">sent 1</s>
<s id="s-2">sent 2</s>
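For anyone who wants to reproduce this without an XPath processor, here is a rough sketch in Python using only the standard library. ElementTree does not support these axes in its own XPath subset, so the steps are emulated by hand: the helper names are my own, <body> is assumed to be the context node, and descendant-or-self::node() is approximated with elements only (which should not matter here, since text nodes have no children).

```python
import xml.etree.ElementTree as ET

SAMPLE = """<body>
    <p lang="la" id="p-1">
        <s id="s-1">sent 1</s>
        <s id="s-2">sent 2</s>
    </p>
    <p lang="en" id="p-2">para 2</p>
</body>"""


def descendant_or_self(elem):
    """Element-only stand-in for descendant-or-self::node()."""
    return list(elem.iter())  # iter() yields elem itself, then its descendants


def child_elements(elem):
    """Stand-in for child::*."""
    return list(elem)


def has_nonempty_id(elem):
    """Stand-in for the predicate [./@id != ""]."""
    return elem.get("id", "") != ""


def unique(elems):
    """XPath returns node sets, so drop duplicates while keeping order."""
    return list(dict.fromkeys(elems))


body = ET.fromstring(SAMPLE)  # assumption: <body> is the context node

# .//*[./@lang = "la"]
latin = [c for d in descendant_or_self(body)
         for c in child_elements(d) if c.get("lang") == "la"]

# ...//*[./@id != ""]  i.e.  /descendant-or-self::node()/child::*[...]
via_double_slash = unique(c for ctx in latin
                          for d in descendant_or_self(ctx)
                          for c in child_elements(d) if has_nonempty_id(c))

# .../descendant-or-self::*[./@id != ""]
via_axis = unique(d for ctx in latin
                  for d in descendant_or_self(ctx) if has_nonempty_id(d))

double_slash_ids = [e.get("id") for e in via_double_slash]
axis_ids = [e.get("id") for e in via_axis]
print(double_slash_ids)  # ['s-1', 's-2']
print(axis_ids)          # ['p-1', 's-1', 's-2']
```

The only difference between the two comprehensions is the extra child step in the first one, and that step is what drops the <p> element itself.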

I'm sure most of the time this is just theoretical, but this is a specific instance where it affected one of my queries. I agree that thinking of // as descendant:: is easier, which is why I was puzzled by the XQH book's full definition, /descendant-or-self::node()/, which appears to be both true AND confusing (since it apparently cancels the self part out).

gabrielbodard commented 2 years ago

Interestingly, this looks like it has just proved that when you want descendant-or-self::* you can't just use .//*, which in practice means descendant::*

So while I have no doubt the XQH definition is correct, it doesn't look like ours is wrong after all…

bitparity commented 2 years ago

I think I realized what the problem was, from p.53 of the Walmsley XQuery book (which also gave the same node definition for //).

Whenever you type the name of an element after a /, it is technically child::element.

So //element is technically /descendant-or-self::node()/child::element, which forces the search for <element> down to the descendants but not the self of the context, making it different from /descendant-or-self::element.

I think anyways.
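If that reading is right, the "self" part is not wasted: it is what lets the implicit child:: step reach the direct children of the context. Below is a small sketch of that idea in Python with the standard-library ElementTree, again emulating the axes by hand rather than running real XPath. The variable names are mine, the context node is assumed to be the first <p> from the sample, and only element nodes are modelled.

```python
import xml.etree.ElementTree as ET

# Assumption: the context node is the first <p> from the sample above.
p = ET.fromstring(
    '<p lang="la" id="p-1"><s id="s-1">sent 1</s><s id="s-2">sent 2</s></p>'
)


def ids(elems):
    """Report elements by their id attribute for easy comparison."""
    return [e.get("id") for e in elems]


# descendant-or-self::node()/child::s  (what .//s expands to)
with_self = [c for d in p.iter() for c in d if c.tag == "s"]

# descendant::node()/child::s  (the same expansion with "self" left out)
without_self = [c for d in p.iter() if d is not p for c in d if c.tag == "s"]

# descendant::s  (the shorthand used in the video)
descendant_s = [d for d in p.iter("s") if d is not p]

print(ids(with_self))     # ['s-1', 's-2']
print(ids(without_self))  # []
print(ids(descendant_s))  # ['s-1', 's-2']
```

So the two-step expansion with "self" lands on the same nodes as a single descendant:: step, while dropping "self" would lose the direct children of the context.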