Fix a bug in xml stream parsing where a previously unmatched node causing all subsequent valid matches fail.

jf-tech commented 4 years ago

Recall that for streaming mode, we have two xpaths: one for matching the element, the other (optionally) for adding additional filtering. Imagine the following example, where the xml doc is:

    <ROOT>
        <AAA>
            <CCC>c1</CCC>
            <BBB>b1</BBB>
            <DDD>d1</DDD>
            <BBB>b2<ZZZ z="1">z1</ZZZ></BBB>
            <BBB>b3</BBB>
        </AAA>
        <ZZZ>
            <BBB>b4</BBB>
            <BBB>b5</BBB>
            <CCC>c3</CCC>
        </ZZZ>
    </ROOT>

The stream parser is created as:

    CreateStreamParser(strings.NewReader(s), "/ROOT/*/BBB", "/ROOT/*/BBB[. != 'b3']")

Basically we want the stream parser to return all the BBB nodes whose text aren't b3. By looking at the sample XML, we know it should return: the <BBB> nodes whose texts are b1, b2, b4, and b5.

However, the current code only returns b1 and b2.

The problem lies in the stream element matching inside case xml.StartElement.

Currently the code does this:

case xml.StartElement:
    ...
    ...
    if p.streamElementXPath != nil {
        if p.streamNode == nil {
            if QuerySelector(p.doc, p.streamElementXPath) != nil {
                // Now we assume this node is the stream node candidate.
            }

We originally under the assumption that if the streamElementXPath query returns anything, it must be this node itself; thus if it returns, this node is the stream node candidate.

But it's clearly wrong in this b3 example above. For the node <BBB>b3</BBB> it is first considered as the stream candidate, but later filtering ([. != 'b3']) removes its stream node status, and treats it just like any other non-stream nodes, and keeps it in the node tree. But the problem is, by keeping it in the tree, any future XML element start will always "matches" streamElementXPath. So in the example above, the node <ZZZ> is now erroneously considered stream node, and any child nodes are not even tested for streaming anymore.

There are two fixes:

1) In xml.StartElement stream match, instead of just doing QuerySelector(...) != nil check, we need to issue a QuerySelectorAll(...) call and examine all the returned nodes, if the current node is one of them, then this current node is considered stream candidate.

2) Simpler: if a stream candidate is later filtered out inside case xml.EndElement handling, then simply remove it from the node tree, thus preventing future erroneous matches.

Fix 1) seems an overkill: if a stream candidate gets filtered out, it's hard to imagine caller would like to interact with it in any capacity. Also imagine if any XML doc has lots of <BBB>b3</BBB> nodes, the memory growth would be really bad. All things considered, chose fix 2).

coveralls commented 4 years ago

Coverage increased (+0.03%) to 91.59% when pulling e26cec513303f0794d94c9b80cee131eb925908a on jf-tech:jf-tech/streamparser-fix into 5648b2f39e8d5d3fc903c45a4f1274829df71821 on antchfx:master.

jf-tech commented 4 years ago

@zhengchun just make sure it's on your radar with this bug fix.

zhengchun commented 4 years ago

I released a new version on https://github.com/antchfx/xmlquery/releases/tag/v1.3.1

antchfx / xmlquery

Fix a bug in xml stream parsing where a previously unmatched node causing all subsequent valid matches fail. #40