Add streaming XML loading/parsing support

jf-tech commented 4 years ago

BIG NOTE: this is just a proposal and proof of concept. Once consensus is reached, we can/will make the PR a lot more formal and rigorous.

We have use cases where we need to load/parse large XML documents in streaming fashion to avoid memory constraints. Example XML:

<AAA>
   <BBB>
      ....
      ....
   </BBB>
   <BBB>
      ....
      ....
   </BBB>
   ....
</AAA>

Currently we duplicate the node.go/parse() function in our own private repo and implement the stream parsing. As a result, we had to duplicate Node struct, as well as NodeNavigator impl.

Thought if you want to take a look and consider if it's a use case others might benefit from?

Limitations:

The streaming target xpath support is very very basic: /AAA/BBB only. I always wanted to stay away from using string, but instead a compiled *xpath.Expr. That way most of xpath features/functionalities (minus those XPATH directives involving position() etc) could be supported. The difficulty here is when I get an element *Node, how to evaluate it against the caller specified *xpath.Expr to see if the node "fits" the xpath? I could traverse the element node all the way to the top and then do select with the *xpath.Expr against the top node, but when the expr returns how do I know the returned node is the element node I'm interested in?
Due to the nature and purpose of streaming, every time when a target node is read and done processing by the caller, the very next StreamParser.Read() call would remove the previous target node from the DOM tree (otherwise, the DOM tree would eventually grow to the full extent of the XML document, which defeats the purpose of streaming). Now that's a bit unorthodox if you might say. Want to know what's your opinion here. The behavior described here fits our needs perfectly though.

@zhengchun ^^

PS we also have similar needs in jsonquery where we want to stream read a very large json document.

coveralls commented 4 years ago

Coverage decreased (-0.2%) to 85.742% when pulling 443d4e33562eacd386b84de0e9c151c2916575a5 on jf-tech:jf-tech/streaming into c43a9ea44a41141abd40cc2125be9558b5ae8763 on antchfx:master.

zhengchun commented 4 years ago

Hello, first, thanks for your support. I've been a little busy on days. Add streaming feature to reading a large XML document is a good idea. Is it only supported the basic XPath query? like "/aaa/bbb", looks it not support logical operation query. I need some time to consider whether we can support XPath.Expr in streaming parser. Thanks again!

jf-tech commented 4 years ago

Hello, first, thanks for your support. I've been a little busy on days. Add streaming feature to reading a large XML document is a good idea. Is it only supported the basic XPath query? like "/aaa/bbb", looks it not support logical operation query. I need some time to consider whether we can support XPath.Expr in streaming parser. Thanks again!

Yes, the proof of concept PR only supports /aaa/bbb style simple xpath. As I mentioned above, I'd really like to utilize *xpath.Expr to determine whether a node is matched and considered the streaming target node. Now after a bit more thinking, perhaps my concern of using *xpath.Expr mentioned above isn't really well-founded. Consider this following example:

<aaa>
   <bbb>
      ....
   </bbb>
   <bbb>
      ....
   </bbb>
</aaa>

When the StreamParser.Read() is called for the first time, the first <bbb> node is returned; and when StreamParser.Read() is called again, before the second <bbb> is loaded, parsed, and returned, the first <bbb> node is deleted from the DOM tree. Given this implementation, I think we can in fact in func isStreamTarget(...) do the follow:

   1) given a `*Node`, traverse all the way up to root `ElementNode`.
   2) query the `*xpath.Expr` against the root, if it returns a node (it should and will only return one node
        max), then the given node is a target node; if the query returns nothing, then this node isn't a match.

(of course if the *xpath.Expr uses positional directives such as position(), then it wouldn't work, nor should we expect it to be)

This aforementioned logic works because we always delete target nodes from previous StreamParser.Read() calls, thus ensuring the *xpath.Expr query result is unique, if anything matches.

Updated the PR with a commit contains the the logic above. In fact I figured there is no need to do traverse-up at all: we can simply do the xpath.Expr query from the DocumentNode p.doc every single time.

(BTW, I worked on your xpath lib previously with adding reverse(): https://github.com/antchfx/xpath/pull/46. I'm just now hiding under a different github account 😊)

zhengchun commented 4 years ago

I saw your new method and tested some cases, it looks good, it has supported the basic logical operations, like //aaa,//a[@id=''], position()=x , 😄 . Could you return an error object for CreateStreamParser(), panic() will trash application.

jf-tech commented 4 years ago

Great, if you are happy with the basic concept and implementation of the proposal, I will close this PR and start submitting official one (or ones to break it into smaller chunks) with more rigor and testing. Yes will not use panic. :)

jf-tech commented 4 years ago

@zhengchun oh, let me you if you prefer a single PR contains all changes, or you prefer or are okay with smaller incremental PRs. Typically I'd like to have small incremental (but still functional, fully test covered and regression free) PRs instead of a big one. But this is your project, so please advise.

jf-tech commented 4 years ago

And last but not least, I'd make sure all the new code is thoroughly tested, which means I probably need to introduce two new 3rd party dependencies:

https://github.com/stretchr/testify Testify will be used for all kinds of assertions in testing.
https://github.com/bradleyjkemp/cupaloy Cupaloy will be used for snapshot generation and verification. Trees are complex specially the *Node contains so many links and pointers. Having test result snapshotted and verified ease the pain of individually verifying all node fields' integrity.

What do you think?

zhengchun commented 4 years ago

Okay 😄

jf-tech commented 4 years ago

Close the proof-of-concept PR per agreement above. Small/incremental PRs are coming.

jf-tech commented 4 years ago

Reopening because I realized a bug in my implementation that needs some discussion with @zhengchun .

My current implementation is roughly this: during the handling of xml.StartElement, we'll test the element node against the caller specified xpath.Expr. If it matches, then we'll tag this element node as target node; and when its matching xml.EndElement is done processing (detected by using streamElementCounter), which means this element node is fully processed, we'll return it to the caller.

This implementation doesn't work with the following xpath:

/AAA/BBB[. != 'b1'] for <AAA><BBB>b1</BBB></AAA>. We know the <BBB> node should not be returned caller because its text is b1 and the streaming xpath specifically asks != 'b1'. This bug exists because when we are handling the BBB's xml.StartElement, none of its child nodes are read in and processed yet, which would lead to the xpath evaluation's premature success.

The fundamental problem in the current proposed implementation is that we do stream xpath evaluation at xml.StartElement time, at which moment the element node is practically empty and not finished processing yet.

Now you might say, how about "let's instead do stream xpath evaluation at xml.EndElement?" Well, some issues:

1) perf would be bad, current proposal will not do any xpath evaluation inside/underneath the target node; but if we do xpath evaluation at xml.EndElement, the xpath evaluation is guaranteed to be executed on every single element end of the entire xml doc. That's pretty costly.

2) during xml.EndElement, if the xpath evaluation is true, we don't know if this element node is the target node or not. Check out this example:

   <AAA>
       <BBB>b1<CCC/></BBB>
   </AAA>

And streaming xpath is /AAA/BBB[. = 'b1'].

Now when we handle `</CCC>` `xml.EndElement`, the streaming xpath evaluation from the root would succeed. Does that mean we can return <CCC> as target node? Absolutely not. The question is now: **when stream xpath evaluation succeeds, how do we know we can return the element node as target node.**

One algorithm would be: for an `xml.EndElement`, if the stream xpath eval succeeds, we need to do another stream xpath eval check with the current ending element node removed. If that check still succeeds, then this current element node isn't the target node. Otherwise, then the current ending element node is the target node.

That algo seems to be the correct one. But I haven't figured out a way to actually get the corresponding element node during `xml.EndElement`. Any idea?

Thoughts? Or we're over-engineering? Originally my proposal only supports extremely simple "/AAA/BBB" style stream xpath. With the supporting of full fledge xpath.Expr, we are adding a lot complexity into this. Remember, we might even see xpath.Expr like /AAA/BBB[....] | /AAA/CCC[....]. Oh my!

Or we can simply stick with this proposed implementation and call out limitation in the documentation.

jf-tech commented 4 years ago

I just realized there is another way to solve this problem, without loosing much perf and without adding too much complexity.

basically we can introduce an optional xpath.

For simple case: CreateStreamParser(..., "/AAA/BBB"), the current implementation works.

But for the tricky case: where we eventually want: "/AAA/BBB[. != 'b1']", we ask caller to actually pass in two xpaths:

CreateStreamParser(..., "/AAA/BBB", "/AAA/BBB[. != 'b1']")

The code will use the first xpath to do evaluation in xml.StartElement to basically nail down the target element node. Thus the first xpath should always be a simple xpath with only elemental walkthrough. The second xpath, should it be needed, can contains more fine-grained filtering. Our code will do a second xpath evaluation at xml.EndElement when the element counter reaches 0.

jf-tech commented 4 years ago

Okay, I'm happy with the two xpath solution proposed above. Closing this prototype PR again.

antchfx / xmlquery

Add streaming XML loading/parsing support #32