antchfx / xmlquery

xmlquery is Golang XPath package for XML query.
https://github.com/antchfx/xpath
MIT License
444 stars 89 forks source link

add cdata node type. #31

Closed james-lawrence closed 4 years ago

james-lawrence commented 4 years ago

fixes #30.

updated travis.yml to not test against golang 1.7 and 1.8 due to group cache pulling in math/bits which were not backported. (imo should just remove groupcache and add an internal/lru package to supply the 100 lines or so of code for the cache. but out of scope for this PR)

coveralls commented 4 years ago

Coverage Status

Coverage decreased (-0.06%) to 85.973% when pulling e93f949755b74664dc5198d923c1c1a8cede4b74 on james-lawrence:enh/add-cdata-node into 96baba5f1e7e8d1e44efcbf1bb0827f6c8181232 on antchfx:master.

jf-tech commented 4 years ago

@zhengchun @james-lawrence I have a question about this PR:

With this change, xmlquery.parse() no longer creates a node with type == TextNode anymore. As far as I can tell, only place we create a node with type == TextNode is during the xpath navigation:

func getCurrentNode(it *xpath.NodeIterator) *Node {
    n := it.Current().(*NodeNavigator)
    if n.NodeType() == xpath.AttributeNode {
        childNode := &Node{
            Type: TextNode,
            Data: n.Value(),
        }
        return &Node{
            Parent:     n.curr,
            Type:       AttributeNode,
            Data:       n.LocalName(),
            FirstChild: childNode,
            LastChild:  childNode,
        }
    }
    return n.curr
}

Want to confirm this is the right construct of using TextNode. Basically I'm questioning why we needed to introduce CharDataNode type. It is not necessary to have absolutely matching names for the types. Calling it TextNode to represent a CDATA section is fine in my opinion. This way, we can use TextNode to represent anything related to raw text, including raw text in element nodes, or raw text for attributes. That said, I'm not too obsessed with this TextNode name: I'm perfectly okay we switch to use CharDataNode (given the use of CharData in std lib xml). If that's what we want, then we should get rid of TextNode. Basically I'm asking: we should have just one of them, either CharDataNode or TextNode, not both (personally like TextNode a bit more since it's more generic, plus it matches well with xpath.NodeType very well). Agree, disagree?

zhengchun commented 4 years ago

CharDataNode and TextNode is the two node type for text, they are very different. CharDataNode can include the specified characters like <,>,& and more others, I can include HTML code in the XML via uses <![CDATA[ ]]>https://en.wikipedia.org/wiki/CDATA#CDATA_sections_in_XML . It is useful when call OutputXML() function to output XML. Sometime we need print all XML body even include <![CDATA[ ]]>. If we missing CharDataNode we will don't know this node text whether has include HTML character and output the XML will been broken.

<style type="text/css">
/*<![CDATA[*/
body { background-image: url("marble.png?width=300&height=300") }     
/*]]>*/
</style>
zhengchun commented 4 years ago

I checked code again, I found there is not implement <![CDATA[ ]]> on OutputXML() function. This is a Bug.

jf-tech commented 4 years ago

@zhengchun then what is TextNode is used for? Apparently in xmlquery library, TextNode isn't used at all during parsing, so basically we will never get a *Node tree that has a TextNode in it. The only place TextNode is used is to construct an artificial node to hold attribute value during NodeNavigator traverse. If that's the only use case, why can't we use CharDataNode in this case?

My whole point is: seems like we don't need both TextNode or CharDataNode. I don't care about the naming, either one is fine with me. But having two of them lying around adds confusion and potentially bug (as you discovered).

zhengchun commented 4 years ago

@jf-tech , If package just for query, we don't need have two TextNode and CharDataNode. Sometime we need save the query result to XML file(although it may be infrequent). CharDataNode can flag this node including some specified characters. TextNode is flag all text that don't include <![CDATA[ ]]>. For example:

    s := `<?xml version="1.0"?>
    <script>
 <![CDATA[ They're saying "x < y" & that "z > y" so I guess that means that z > x ]]>
</script >
    `
    doc, _:= xmlquery.Parse(strings.NewReader(s))
    fmt.Println(doc.OutputXML(false))
        saveXML(doc.OutputXML(false))

If missing CDATA, the output like below:

<?xml version="1.0"?><script>They're saying "x < y" & that "z > y" so I guess that means that z > x</script>

Apparently this XML has broken, if we save this output to the XML file and read it in the other tool.

zhengchun commented 4 years ago

@jf-tech , I add a bug report about <![CDATA[ ]]> at https://github.com/antchfx/xmlquery/issues/36

jf-tech commented 4 years ago

Okay, I'm fine now, thx for the discussion.

While my primary objection was this: given xmlquery.parse() never really creates a TextNode, and a TextNode is only created during xpath navigation getCurrentNode() for attribute node's value, why did we ever need the type. But now I realize this: given we need this OutputXML functionality, then we do want different ways to serialize out an element node's text data (in CDATA format), vs an element node's attribute value (in a direct quoted text format). If we all use CharDataNode for attribute value node in getCurrentNode() it would cause OutputXML spit out the attribute value looking completely goofy.

I'm good now. Thx!