antchfx / htmlquery

htmlquery is golang XPath package for HTML query.
https://github.com/antchfx/xpath
MIT License
723 stars 73 forks source link

Xpath with | character does not return element with duplicated text #25

Closed WithoutPants closed 4 years ago

WithoutPants commented 4 years ago

Using a modified version of the html sample in query_test.go:

const htmlQuerySample = `<!DOCTYPE html><html lang="en-US">
<head>
<title>Hello,World!</title>
</head>
<body>
<div class="container">
<header>
    <!-- Logo -->
   <h1>City Gallery</h1>
</header>  
<nav>
  <ul>
    <li><a href="/London">London</a></li>
    <li><a href="/Paris">Paris</a></li>
    <li><a href="/Tokyo">Tokyo</a></li>
    <li><a href="/Tokyo2">Tokyo</a></li>
  </ul>
</nav>
<article>
  <h1>London</h1>
  <img src="pic_mountain.jpg" alt="Mountain View" style="width:304px;height:228px;">
  <p>London is the capital city of England. It is the most populous city in the  United Kingdom, with a metropolitan area of over 13 million inhabitants.</p>
  <p>Standing on the River Thames, London has been a major settlement for two millennia, its history going back to its founding by the Romans, who named it Londinium.</p>
</article>
<footer>Copyright &copy; W3Schools.com</footer>
</div>
</body>
</html>
`

(added extra Tokyo li element)

Using xpath "//nav/ul/li/a/text()|//div/ul/li/a/text()" (the second condition after the | is just an example and not meant to find anything additional), I would expect four elements to be returned - London, Paris, Tokyo, Tokyo. Instead, it returns three, dropping the second Tokyo element. Removing the | character and second condition returns four elements. Similarly, if I remove the text() part, then it also returns four elements.

Small unit test illustrating the issue:

func TestXpathQuery(t *testing.T) {
    reader := strings.NewReader(htmlQuerySample)
    doc, err := htmlquery.Parse(reader)

    if err != nil {
        t.Errorf("Error loading document: %s", err.Error())
        return
    }

    doTestXpath(t, doc, "//nav/ul/li/a/text()")
    doTestXpath(t, doc, "//nav/ul/li/a|//div/ul/li/a")
    doTestXpath(t, doc, "//nav/ul/li/a/text()|//div/ul/li/a/text()")
}

func doTestXpath(t *testing.T, doc *html.Node, xpath string) {
    found, err := htmlquery.QueryAll(doc, xpath)

    if err != nil {
        t.Error(err.Error())
    }

    if len(found) != 4 {
        t.Errorf("Xpath: %s - expected 4 elements found %d", xpath, len(found))
    }
}

Gives the following results:

--- FAIL: TestXpathQuery (0.00s)
    ...\xpath_test.go:978: Xpath: //nav/ul/li/a/text()|//div/ul/li/a/text() - expected 4 elements found 3
FAIL
zhengchun commented 4 years ago

htmlquery package will auto remove duplicated text since #20 .

WithoutPants commented 4 years ago

htmlquery package will auto remove duplicated text since #20 .

Except that it doesn't consistently.

Take the source string <sup>24</sup><sup>24</sup><sup>24</sup>

Perform a QueryAll using //sup/text().

Expected output would be:

24
24
24

Actual output is the same - it works as intended.

Now perform QueryAll using //sup/text()|/div/text(). The only difference here is the extra option. The expected output should be the same.

Actual output for this one is:

24

So at the very least there is inconsistency between the behaviour of the two queries, and my expectation would be that it is the second case that is performing incorrectly.

zhengchun commented 4 years ago

This is a bug, I had fixed at https://github.com/antchfx/xpath/commit/a015dcb7b81e05da303d5e5a91ef80688d3b6515 Thanks for your feedback.

WithoutPants commented 4 years ago

Thank you. I can confirm that this has fixed the issue.

WithoutPants commented 4 years ago

When are we likely to see a new release with this fix?