antchfx / htmlquery

htmlquery is a Go XPath package for querying HTML.
https://github.com/antchfx/xpath
MIT License

Xpath failing to find nodes #55

Closed: waldner closed this issue 1 year ago

waldner commented 1 year ago

Here's a simple reproducer:

package main

import (
	"fmt"
	"strings"

	"github.com/antchfx/htmlquery"
)

func main() {
	str := `<html><head/><body>
<div id="main">
  <table class="generictable" width="100%" border="1" cellpadding="0" cellspacing="0">
    <tr><td class="hidden">XXXX</td></tr>
    <tr class="foo" id="222222"><td class="hidden">1234</td></tr>
  </table>
</div>
</body></html>
`

	tree, err := htmlquery.Parse(strings.NewReader(str))
	if err != nil {
		panic(err)
	}

	// does not work (returns no nodes)
	fmt.Println(htmlquery.Find(tree, "//div[@id='main']/table/tr[@class='foo']"))
	// does not work either
	fmt.Println(htmlquery.Find(tree, "//div[@id='main']/table/tr"))
	// does not work either
	fmt.Println(htmlquery.Find(tree, "//table/tr"))
	// works (returns one node)
	fmt.Println(htmlquery.Find(tree, "//tr[@class='foo']"))
}

Output:

[]
[]
[]
[0xc000192620]

This is all pretty basic stuff, and all four XPath expressions successfully match nodes with Python's lxml, for example. Am I doing something wrong?

zhengchun commented 1 year ago

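@waldner Hello, your XPath is correct, but Go's HTML parser package (golang.org/x/net/html, which htmlquery uses) automatically inserts a tbody element into the table element, so tr is no longer a direct child of table. You can see this by printing the parsed tree: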

fmt.Println(htmlquery.OutputHTML(tree, true))

Output:

<table class="generictable" width="100%" border="1" cellpadding="0" cellspacing="0">
          <tbody><tr><td class="hidden">XXXX</td></tr>
          <tr class="foo" id="222222"><td class="hidden">1234</td></tr>
        </tbody></table>

So just add tbody to the path: fmt.Println(htmlquery.Find(tree, "//div[@id='main']/table/tbody/tr[@class='foo']"))

waldner commented 1 year ago

OK, now that you mention the parser, I did some research and found this post, which explains what's going on: https://nikodoko.com/posts/html-table-parsing/ (I'm putting the link here in case someone else coming from Google runs into the same problem).

Thanks!