antchfx / htmlquery

htmlquery is a Go XPath package for querying HTML.
https://github.com/antchfx/xpath
MIT License
723 stars, 73 forks

Access inner text from node with other elements around #48

Closed: scharph closed this issue 2 years ago

scharph commented 2 years ago

I want to access the inner text " bar" from the '<code>' node, which also contains some '<span>' elements:

<html>
  <body>
    <pre>
        <code>
            <span>foo</span>
            <span>:</span> bar <!-- <<<< target -->
            <span>;</span>
        </code>
    </pre>
  </body>
</html>

doc, err := htmlquery.LoadDoc(file)
if err != nil {
    return err
}
node := htmlquery.FindOne(doc, "html/body/pre/code")

fmt.Println(htmlquery.InnerText(node))

Actual result: "foo: bar;"
Expected: "bar"

For some reason the inner text of the spans is also included. How can I prevent this?

I tried the query "//body/pre/code/text()" in the XPath tester at extendsclass.com/xpath-tester, and it returns the expected value.

Any ideas?

zhengchun commented 2 years ago

htmlquery.InnerText returns all of the text inside the target element node, including the text of its child nodes.

Try changing the XPath to html/body/pre/code/span[2]/following-sibling::text()

Another way is to use //pre/code/text() to get all the text nodes and then concatenate them.

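// Select every text node that is a direct child of <code>.
list := htmlquery.Find(doc, "//pre/code/text()")
for _, node := range list {
    // Trim the surrounding whitespace; whitespace-only nodes print as empty lines.
    fmt.Println(strings.TrimSpace(htmlquery.InnerText(node)))
}

scharph commented 2 years ago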

Thank you, that works as expected 👍