antchfx / htmlquery

htmlquery is a Go XPath package for querying HTML.
https://github.com/antchfx/xpath
MIT License

Querying for text() using contains() returns duplicate results #24

Closed: vovchynniko closed this issue 3 years ago

vovchynniko commented 4 years ago

Hello zhengchun,

Thank you for your great library. I'd like to file one issue with text().

Given this simple HTML page:

<!DOCTYPE html>
<html lang="en">
<head>
    <title>Test</title>
</head>
<body>
<div>
    <div class="body">
        <strong>We need this.</strong>
        <strong>Not this.</strong>
    </div>
    <strong>And we definitely don't need this.</strong>
</div>
</body>
</html>

I want to extract the "We need this." text node. Here's my code:

package main

import (
    "fmt"
    "github.com/antchfx/htmlquery"
    "os"
    "strings"
)

func main() {
    file, err := os.Open("example.html")
    if err != nil {
        panic(err)
    }
    defer file.Close()

    root, err := htmlquery.Parse(file)
    if err != nil {
        panic(err)
    }

    nodes := htmlquery.Find(root, `//div[@class="body"]//text()[contains(.,"need")]`)

    for _, node := range nodes {
        fmt.Println(htmlquery.InnerText(node))
        fmt.Println(strings.Repeat("-", 40))
    }
}

Unfortunately, the result is a bit more than I asked for:


        We need this.
        Not this.

----------------------------------------
We need this.
----------------------------------------
We need this.
----------------------------------------

Thank you and stay safe :)

zhengchun commented 4 years ago

Which version of the htmlquery library are you using? I tested this locally and the output is correct.

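package main

import (
    "fmt"
    "strings"

    "github.com/antchfx/htmlquery"
)

func main() {
    // Same HTML as in the report, inlined so the example needs no external file.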
    s := `<!DOCTYPE html>
    <html lang="en">
    <head>
        <title>Test</title>
    </head>
    <body>
    <div>
        <div class="body">
            <strong>We need this.</strong>
            <strong>Not this.</strong>
        </div>
        <strong>And we definitely don't need this.</strong>
    </div>
    </body>
    </html>`
    doc, err := htmlquery.Parse(strings.NewReader(s))
    if err != nil {
        panic(err)
    }

    nodes := htmlquery.Find(doc, `//div[@class="body"]//text()[contains(.,"need")]`)

    for _, node := range nodes {
        fmt.Println(htmlquery.InnerText(node))
        fmt.Println(strings.Repeat("-", 40))
    }
}

$ go run main.go

We need this.
----------------------------------------

Try updating your htmlquery and xpath libraries.