antchfx / htmlquery

htmlquery is golang XPath package for HTML query.
https://github.com/antchfx/xpath
MIT License
739 stars 74 forks source link

'Evaluate' may return incorrect result ? #13

Closed jcamiel closed 3 years ago

jcamiel commented 5 years ago

Hello, First of all, thank you for you work, I really appreciate the quality of the lib. I've a strange result with a simple html file:

<html>
    <body>
        <div class="fruit">Apple</div>
        <div class="fruit">Banana</div>
        <div class="fruit">Lemon</div>
    </body>
</html>

I think the two following XPath expression string((//div)[2]) and string((//div[@class="fruit"])[2]) should give the same result. In my comprehension, the two expression select the second div and string function should return Banana. But the second one returns an empty string :

         html := `<html>
    <body>
        <div class="fruit">Apple</div>
        <div class="fruit">Banana</div>
        <div class="fruit">Lemon</div>
    </body>
    </html>`

    doc, _ := htmlquery.Parse(strings.NewReader(html))

    test1 := `string((//div)[2])`
    expr1, _ := xpath.Compile(test1)
    v1 := expr1.Evaluate(htmlquery.CreateXPathNavigator(doc))
    // v1 = "Banana"

    test2 := `string((//div[@class="fruit"])[2])`
    expr2, _ := xpath.Compile(test2)
    v2 := expr2.Evaluate(htmlquery.CreateXPathNavigator(doc))
    // v2 = ""

Do you have any idea or I'm wrong ?

Thank you

Jc

zhengchun commented 5 years ago

@jcamiel , I updated xpath package to fix this issue. Trying get the latest version of xpath package to solve.

jcamiel commented 5 years ago

Hi,

Thanks you for this really quick fix ! I've tested it and it's ok. Unfortunately, while I tried to debug it, I've found another possible issue (maybe related to the previous?):

<html>
    <body>
            <div class="fruit">Apple</div>
            <div class="color">Red</div>
    </body>
</html>

In this case, the following path string((//div[@class="color"])[1]) should return 'Red' (if I'm not mistaken). I've tested this one with your recent fix but it doesn't seem to work.

    html := `<html>
    <body>
            <div class="fruit">Apple</div>
            <div class="color">Red</div>
    </body>
    </html>`
    doc, _ := htmlquery.Parse(strings.NewReader(html))

    test1 := `string((//div[@class="color"])[1])`
    expr1, _ := xpath.Compile(test1)
    v1 := expr1.Evaluate(htmlquery.CreateXPathNavigator(doc))
    // v1 = "", should be Red ?

Regards,

jc

zhengchun commented 5 years ago

yes, should be Red. This bug still exist :(

zhengchun commented 5 years ago

@jcamiel , try again.

html := `<html><body>
    <div class="fruit">Apple</div>
    <div class="color">Red</div>
    <div>
    <div class="fruit">Pear</div>
    <div class="fruit">Orange</div>
    </div>
</body></html>`

doc, _ := htmlquery.Parse(strings.NewReader(html))
testPath := `//div[@class="color"][1]`
node := htmlquery.FindOne(doc, testPath)
fmt.Println(htmlquery.InnerText(node))
// Output: Red
testPath = `//div[@class="fruit"][1]`
node = htmlquery.FindOne(doc, testPath)
fmt.Println(htmlquery.InnerText(node))
// Output: Apple
testPath = `//div[1]`
for _, node := range htmlquery.Find(doc, testPath) {
    fmt.Println(htmlquery.InnerText(node))
    // Apple
    // Orange
}
testPath = `//div[2]`
for _, node := range htmlquery.Find(doc, testPath) {
    fmt.Println(htmlquery.InnerText(node))
    // Red
    // Pear
}
jcamiel commented 5 years ago

Hi @zhengchun I've tested your last commit and it's ok but now I've still have another problem :( ! Maybe related still related to the [%n] operator I don't know :

func TestEvalXPathHTML(t *testing.T) {
    html := `<html><body>
                <div class="fruit">
                    <div class="color">Red</div>
                </div>
                <div class="fruit">
                    <div class="color">Yellow</div>
                </div>
        </body></html>`

    doc, _ := htmlquery.Parse(strings.NewReader(html))
    test := `count((//div[@class="fruit"])[1]//div[@class="color"])`
    expr, _ := xpath.Compile(test)
    v := expr.Evaluate(htmlquery.CreateXPathNavigator(doc))
    // v = 2, expected v = 1
}

I really appreciate your fixing and hope you don't mind me sending your this stuff. I've a whole html page that I'm trying to analyse for my work and I try to isolate to a minimal sample the xpath queries I think are not ok,

Regards,

Jc

zhengchun commented 5 years ago

I have trying [n] but found it is difficult. I can solved your //div[@class="fruit"])[1]//div[@class="color"]) but can't solve below example.

    html := `<html><body>
    <div class="fruit">Apple</div>
    <div class="color">Red</div>
    <div>
    <div class="fruit">Pear</div>
    <div class="fruit">Orange</div>
    </div>
    <div class="fruit">Peach</div>
</body></html>`
testPath := `//div[@class="fruit"][2]`

I have marked this issue.

jcamiel commented 5 years ago

Ok thank you very much, let me know if you have some code to test, I'll test it with my page, Best regards, Jc

zhengchun commented 3 years ago

fixed on xpath@v1.2.0