antchfx / htmlquery

htmlquery is a Go XPath package for querying HTML.
https://github.com/antchfx/xpath
MIT License
738 stars, 74 forks

Get a panic when parse html page #53

Closed aaronchen2k closed 2 years ago

aaronchen2k commented 2 years ago

I get a fatal panic when calling htmlquery.QueryAll on a page fetched from https://baidu.com or read from the local file baidu.html, using the script below: https://github.com/aaronchen2k/deeptest/blob/main/cmd/test/htmlquery_test.go

It works well if I use an HTML string like the one in: https://github.com/aaronchen2k/deeptest/blob/main/internal/server/modules/v1/helper/mock/html.go

Thanks!

zhengchun commented 2 years ago

Maybe the HTTP response is gzip-compressed. You should decompress it before parsing.

aaronchen2k commented 2 years ago

In the test script test/htmlquery_test.go, I read the HTML from a local file, and it still causes a fatal panic. Please help check, thanks.

html := fileUtils.ReadFile("baidu.html")

zhengchun commented 2 years ago

The local baidu.html file parses fine with my local test code.

Test code below:

    package main

    import (
        "fmt"
        "os"

        "github.com/antchfx/htmlquery"
    )

    func main() {
        f, err := os.Open("./baidu.html")
        if err != nil {
            panic(err)
        }
        defer f.Close()
        doc, err := htmlquery.Parse(f)
        if err != nil {
            panic(err)
        }
        // "//form[@id=1]/input[@id=\"kw\"]/@class" is invalid; changed to @id="1".
        expression := `//form[@id="1"]/input[@id="kw"]/@class`
        list, err := htmlquery.QueryAll(doc, expression)
        if err != nil {
            panic(err)
        }
        fmt.Println(len(list))
    }

aaronchen2k commented 2 years ago

Thank you for the feedback! I updated the code and there is no error now, but why is the list always nil?

aaronchen2k commented 2 years ago

up

zhengchun commented 2 years ago

Your XPath query is not correct. The local HTML file has no form with an id="1" attribute. Use //form[@id="form"]/input[@id="kw"]/@class instead. You can use Chrome DevTools (Inspect) or https://www.freeformatter.com/xpath-tester.html to test your XPath.