antchfx / htmlquery

htmlquery is a Go XPath package for querying HTML.
https://github.com/antchfx/xpath
MIT License

Cannot get elements inside a noscript tag #74

Closed feeops closed 2 months ago

feeops commented 2 months ago

demo code

package main

import (
    "fmt"
    "github.com/antchfx/htmlquery"
    "strings"
)

func main() {
    s := `<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
    <head>
        <meta name="robots" content="noindex">
        <meta content="always" name="referrer">
        <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
        <meta http-equiv="Pragma" content="no-cache">
        <meta http-equiv="Cache-Control" content="no-cache">
        <meta http-equiv="Expires" content="0">
        <title>title</title>
        <script type="text/javascript">
            function dm()
            {
                var u = "https://www.baidu.com";
                location.replace(u);
            }
            setTimeout(dm, 80);
        </script>
    <noscript><meta http-equiv="refresh" content="0;url=https://www.baidu.com"></noscript></head>
<body></body>
</html>`
    doc, err := htmlquery.Parse(strings.NewReader(s))
    if err != nil {
        panic(err)
    }
    list := htmlquery.Find(doc, "//meta")

    for _, n := range list {
        fmt.Println(n.Data, n.Attr) // print the tag name and its attributes
    }

}

I cannot get the meta element that sits inside the noscript tag.

zhengchun commented 2 months ago

It looks like the html package treats the noscript body as plain text.

n := htmlquery.FindOne(doc, "//noscript")
fmt.Println(n.FirstChild.Type) // got `TextNode` type, expected `ElementNode` type

StJudeWasHere commented 2 months ago

I've been having this issue as well, but the solution turned out to be easier than I expected. I'm leaving it here as a reference in case anyone else is also looking into it.

It seems the html package can parse HTML with options, and setting ParseOptionEnableScripting to false does the trick: the noscript content is then parsed into ElementNode children instead of a single text node. In the demo code, htmlquery.Parse can be replaced with html.ParseWithOptions.

package main

import (
    "fmt"
    "github.com/antchfx/htmlquery"
    "golang.org/x/net/html"
    "strings"
)

func main() {
    s := `<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
    <head>
        <meta name="robots" content="noindex">
        <meta content="always" name="referrer">
        <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
        <meta http-equiv="Pragma" content="no-cache">
        <meta http-equiv="Cache-Control" content="no-cache">
        <meta http-equiv="Expires" content="0">
        <title>title</title>
        <script type="text/javascript">
            function dm()
            {
                var u = "https://www.baidu.com";
                location.replace(u);
            }
            setTimeout(dm, 80);
        </script>
    <noscript><meta http-equiv="refresh" content="0;url=https://www.baidu.com"></noscript></head>
<body></body>
</html>`
    doc, err := html.ParseWithOptions(strings.NewReader(s), html.ParseOptionEnableScripting(false))
    if err != nil {
        panic(err)
    }
    list := htmlquery.Find(doc, "//meta")

    for _, n := range list {
        fmt.Println(n.Data, n.Attr) // print the tag name and its attributes
    }
}