Closed (closed by feeops 2 months ago)
It looks like the html package treats the body of the noscript element as plain text.
n := htmlquery.FindOne(doc, "//noscript")
fmt.Println(n.FirstChild.Type) // got `TextNode` type, expected `ElementNode` type
I've been having this issue as well, but the solution turned out to be easier than I expected. I'm leaving it here as a reference in case anyone else is also looking into it.
It seems the html package can parse HTML with options, and setting ParseOptionEnableScripting to false does the trick: the node is returned as an ElementNode. In the demo code, the call to htmlquery.Parse can be replaced with html.ParseWithOptions.
package main

import (
	"fmt"
	"strings"

	"github.com/antchfx/htmlquery"
	"golang.org/x/net/html"
)

func main() {
	s := `<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta name="robots" content="noindex">
<meta content="always" name="referrer">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta http-equiv="Pragma" content="no-cache">
<meta http-equiv="Cache-Control" content="no-cache">
<meta http-equiv="Expires" content="0">
<title>title</title>
<script type="text/javascript">
function dm()
{
var u = "https://www.baidu.com";
location.replace(u);
}
setTimeout(dm, 80);
</script>
<noscript><meta http-equiv="refresh" content="0;url=https://www.baidu.com"></noscript></head>
<body></body>
</html>`
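	// Parse with scripting disabled so the children of <noscript> become element nodes.
	doc, err := html.ParseWithOptions(strings.NewReader(s), html.ParseOptionEnableScripting(false))
	if err != nil {
		panic(err)
	}
	// The query now also matches the <meta> inside <noscript>.
	list := htmlquery.Find(doc, "//meta")
	for _, n := range list {
		fmt.Println(n.Data, n.Attr) // prints the tag name and its attributes
	}
}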
demo code
I cannot get the meta data inside the noscript tag.