Open nathan-osman opened 7 years ago
After further investigation, the problem appeared to originate in Cascadia:
package main

import (
	"fmt"
	"strings"

	"github.com/andybalholm/cascadia"
	"golang.org/x/net/html"
)

const data = `<noscript><a href="http://example.org">click</a></noscript>`

func main() {
	n, err := html.Parse(strings.NewReader(data))
	if err != nil {
		fmt.Println(err)
		return
	}
	s, err := cascadia.Compile("noscript a")
	if err != nil {
		fmt.Println(err)
		return // s is unusable if the selector failed to compile
	}
	// Expected 1 match, but this prints 0.
	fmt.Println(len(s.MatchAll(n)))
}
Before I could file a bug there, however, I came across this: https://github.com/andybalholm/cascadia/issues/14
"The net/html parser parses the document as if javascript were enabled. Because of that, the contents of noscript elements are just a single text node, not parsed HTML elements."
Now it looks like the bug exists in the golang.org/x/net/html package. Indeed, there is an open bug there for this very problem: https://github.com/golang/go/issues/16318
Sadly, it hasn't been fixed yet. :cry:
Hello Nathan,
Thanks for looking into this. It makes sense that this is at the HTML parser level; it would be nice if the parser provided an option to turn JavaScript on or off for parsing. I'll keep the issue open until some decision is made in the parser.
Martin
just noticed the same issue :)
For those looking for a workaround, re-parsing the content of the noscript tag seems to do the trick.
s.Find("noscript").SetHtml(s.Find("noscript").Text())
@machinae cool thanks i will try it :) do you have an example which i can run?
s in my example is any *goquery.Selection. Just add that line after loading the document.
package main

import (
	"fmt"
	"strings"

	"github.com/PuerkitoBio/goquery"
)

const data = `<noscript><a href="http://example.org">click this link</a></noscript>`

func main() {
	d, err := goquery.NewDocumentFromReader(strings.NewReader(data))
	if err != nil {
		fmt.Println(err)
		return
	}
	d.Find("noscript").SetHtml(d.Find("noscript").Text())
	a, ok := d.Find("noscript a").Attr("href")
	fmt.Printf("URL: '%s', %t\n", a, ok)
}
@machinae wouldn't this set the contents of the first noscript as the text of all noscript tags combined? I would think getting the instance of the tag would be safer? (not tested)
d.Find("noscript").Each(func(i int, s *goquery.Selection) {
	s.ReplaceWithHtml(s.Text())
})
(I don't use goquery so the above is just a guess)
I resolved it with the code below:
root := doc.Selection
root.Find(`noscript`).Each(func(i int, selection *goquery.Selection) {
	selection.SetHtml(selection.Text())
})
Looks like there was a partial fix in the referenced issue, i.e. ParseOptionEnableScripting(bool), which supports disabling script emulation mode. According to the last comment on that issue, though, it only works when noscript is inside the head.
Consider the following program:
The expected output is:
But instead the output is:
Changing noscript to div in both the document and selector causes the expected output, so the problem seems to affect only <noscript> elements.