Closed biox closed 1 year ago
Hi! Thanks for reporting this. I'm not sure I quite understand the issue. Just to check, https://trash.j3s.sh/bad-feed.xml contains invalid XML, because it contains an HTML-escaped field. It seems to me that this is correct behaviour; the feed is malformed, so the library returns an error.
If you would like to modify the response body, that should still be possible using FetchByFunc. You could have a FetchFunc that uses the HTTP client to request the resource, reads the response body to correct the error, then stores the corrected response body in the response by overriding response.Body with a bytes.Reader wrapped in an io.NopCloser.
makes sense! i'm going to do what you suggested and hack around it in my app - quoting my friend lobo:
yeah, XML only has named character escapes for quot, apos, amp, gt and lt (a.k.a. the characters that have syntactical meaning), I don't think there would be any security risks on unescaping other named entities, but it will (obviously) have the weight of having to embed the tables to do so, etc.
the idea was less about complying exactly with the XML spec & more about dealing with feeds that have inadvertently included escaped HTML. i definitely won't be chasing down every person with a malformed feed!
but knowing that i can do this myself with FetchByFunc helps a lot, thanks so much for that context! it's kind of the best of both worlds - i can pre-process feeds using unescapeHTML, and you can comply with the XML spec by default :D
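A narrower variant of the pre-processing idea, following lobo's point that XML only defines five named entities: instead of unescaping the whole document, rewrite only the HTML-only named entities as numeric character references and leave quot, apos, amp, gt and lt alone, so escaped markup stays escaped. This is a minimal sketch using only the Go standard library; fixHTMLEntities is a hypothetical helper name, not part of the feed library discussed in this thread.

```go
package main

import (
	"fmt"
	"html"
	"regexp"
	"strings"
)

// namedEntity matches named references such as &ndash; or &amp;.
var namedEntity = regexp.MustCompile(`&[A-Za-z][A-Za-z0-9]*;`)

// xmlPredefined lists the five named entities that XML itself defines.
var xmlPredefined = map[string]bool{
	"&quot;": true, "&apos;": true, "&amp;": true, "&gt;": true, "&lt;": true,
}

// fixHTMLEntities (hypothetical helper) rewrites HTML-only named entities as
// numeric character references and leaves everything else untouched.
func fixHTMLEntities(s string) string {
	return namedEntity.ReplaceAllStringFunc(s, func(ent string) string {
		if xmlPredefined[ent] {
			return ent // already valid XML
		}
		decoded := html.UnescapeString(ent)
		if decoded == ent {
			return ent // unknown name; leave it for the XML parser to report
		}
		// Some HTML entities decode to more than one code point.
		var b strings.Builder
		for _, r := range decoded {
			fmt.Fprintf(&b, "&#%d;", r)
		}
		return b.String()
	})
}

func main() {
	fmt.Println(fixHTMLEntities("<title>a &ndash; b &amp; c</title>"))
	// prints: <title>a &#8211; b &amp; c</title>
}
```

Inside a FetchFunc, this could be applied to the body string in place of a whole-document unescape. One caveat: the regular expression does not know about CDATA sections, so entity-like text inside them would also be rewritten.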
Cool, I'm glad we have a way forward. If fixing this with a FetchFunc doesn't work, or if you find any other issues, please do open more Issues.
for anyone else who needs to do this, here's how I did it:
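import (
	"bytes"
	"html"
	"io"
	"net/http"
)

var fetchFunc = func(url string) (*http.Response, error) {
	client := http.DefaultClient
	resp, err := client.Get(url)
	if err != nil {
		return nil, err
	}
	// Close the original body on every path, including a failed read.
	defer resp.Body.Close()
	bodyBytes, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}
	// Unescape HTML entities so the XML parser accepts the feed.
	// Note: this also unescapes &amp; and &lt;, which can break otherwise valid XML.
	t := html.UnescapeString(string(bodyBytes))
	resp.Body = io.NopCloser(bytes.NewReader([]byte(t)))
	return resp, nil
}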
Looks good! It's worth checking, but it may be more efficient to use a strings.Reader instead of converting the string output to a byte slice. In other words, replace the penultimate line with:
resp.Body = io.NopCloser(strings.NewReader(t))
what are your thoughts on making the xml Decoder less strict? see https://pkg.go.dev/encoding/xml#Decoder
there are tons of blogs with these common mistakes in-place, and i don't think i can hack around the strict xml decoder without doing a bunch of work manipulating their response (but hoping i don't bust valid things)
or perhaps exposing an option to make xml parsing less strict, without doing it by default?
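For reference, this is roughly what a more forgiving decoder looks like with the standard library alone. The encoding/xml Decoder has an Entity map, and the package ships xml.HTMLEntity for the standard HTML names, so a decoder can accept &ndash; and friends without pre-processing the body. This is only a sketch: the item type and decodeTitle are hypothetical names, and whether the feed library could pass such a decoder through its existing API is not established in this thread.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// item is a hypothetical stand-in for a feed element.
type item struct {
	Title string `xml:"title"`
}

// decodeTitle parses doc and returns its title. When lenient is true, the
// decoder also resolves the standard HTML named entities.
func decodeTitle(doc string, lenient bool) (string, error) {
	d := xml.NewDecoder(strings.NewReader(doc))
	if lenient {
		// xml.HTMLEntity maps names such as "ndash" to their characters.
		d.Entity = xml.HTMLEntity
	}
	var it item
	if err := d.Decode(&it); err != nil {
		return "", err
	}
	return it.Title, nil
}

func main() {
	doc := "<item><title>a &ndash; b</title></item>"

	_, err := decodeTitle(doc, false)
	fmt.Println("strict:", err) // a syntax error about the unknown entity

	title, err := decodeTitle(doc, true)
	fmt.Println("lenient:", title, err)
}
```

Setting only Entity keeps the rest of the parsing strict. The encoding/xml documentation also describes setting Strict to false and AutoClose to xml.HTMLAutoClose for HTML-like input, which is more permissive and more likely to hide real feed errors.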
Sorry, I forgot about this. I'm struggling to think how to make this work with the existing API. If I were making this library from scratch, I'd take a different approach, but I'm reluctant to make major API changes now. If it's absolutely necessary to support incorrect feeds, it may be necessary to fork/patch the library. Sorry I can't help more.
no worries! i appreciate it.
hi! i wrote https://vore.website, which uses this library internally to fetch rss/atom feeds.
i ran into an issue recently where certain feeds containing escaped HTML cause the following failure:
panic: XML syntax error on line 4: invalid character entity &ndash;
here's a minimal reproducible example:
note that this is triggered by the following XML:
i'm wondering if it might make sense to unescape the HTML prior to processing to avoid this? unfortunately i don't think that i can do that kind of pre-processing using FetchByFunc, because i need to modify the returned Body.