escaped HTML within XML causes feed parse failure

biox commented 1 year ago

hi! i wrote https://vore.website, which uses this library internally to fetch rss/atom feeds.

i ran into an issue recently where certain feeds containing escaped HTML causes the following failure: panic: XML syntax error on line 4: invalid character entity –

here's a minimal reproducible example:

package main

import (
    "github.com/SlyMarbo/rss"
)

func main() {
    _, err := rss.Fetch("https://trash.j3s.sh/bad-feed.xml")
    if err != nil {
        panic(err)
    }
}

note that this is triggered by the following XML:

    <title>&ndash; feed with html escaped stuff</title>

i'm wondering if it might make sense to unescape the HTML prior to processing to avoid this? unfortunately i don't think that i can do that kind of pre-processing using FetchByFunc, because i need to modify the returned Body.

SlyMarbo commented 1 year ago

Hi! Thanks for reporting this. I'm not sure I quite understand the issue. Just to check, https://trash.j3s.sh/bad-feed.xml contains invalid XML, because it contains an HTML-escaped field. It seems to me that this is correct behaviour; the feed is malformed, so the library returns an error.

If you would like to modify the response body, that should still be possible using FetchByFunc. You could have a FetchFunc that uses the HTTP client to request the resource, read the response body to correct the error, then store the corrected response body in the response by overriding response.Body with a bytes.Reader wrapped in an io.NopCloser.

biox commented 1 year ago

makes sense! i'm going to do what you suggested and hack around it in my app - quoting my friend lobo:

yeah, XML only has named character escapes for quot, apos, amp, gt and lt (a.k.a. the characters that have syntactical meaning), I don't think there would be any security risks on unescaping other named entities, but it will (obviously) have the weight of having to embed the tables to do so, etc.

the idea was less about complying exactly with the XML spec & more about dealing with feeds that have inadvertently included escaped HTML. i definitely won't be chasing down every person with a malformed feed!

but knowing that i can do this myself with FetchByFunc helps a lot, thanks so much for that context! it's kind of the best of both worlds - i can pre-process feeds using unescapeHTML, and you can comply to the XML spec by default :D

SlyMarbo commented 1 year ago

Cool, I'm glad we have a way forward. If fixing this with a FetchFunc doesn't work, or if you find any other issues, please do open more Issues.

biox commented 1 year ago

for anyone else who needs to do this, here's how I did it:


var fetchFunc = func(url string) (*http.Response, error) {
        client := http.DefaultClient
        resp, err := client.Get(url)
        if err != nil {
                return nil, err
        }
        bodyBytes, err := io.ReadAll(resp.Body)  
        if err != nil {  
                return nil, err
        }
        resp.Body.Close()

        t := html.UnescapeString(string(bodyBytes))
        resp.Body = io.NopCloser(bytes.NewReader([]byte(t)))  

        return resp, nil  
}

SlyMarbo commented 1 year ago

Looks good! It's worth checking but it may be more efficient to use a strings.Reader instead of converting the string output to a byte slice. In other words, replace the penultimate line with:

resp.Body = io.NopCloser(strings.NewReader(t))

biox commented 1 year ago

what are your thoughts on making the xml Decoder less strict? see https://pkg.go.dev/encoding/xml#Decoder

there are tons of blogs with these common mistakes in-place, and i don't think i can hack around the strict xml decoder without doing a bunch of work manipulating their response (but hoping i don't bust valid things)

biox commented 1 year ago

or perhaps exposing an option to make xml parsing less strict, without doing it by default?

SlyMarbo commented 1 year ago

Sorry, I forgot about this. I'm struggling to think how to make this work with the existing API. If I were making this library from scratch, I'd take a different approach, but I'm reluctant to make major API changes now. If it's absolutely necessary to support incorrect feeds, it may be necessary to fork/patch the library. Sorry I can't help more.

biox commented 1 year ago

no worries! i appreciate it.

SlyMarbo / rss

escaped HTML within XML causes feed parse failure #83