cheeriojs / cheerio

The fast, flexible, and elegant library for parsing and manipulating HTML and XML.
https://cheerio.js.org
MIT License

Cheerio encodes HTML entities too eagerly #4045

Open atjn opened 3 months ago

atjn commented 3 months ago

Take this simple HTML link:

<a href="https://example.com/?foo=1&bar=1">link</a>

Now run it through this basic script:

import * as cheerio from "cheerio";

const $ = cheerio.load(`<a href="https://example.com/?foo=1&bar=1">link</a>`);

console.log($.html());

The output is (I manually removed the extra <body> and <html> tags):

<a href="https://example.com/?foo=1&amp;bar=1">link</a>

Notice that the link in the output is incorrect because the & has been replaced with &amp;. If you try to use the output link, it will not set the same query parameters as the original link did.
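To make the difference in query parameters concrete, here is a minimal sketch that takes the serialized attribute text verbatim, without HTML-decoding it first, and parses it with the standard `URL` class. The variable names are only illustrative and are not part of the original report.

```javascript
// Parse both strings with the WHATWG URL class built into Node.js and browsers.
// `copiedVerbatim` is the attribute text exactly as it appears in the serialized
// HTML, i.e. with `&amp;` left undecoded.
const original = new URL("https://example.com/?foo=1&bar=1");
const copiedVerbatim = new URL("https://example.com/?foo=1&amp;bar=1");

// Collect the query parameter names that each URL actually carries.
const originalKeys = [...original.searchParams.keys()];
const copiedKeys = [...copiedVerbatim.searchParams.keys()];

console.log(originalKeys); // [ 'foo', 'bar' ]
console.log(copiedKeys);   // [ 'foo', 'amp;bar' ]
```

The second URL ends up with a parameter named `amp;bar` instead of `bar`, which is the mismatch described above.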

I think we can all agree that when you load an HTML document and then immediately render it without making any changes, the output should be identical and not suddenly have broken links.

I am not sure what needs to be different to support this, but something in the dom-serializer package needs to change. Maybe it should ignore string content, or maybe it shouldn't encode HTML entities by default?

atjn commented 3 months ago

According to #4029, this "works as intended". I still want to keep this issue open, though, because I think it would be useful if this kind of automatic escaping did not happen. Is there any chance that we could have something like that, maybe just as an option?

nwalters512 commented 3 months ago

Note that I'm not actually a maintainer of Cheerio, so I don't speak for them, I'm just trying to be helpful.

Cheerio is not producing a broken link. What Cheerio produces is 100% valid HTML that will be understandable by any browser, parser, etc. that follows the HTML specification. The fact that attributes can contain character references is an inherent part of the HTML spec (https://html.spec.whatwg.org/multipage/syntax.html#attributes-2). If you have a raw HTML document and you try to use an attribute value verbatim without first parsing it per the HTML specification, you're going to have a bad time. As I noted on the other issue, Cheerio will happily give you the decoded value if you use .attr('href') or the like.
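As a rough illustration of what "parsing the attribute value first" means, the sketch below decodes a raw attribute value before using it. `decodeBasicEntities` is a hypothetical helper written for this example; it is not Cheerio's API and covers only a handful of character references, whereas a real HTML parser (and Cheerio's .attr('href')) handles the full set defined by the specification.

```javascript
// Hypothetical helper for illustration only: it decodes a few named character
// references plus numeric ones. It is NOT a complete HTML decoder.
const NAMED = new Map([
  ["amp", "&"],
  ["lt", "<"],
  ["gt", ">"],
  ["quot", '"'],
  ["apos", "'"],
]);

function decodeBasicEntities(value) {
  return value.replace(
    /&(?:#[xX]([0-9a-fA-F]+)|#([0-9]+)|([a-zA-Z]+));/g,
    (match, hex, dec, name) => {
      if (name !== undefined) {
        // Leave unknown named references untouched.
        return NAMED.has(name) ? NAMED.get(name) : match;
      }
      const codePoint = hex !== undefined ? parseInt(hex, 16) : parseInt(dec, 10);
      // Leave out-of-range numeric references untouched instead of throwing.
      return codePoint <= 0x10ffff ? String.fromCodePoint(codePoint) : match;
    }
  );
}

// The attribute value as it appears in the serialized HTML.
const serialized = "https://example.com/?foo=1&amp;bar=1";
const decoded = decodeBasicEntities(serialized);

console.log(decoded); // https://example.com/?foo=1&bar=1
```

Once decoded, the value matches the URL from the original document, so nothing is lost in the serialized form as long as the reader decodes it.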

atjn commented 3 months ago

@nwalters512 thanks for trying to help. I understand that what Cheerio does is technically compliant, but in my use case, it seems bad.

I want to use Cheerio to edit an existing HTML file which will later be touched by humans. If I am a developer working on the file that Cheerio spat out, I would be really confused to see escape characters in my URLs. Not only is it hard to mentally parse, but if I try to copy and paste the link into a browser, or if I have a fancy code editor where I can click to open the link, I will be taken to the wrong URL, because the browser doesn't attempt to perform HTML-decoding on a URL that is provided directly by the user. This would make me pretty frustrated and prompt me to manually change all the HTML-encoded characters to their original counterparts. That would last a few hours until someone uses Cheerio to edit the file again.