apostrophecms / sanitize-html

Clean up user-submitted HTML, preserving whitelisted elements and whitelisted attributes on a per-element basis. Built on htmlparser2 for speed and tolerance
MIT License
3.68k stars 349 forks source link

Special Entities are encoded within attribute values #666

Closed josephrexme closed 6 days ago

josephrexme commented 1 week ago

I have searched through all existing issues and while I understand the reasoning behind using special characters in HTML always and I'm able to bypass that for text content with something like lodash's unescape i.e:

sanitize(myHTML, { textFilter: text => unescape(text) })

I had found that this a problem for attribute values as well. Ampersands in the src of an iframe are being converted from & to &

To Reproduce

Step by step instructions to reproduce the behavior:

Enable iframe tag, enable src attribute,

Try:

sanitize('<iframe src="https://myurl.com?param1=hellol&param1=hi"></iframe>')
// <iframe src="https://myurl.com?param1=hellol&amp;param1=hi"></iframe>

Expected behavior

With the response to other issues around entity encoding being based on HTML needing encoded ampersand for better validity, I don't think that applies to attribute values. They are information carriers that I believe should be preserved and exempt from this encoding.

Details

Sanitize HTML version: 2.13.0 Version of Node.js: 20.8.0

Server Operating System: MacOS

Screenshots

Screenshot 2024-06-26 at 12 26 51 PM

boutell commented 6 days ago

While HTML5 does not require escaping & if the characters that follow it cannot be mistaken for an HTML entity, there are lots of valid HTML entities, and it doesn't make sense for us to burden this module with logic to try to exhaustively make sure the characters that follow won't turn leaving & alone into a document-breaking decision. Such logic would be better kept in a dedicated HTML minification tool.

See "errors involving fragile syntax constructs" in the HTML5 specification.

josephrexme commented 6 days ago

@boutell it feels like this was an automated response because I don't see this addressing that the discussion is focused on attribute values. I'd willingly contribute by adding a PR if you're willing to discuss more on this.

I am not against the decision of the library to do this to text within HTML but attribute values should not be getting this treatment too. Or at least, there should be a way to opt out just like there's a allowEmptyAttributes (or something like that).

What do you think?

P.S I do appreciate the time you take to keep this project active and respond to a lot of the issues so quickly but I'm hoping we don't dismiss what could be a good discussion too quickly

boutell commented 6 days ago

Happy to explain a little more. The entity handling rules of HTML still apply inside attributes. If you put &quot; in an href attribute, and click that link, the server does not receive &quot;, it receives &

So if we don't pay attention to this and a valid entity escape sequence (not intended as such) winds up in there, the server is not going to get the expected data.

josephrexme commented 6 days ago

Thanks for explaining further @boutell , I appreciate it. I didn't want wrap the entire function call with unescape for my use-case earlier, but it was what I had to end up doing to get around this.

boutell commented 6 days ago

OK, just understand that this means that if something that looks like an entity is ever valid data in your query string it's going to get lost. Remember, the server is not going to get &. It's going to get &. So I am not convinced you are really arriving at the right solution.

On Thu, Jun 27, 2024 at 3:36 PM Joseph Rex @.***> wrote:

Thanks for explaining further @boutell https://github.com/boutell , I appreciate it. I didn't want wrap the entire function call with unescape for my use-case earlier, but it was what I had to end up doing to get around this.

— Reply to this email directly, view it on GitHub https://github.com/apostrophecms/sanitize-html/issues/666#issuecomment-2195527771, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAAH27MBCCDMIHRTNLUMG2TZJRSUTAVCNFSM6AAAAABJ6KW266VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCOJVGUZDONZXGE . You are receiving this because you were mentioned.Message ID: @.***>

--

THOMAS BOUTELL | CHIEF TECHNOLOGY OFFICER APOSTROPHECMS | apostrophecms.com | he/him/his