apostrophecms / sanitize-html

Clean up user-submitted HTML, preserving whitelisted elements and whitelisted attributes on a per-element basis. Built on htmlparser2 for speed and tolerance
MIT License
3.68k stars 349 forks source link

Extra tags being added #667

Closed arasfeld-seamless closed 5 days ago

arasfeld-seamless commented 5 days ago

PLEASE NOTE: make sure the bug exists in the latest patch level of the project. For instance, if you are running a 2.x version of Apostrophe, you should use the latest in that major version to confirm the bug.

To Reproduce

const input = '<p>Lorem <strong>ipsum</strong> <pre>dolorem</pre> <foo>sic</foo> amet</p>';
const sanitizedString = sanitizeHtml(input, { allowedTags: ['p'] });

sanitizedString = <p>Lorem ipsum </p>dolorem sic amet<p></p>;

Expected behavior

Sanitized string should remove all non-p tags

Should return <p>Lorem ipsum dolorem sic amet</p>

Describe the bug

A </p> gets inserted before "dolorem" and a <p> gets inserted after "amet"

Details

Version of Node.js: 18

Server Operating System: MacOSX

Additional context:

Add any other context about the problem here. If the problem is specific to a browser, OS or mobile device, specify which.

Screenshots If applicable, add screenshots to help explain your problem.

BoDonkey commented 5 days ago

Confirmed - it seems that nested pre tags are replaced with p tags. Top level pre tags are eliminated correctly if disallowed.

boutell commented 5 days ago

Apparently this is not a bug, it's behavior coming from htmlparser2 which is the HTML parser upstream of us. And that module is behaving reasonably given the rules of HTML:

"Paragraphs are block-level elements, and notably will automatically close if another block-level element is parsed before the closing

tag. See "Tag omission" below."

See:

https://developer.mozilla.org/en-US/docs/Web/HTML/Element/p

So what happened is this:

  1. htmlparser2 saw "pre" inside "p". htmlparser2 said "that's not a thing (see above), but I'll do the most reasonable thing I can: I'll close this p element, output the pre element, and then open a new p element."

  2. sanitize-html stripped the "pre" element, per your configuration.

Probably we can't change this at the sanitize-html level, and shouldn't because htmlparser2 isn't doing anything wrong really.

On Fri, Jun 28, 2024 at 1:13 PM Robert Means @.***> wrote:

Confirmed - it seems that nested pre tags are replaced with p tags. Top level pre tags are eliminated correctly if disallowed.

— Reply to this email directly, view it on GitHub https://github.com/apostrophecms/sanitize-html/issues/667#issuecomment-2197350014, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAAH27PUACWY2R4YF3MIKH3ZJWKS3AVCNFSM6AAAAABKCHL5EOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCOJXGM2TAMBRGQ . You are receiving this because you are subscribed to this thread.Message ID: @.***>

--

THOMAS BOUTELL | CHIEF TECHNOLOGY OFFICER APOSTROPHECMS | apostrophecms.com | he/him/his