Closed spaceemotion closed 6 months ago
Hmmm, that is an interesting problem! Maybe you could allow all tags, then use a hook to catch the ones that are not really tags / not on the default allow-list and encode them?
By doing so, you would still sanitize and allow the weird HTML-like constructs that look like tags but no info gets lost as they get encoded.
@cure53 good idea! How would I have to change the config to allow that?
I am not super-sure what the most elegant way would be, but I think it can be done with two hooks, one being `beforeSanitizeElements` and then `uponSanitizeElement`. That should be the right timing window, me thinks.
Background & Context
I have random text as an input, it can be pure HTML or Markdown with embedded HTML. Sometimes, people use the `<` and `>` symbols for emphasis, like: `<Alert> should show.` or `<Need to see what's up with that.>`. The expected final output should have replaced them with their HTML entity counterparts.
The input text is also language-independent, but I noticed that it's not a problem with Russian or Chinese or when an umlaut appears in the first word.
This is what my current config looks like:
Bug
Input
(see examples above)
Given output
DOMPurify removes the text between `<` and `>` completely, as it interprets the tag names as `alert` and `need` respectively. For example:
Expected output
I know that I need to pre-process this kind of text, or need to intercept something here. I tried coming up with a regex to replace the `<` and `>` with `&lt;` and `&gt;` respectively, but I can't seem to find a good match that doesn't also target regular tags like `<p>`.
The best I could do was `/<(\p{Lu}[^>]+?)>/gu` but that still creates too many false replacements.
Feature
I would like to be able to better control what DOMPurify should do when it encounters a tag name. The tag name check only seems to handle HTML custom elements (which require a dash, which is correct, e.g. `my-button`). The `uponSanitizeElement` hook also doesn't seem to offer a way to control what should be done with the element itself.
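As a rough illustration of the pre-processing idea mentioned under "Expected output", here is a minimal sketch that sidesteps the uppercase-letter regex: it matches anything shaped like a tag and encodes it unless the tag name is on an allow-list. The names `KNOWN_TAGS`, `TAG_LIKE` and `escapeUnknownTags` are hypothetical, the allow-list is a tiny placeholder rather than DOMPurify's real default list, and this is not something DOMPurify provides itself. Only ASCII letters are treated as the start of a tag name, since that is what an HTML parser does, which would also explain why Russian or Chinese text is unaffected.

```javascript
// Hypothetical allow-list for illustration only; a real one would need to
// mirror whatever tags the sanitizer is configured to allow.
const KNOWN_TAGS = new Set([
  'a', 'b', 'br', 'code', 'div', 'em', 'i', 'li',
  'ol', 'p', 'pre', 'span', 'strong', 'ul',
]);

// Anything shaped like an opening or closing tag: "<", an optional "/",
// a name starting with an ASCII letter, then everything up to the next ">".
const TAG_LIKE = /<(\/?)([A-Za-z][^\s\/<>]*)([^<>]*)>/g;

// Encode tag-like constructs whose name is not on the allow-list,
// and leave everything else untouched for the sanitizer to handle.
function escapeUnknownTags(input, knownTags = KNOWN_TAGS) {
  return input.replace(TAG_LIKE, (match, slash, name) => {
    if (knownTags.has(name.toLowerCase())) {
      return match; // looks like a real tag, keep it as-is
    }
    // The match contains exactly one "<" and one ">" (first and last char).
    return match.replace('<', '&lt;').replace('>', '&gt;');
  });
}

console.log(escapeUnknownTags('<Alert> should show.'));
console.log(escapeUnknownTags("<Need to see what's up with that.>"));
console.log(escapeUnknownTags('<p>Hello <strong>world</strong></p>'));
```

The result would then be passed to `DOMPurify.sanitize` as usual. This still has false negatives: emphasis text whose first word happens to be a real tag name, such as `<I think so>` or `<A thing>`, is left alone and would be parsed as a tag, and attribute values containing `>` are not handled.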