Uncertain how to handle 'non-standard' HTML

spaceemotion commented 2 months ago

This issue asks a question on how to prepare mixed text for purification

Background & Context

I have random text as an input, it can be pure HTML or Markdown with embedded HTML. Sometimes, people use the < and > symbols for emphasis, like: <Alert> should show. or <Need to see what's up with that.>. The expected final output should have replaced them with their HTML entity counter-parts.

The input text is also language-independent, but I noticed that it's not a problem with Russian or Chinese or when an umlaut appears in the first word.

This is how my current config looks like:

const config: Config = {
  KEEP_CONTENT: true,
  FORBID_TAGS: ['script', 'style'],
  ALLOW_ARIA_ATTR: false,
  WHOLE_DOCUMENT: false,
  CUSTOM_ELEMENT_HANDLING: {
    tagNameCheck: () => true,
  },
  USE_PROFILES: {
    html: true,
  },
};

Bug

Input

(see examples above)

Given output

DOMPurify removes the text between < and > completely, as it interprets the tag names as alert and need respectively. for example:

input: "<Alert> should show."
output: " should show."

Expected output

I know that i need to pre-process this kind of text, or need to intercept something here. I tried coming up with a regex to replace the < and > with > and < respectively, but I can't seem to find a good match that doesn't also target regular tags like <p>.

The best I could do was /<(\p{Lu}[^>]+?)>/gu but that still creates too many false replacements

Feature

I would like to be able to better control what DOMPurify should do when it encounters a tag name. The tag name check only seems to handle HTML custom elements (which require a dash, which is correct, e.g. my-button). The uponSanitizeElement hook also doesn't seem offer a way to control what should be done with the element itself.

cure53 commented 2 months ago

Hmmm, that is an interesting problem! Maybe you could allow all tags, then use a hook to catch the ones that are not really tags / not on the default allow-list and encode them?

By doing so, you would still sanitize and allow the weird HTML-like constructs that look like tags but no info gets lost as they get encoded.

spaceemotion commented 2 months ago

@cure53 good idea! how would i have to change the config to allow so?

cure53 commented 2 months ago

I am not super-sure what the most elegant way would be, but I think it can be done with two hooks, one being beforeSanitizeElements an then uponSanitizeElement. That should be the right timing window, me thinks.

cure53 / DOMPurify