apostrophecms / sanitize-html

Clean up user-submitted HTML, preserving whitelisted elements and whitelisted attributes on a per-element basis. Built on htmlparser2 for speed and tolerance.
MIT License

store options (especially regexps) as JSON #550

Closed kussmaul closed 2 years ago

kussmaul commented 2 years ago

The problem to solve

I use sanitize-html in a MEAN (Mongo, Express, Angular, Node) application, and the sanitize-html options are flexible and powerful. For example:

sanitizeHtml : {
  allowedStyles : {
    '*' : {
      'text-align'    : [ /^(left|center|right)$/ ],
      'float'         : [ /^(none|left|right|initial|inherit)$/ ],
    }
  }
}

I want to load options at run time rather than build time, so they can be edited (e.g., to tweak tags, attributes, and iframe hostnames) without rebuilding.

Proposed solution

Previously, I used a TypeScript file (config.ts) that included the sanitizeHtml options but was loaded at build time. Now, I use a JSON file (config.json5) in the assets folder so it can be edited and loaded at runtime. The main problem I've found is regular expressions, which JSON must store as strings, AFAIK:

sanitizeHtml : {
  allowedStyles : {
    '*' : {
      'text-align'    : [ '/^(left|center|right)$/' ],
      'float'         : [ '/^(none|left|right|initial|inherit)$/' ],
    }
  }
}

However, now the application needs to find and convert each regexp. I think this would involve walking the data structure, looking for strings that start with "/^" and end with "$/", and converting each to a regexp. Is this something that would be generally useful to others? Are there better ways to load options at runtime?
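
The walk described above could look roughly like the following. This is a minimal sketch, not part of sanitize-html: `convertRegexps`, the `Json` and `Revived` types, and the `REGEXP_STRING` pattern are hypothetical names, and it assumes every regexp string is written exactly as "/^...$/" with no flags.

```typescript
// Hypothetical helper types: plain parsed JSON in, the same shape with RegExps out.
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };
type Revived = string | number | boolean | null | RegExp | Revived[] | { [key: string]: Revived };

// Assumed convention: the whole string is "/^...$/" (no flags).
const REGEXP_STRING = /^\/(\^.*\$)\/$/;

// Recursively walks the options and converts each matching string to a RegExp.
function convertRegexps(value: Json): Revived {
  if (typeof value === 'string') {
    const m = REGEXP_STRING.exec(value);
    return m ? new RegExp(m[1]) : value;
  }
  if (Array.isArray(value)) {
    return value.map((item) => convertRegexps(item));
  }
  if (value !== null && typeof value === 'object') {
    const out: { [key: string]: Revived } = {};
    for (const key of Object.keys(value)) {
      out[key] = convertRegexps(value[key]);
    }
    return out;
  }
  return value; // numbers, booleans, and null pass through unchanged
}

// Example input, shaped like the options shown earlier in this issue.
const loaded: Json = {
  allowedStyles: {
    '*': {
      'text-align': ['/^(left|center|right)$/'],
      'float': ['/^(none|left|right|initial|inherit)$/'],
    },
  },
};
const options = convertRegexps(loaded);
```

The walker returns a new structure rather than mutating the input, so the raw configuration stays available for logging or re-serialization.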

Thank you for a great package and your feedback!

kussmaul commented 2 years ago

This JSON5 issue has good discussion & references, explaining why this gets complicated: https://github.com/json5/json5/issues/91 [Consider serializing RegExp objects to strings]

This SO post describes how to customize JSON's stringify() with replacer and parse() with reviver arguments: https://stackoverflow.com/questions/12075927/serialization-of-regexp
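
As a rough illustration of the replacer/reviver approach, here is a round trip using only the built-in JSON.stringify() and JSON.parse(). It is a sketch under assumptions: the `__REGEXP__` marker, `regexpReplacer`, and `regexpReviver` are names chosen for this example, and flags are ignored.

```typescript
// Assumed marker: any prefix that cannot occur in ordinary option values works.
const PREFIX = '__REGEXP__';

// Replacer for JSON.stringify(): encodes a RegExp as PREFIX + its source.
// Flags are dropped in this sketch.
function regexpReplacer(_key: string, value: unknown): unknown {
  return value instanceof RegExp ? PREFIX + value.source : value;
}

// Reviver for JSON.parse(): decodes PREFIX-tagged strings back into RegExps.
function regexpReviver(_key: string, value: unknown): unknown {
  if (typeof value === 'string' && value.slice(0, PREFIX.length) === PREFIX) {
    return new RegExp(value.slice(PREFIX.length));
  }
  return value;
}

const original = {
  'text-align': [/^(left|center|right)$/],
  note: '/^not a regexp$/', // untagged, so it stays a string
};
const text = JSON.stringify(original, regexpReplacer);
const restored = JSON.parse(text, regexpReviver);
```

Without the replacer, JSON.stringify() would serialize each RegExp as an empty object, which is why the encoding step is needed when writing the file programmatically.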

In my situation, I need to convert regexp strings to regexps; every regexp starts with /^ and ends with $/, and no other strings match this pattern. In other contexts, it might be better to require a unique prefix (e.g. __REGEXP__) as in the SO post.

/**
 * Revive function for JSON.parse() that converts each regexp string to RegExp.
 * @param val   JSON value to revive.
 * @returns     Value with converted regexp strings  
 */
function reviveRE(val : any) : any {
  if (Array.isArray(val)) { return val.map(reviveRE); }
  if (typeof val === 'string') {
    const m = val.match(/\/(\^.*\$)\//); // FIXME: adjust regexp for your situation
    if (2 == m?.length) { return new RegExp(m[1]); }
  }
  return val;
}

const cfg   = JSON5.parse(txt, (_k : string, v : any) => reviveRE(v));

kussmaul commented 2 years ago

As pointed out by @jordanbtucker, JSON5 already handles arrays, so the first line of reviveRE is unnecessary.
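
To see why the array branch is unnecessary, the sketch below records every key the reviver is called with. It uses the built-in JSON.parse() so that it runs without dependencies; that JSON5.parse() visits array elements the same way is taken from the comment above, not verified here. `visitedKeys` and the "tag" property are additions for illustration only.

```typescript
// Records each key passed to the reviver, in call order.
const visitedKeys: string[] = [];

// Simplified reviver: no array handling, only a per-string check for "/^...$/".
function reviveRE(key: string, val: unknown): unknown {
  visitedKeys.push(key);
  if (typeof val === 'string' && val.length >= 4 &&
      val.slice(0, 2) === '/^' && val.slice(-2) === '$/') {
    return new RegExp(val.slice(1, -1)); // strip the surrounding slashes
  }
  return val;
}

const txt = '{"float": ["/^(none|left|right)$/", "/^(initial|inherit)$/"], "tag": "b"}';
const cfg = JSON.parse(txt, reviveRE);
// visitedKeys is now ['0', '1', 'float', 'tag', '']
```

The reviver sees each array element (keys '0' and '1') before the array itself, so by the time the array is visited its strings have already been converted.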

jordanbtucker commented 2 years ago

@kussmaul You should probably update your regex to /^\/(\^.*\$)\/$/, since your version will match any string that contains /^ and $/ anywhere within it (e.g. 'This string contains a matching regex /^abc$/ but it is not a matching regex itself.').
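
A small comparison of the two patterns, using the example sentence above. The variable names are chosen for this sketch only.

```typescript
const unanchored = /\/(\^.*\$)\//;   // pattern from the original reviveRE
const anchored = /^\/(\^.*\$)\/$/;   // pattern with ^ and $ anchors added

const regexString = '/^(left|center|right)$/';
const sentence =
  'This string contains a matching regex /^abc$/ but it is not a matching regex itself.';

const results = {
  unanchoredOnRegexString: unanchored.test(regexString), // true
  unanchoredOnSentence: unanchored.test(sentence),       // true: false positive
  anchoredOnRegexString: anchored.test(regexString),     // true
  anchoredOnSentence: anchored.test(sentence),           // false
};
```

With the unanchored pattern, the reviver would silently replace the whole sentence with the regexp /^abc$/.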

If you're looking for a more robust RegExp reviver function, including one that supports flags, see my regexp-reviver.js gist. It was written for JSON5, but it works just as well with JSON.
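
For readers without access to the gist, a reviver that also accepts flags might look something like this. It is not the gist's code: `reviveRegExp` and `REGEXP_WITH_FLAGS` are hypothetical, the sketch keeps this thread's convention that the source starts with ^ and ends with $, and it assumes only the flags g, i, m, s, u, and y are of interest.

```typescript
// Assumed format: "/^...$/flags", where flags is zero or more of g, i, m, s, u, y.
const REGEXP_WITH_FLAGS = /^\/(\^.*\$)\/([gimsuy]*)$/;

// Reviver for JSON.parse() (or a parser with the same reviver contract).
function reviveRegExp(_key: string, value: unknown): unknown {
  if (typeof value !== 'string') { return value; }
  const m = REGEXP_WITH_FLAGS.exec(value);
  if (!m) { return value; }
  try {
    return new RegExp(m[1], m[2]);
  } catch {
    return value; // invalid source or flags (e.g. "ii"): keep the original string
  }
}

const parsed = JSON.parse(
  '{"a": ["/^abc$/i", "/^abc$/", "plain", "/^abc$/ii"]}',
  reviveRegExp,
);
```

Falling back to the original string on a construction error keeps a single bad entry from aborting the whole load, though a stricter loader might prefer to report it instead.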

kussmaul commented 2 years ago

@jordanbtucker thank you for the help writing a regex to match a regex. :-)
