apostrophecms / sanitize-html

Clean up user-submitted HTML, preserving whitelisted elements and whitelisted attributes on a per-element basis. Built on htmlparser2 for speed and tolerance.
MIT License

Detect if a string was filtered #571

Closed hrishikesh-k closed 1 year ago

hrishikesh-k commented 1 year ago

My question is somewhat related to: https://github.com/apostrophecms/sanitize-html/issues/292

Question or comment

TL;DR: Detect whether a string was modified by filtering, or get the content that was filtered out.

I'm trying to find a way to check if sanitize-html actually did some filtering on a given string. For example, I'm using the library to filter my query params as well. If the query params include the expected values like numbers or small strings (which is how my front-end would send the request, but people could always use tools like curl), it should work fine. But if someone tries to use some malicious query param value, I'm trying to check it with this library. Now, if the library actually filters something out, I wish to reject that API call instead of proceeding with the filtered value as it would most likely be an invalid or unexpected value.

For example:

https://www.example.com/api/?param1=12345

should work fine as even if I pass it through sanitize-html, it will remain unchanged.

However, if someone sends a request like:

https://www.example.com/api/?param1=some-malicious-string

and sanitize-html filters something from it, I wish to stop my API from processing further.

I have considered comparing the original string with the sanitized string like:

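// Read the raw query parameter and sanitize it with the default options
const original = request.query['param1']
const sanitized = sanitizeHtml(original)
if (original === sanitized) {
  // unchanged: process the request
} else {
  // changed: reject because something was filtered
}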

But I was wondering if there's any better way to do this instead of having to filter multiple params like this. Also, when I use it with my message body, I cannot rely on this comparison, as I would expect some attributes to be stripped out. I'm using WindiCSS with Attributify mode: https://windicss.org/features/attributify.html in TipTap editor. The attributes are only for styling in the frontend, and I do not care about them in the backend, so it's okay for those to be filtered out, which is why I'm not using this library's allow-list for those (but I can if that's the only way).
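The per-parameter comparison described above can be generalized to every query parameter at once. The following is only a minimal sketch: `findFilteredParams` is a hypothetical helper name, and the sanitizer is passed in as an argument so the helper does not assume any particular library. Non-string values (for example arrays produced by repeated parameters) are reported as well, on the assumption that they are also unexpected input.

```javascript
// Hypothetical helper: returns the names of the parameters that are either
// not plain strings or whose value changes when passed through `sanitize`.
// `sanitize` is any function of the shape (string) => string.
function findFilteredParams(params, sanitize) {
  return Object.entries(params)
    .filter(([, value]) => typeof value !== 'string' || sanitize(value) !== value)
    .map(([name]) => name);
}

// Usage sketch (assumes an Express-style `request.query` object):
//   const sanitizeHtml = require('sanitize-html');
//   const filtered = findFilteredParams(request.query, sanitizeHtml);
//   if (filtered.length > 0) { /* reject the request */ }
```

This works best for short values such as numbers or identifiers. If the sanitizer escapes or normalizes characters in otherwise harmless text, such values would also be reported as changed.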

I was planning to use DOMPurify, which has an option for this: https://github.com/cure53/DOMPurify#okay-makes-sense-lets-move-on (a .removed property that shows the filtered content). However, I'm running into this issue: https://github.com/kkomelin/isomorphic-dompurify/issues/54 (for non-Next.js apps running on AWS Lambda), so I need to use something different.

I did not find any relevant info in the docs or in the issues at the moment.

Let me know if this question/request doesn't make sense and I'd be happy to clarify further.

Details:

Version of Node.js: 16

Server Operating System: Linux

Additional context: N/A

Screenshots: N/A

boutell commented 1 year ago

There is currently no support for detecting whether sanitize-html made any changes.

Simple string comparison will fail because sanitize-html always formats attributes in a consistent way. For example, it always uses double quotes, whereas the original might have single quotes or no quotes and still be valid in many cases.

I think tracking whether sanitize-html discarded anything would be a feature worth having, and it would make a good pull request. It could take time to reach a complete feature set there, including support for allowing custom transformation functions etc. to report that they did something.

One problem with this idea though is that if you're OK with some changes and not others, it might not satisfy you, and there's no really universal way to check for the changes that matter to some developers and not to others.

Closing because the question has an answer, but this doesn't mean it's not a topic of interest.

carcinocron commented 1 year ago

Would it theoretically work to run sanitize with no rules to get the "consistent way" for reference, then check it against the result of running sanitize with rules?
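The idea in this comment can be sketched roughly as follows. This is an untested assumption, not something the maintainers confirmed: `wasFiltered` is a hypothetical helper, the sanitizer is injected with the call shape `sanitize(html, options)`, and the default permissive options assume that setting `allowedTags` and `allowedAttributes` to `false` makes sanitize-html keep all tags and attributes while still normalizing the output.

```javascript
// Hypothetical helper: sanitize once with permissive options to get a
// normalized baseline, once with the real options, and compare the two.
function wasFiltered(html, sanitize, strictOptions, permissiveOptions) {
  // Assumption: these options disable tag and attribute filtering
  // but still produce consistently formatted output.
  const baselineOptions =
    permissiveOptions || { allowedTags: false, allowedAttributes: false };
  const baseline = sanitize(html, baselineOptions);
  const strict = sanitize(html, strictOptions);
  // Any difference means the strict rules removed or changed something.
  return baseline !== strict;
}

// Usage sketch:
//   const sanitizeHtml = require('sanitize-html');
//   if (wasFiltered(body, sanitizeHtml, { allowedTags: ['p', 'a'] })) {
//     /* reject, or log that content was stripped */
//   }
```

As noted earlier in the thread, this only reports that something changed. It cannot distinguish changes you accept (such as stripped styling attributes) from changes you want to reject, and custom transformations in the strict options would also register as a difference.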