apostrophecms / sanitize-html

Clean up user-submitted HTML, preserving whitelisted elements and whitelisted attributes on a per-element basis. Built on htmlparser2 for speed and tolerance
MIT License
3.79k stars 353 forks source link

Return true/false if sanitization required or not #498

Closed cyberfox1 closed 3 years ago

cyberfox1 commented 3 years ago

The problem to solve

Sometimes you just want to know if the HTML fails the sanitization check, rather than actually produced cleaned HTML, e.g. to simply reject the HTML.

Proposed solution

An option passed to the module that causes it to return true/false depending on whether the HTML was considered "valid" or not - valid==true==did not require sanitization.

Alternatives

String comparing the cleaned response from module to the original HTML - however this seems hacky.

Additional context

If this option already exists it is not obvious.

boutell commented 3 years ago

This module is not a validator and it relies on a parser that is inherently tolerant, so I don't think that's a direction that would make sense for this module. I would recommend looking for a dedicated validator.

cyberfox1 commented 3 years ago

I did not suggest HTML validation in the general sense, but in the sense that it does not contain elements that would have required sanitation. Providing a function that performs some kind of manipulation, without letting the caller know if manipulate was required or occurred or not appears to be an omission. And in any case there are use cases in which it would be useful. What if you simply wanted to display an indicator saying some elements had been removed? Seems weird this would be seen as out of scope for this module.

abea commented 3 years ago

@cyberfox1 Could you provide a concrete example?

cyberfox1 commented 3 years ago

Well suppose this module does almost everything I want in term of being able to specify valid/invalid tags and other options etc, however rather than only producing the sanitized HTML, it also return a flag saying if such sanitization was necessary, so I can handle it differently. I might reject it, or mark it with metadata so it can be displayed with a flag, or maybe treating HTML that required sanitization as suspicious and storing it in a different location. I don't have any code yet if that is what you mean by a concrete example.

abea commented 3 years ago

Yeah, it sounds like you'd be talking about changing the output from a string of HTML to an object, or possibly just a boolean. I was looking to understand what you were looking for better. It could be something that you could accomplish with loose rules and the transform or filter functions instead.

cyberfox1 commented 3 years ago

Or it could be an option to change the response structure, or probably even better a whole new function that can be imported from the module that has the boolean return semantics. I would not want to change the default behavior as that would be disruptive. Or an option to pass in a structure that the module sets with things like counts of changes for various kind of elements, and a total_changes count that would be zero if no invalid elements were detected.

I can't find any similar validator module that allows valid/invalid tags to be specified and returns a boolean value.

boutell commented 3 years ago

I get where you're coming from. I see how it might be nice to instrument the module to give some information on what it's doing.

My main concern however is that because we're relying on an HTML parser that is tolerant and just tells us what structure it found, not exactly what original markup it found, the output is frequently going to be different from the input in ways that our module can't detect at all. And this could include things that count as necessary, to some. But it would never be caught by the "validation detection" code. So I think it would be difficult to deliver this in a way that developers would find satisfying.

On Wed, Sep 29, 2021 at 6:37 PM cyberfox1 @.***> wrote:

Or it could be an option to change the response structure, or probably even better a whole new function that can be imported from the module that has the boolean return semantics. I would not want to change the default behavior as that would be disruptive. Or an option to pass in a structure that the module sets with things like counts of changes for various kind of elements, and a total_changes count that would be zero if no invalid elements were detected.

I can't find any similar validator module that allows valid/invalid tags to be specified and returns a boolean value.

— You are receiving this because you modified the open/close state. Reply to this email directly, view it on GitHub https://github.com/apostrophecms/sanitize-html/issues/498#issuecomment-930596183, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAAH27L54YPJTQTS6MBM3MDUEOILRANCNFSM5E6Q3UCA . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

--

THOMAS BOUTELL | CHIEF TECHNOLOGY OFFICER APOSTROPHECMS | apostrophecms.com | he/him/his