flavorjones / loofah

Ruby library for HTML/XML transformation and sanitization
MIT License

unclosed html tags are also being pruned off, ideal expectation is to have only closed tags pruned #254

Closed: anil-adepu closed this issue 1 year ago

anil-adepu commented 1 year ago

Loofah.scrub_fragment("This is <sometext>", :strip).text

flavorjones commented 1 year ago

Hi! Thanks for asking this question. Let me explain what's going on ...

doc = Nokogiri::HTML4::DocumentFragment.parse("This is <sometext>")

doc.to_html
# => "This is <sometext></sometext>"

Nokogiri's HTML parser sees this tag and auto-corrects it with a closing tag. This is how the parser is configured for HTML, and it mimics what many browsers do when they encounter invalid or ill-formed HTML.

Loofah then sits on top of that Nokogiri object. When it strips invalid tags, that tag gets stripped.

Keep in mind that bare < and > are invalid outside of CDATA contexts like <script> or <style>.

You haven't explained your use case, but here are a few options for you to consider:

  1. If you have control of this content, make sure you use HTML entities instead of bare < and >:

     Loofah.scrub_fragment("This is &lt;sometext&gt;", :strip).to_html
     # => "This is &lt;sometext&gt;"

  2. If you don't have control, but you want the tags to appear in the output, use :escape instead of :strip:

     Loofah.scrub_fragment("This is <sometext>", :escape).to_html
     # => "This is &lt;sometext&gt;&lt;/sometext&gt;"

I'm going to close this issue, but if you have more questions I'm happy to answer them.

anil-adepu commented 1 year ago

Hi @flavorjones, thanks for the reply. Sorry for leaving out the exact use case; let me elaborate on the issue in detail:

Providing some sample input string below to mimic user-inputted content:

"<script>unsafe script inserted</script> Maintaining some metric under <xxxx dollars for this Q" (where <xxxx is plain user-typed text, not an intended tag)

Neither the :strip nor the :escape scrubber helps here, solely because of the closing tags that Nokogiri's HTML parsing adds:

  1. :strip is generating output as below:

     Loofah.scrub_fragment("<script>unsafe script inserted</script> Maintaining some metric under <xxxx dollars for this Q", :strip).text(encode_special_chars: false)
     # => "unsafe script inserted Maintaining some metric under"
  2. :escape as:

     Loofah.scrub_fragment("<script>unsafe script inserted</script> Maintaining some metric under <xxxx dollars for this Q", :escape).text(encode_special_chars: false)
     # => "<script>unsafe script inserted</script> Maintaining some metric under <xxxx dollars for this q></xxxx>"
  3. For the above use case, we are looking to generate sanitized content without the additional auto-closed tags coming from HTML parsing:

     "unsafe script inserted Maintaining some metric under <xxxx dollars for this Q"

I'm wondering if we can patch on top of Nokogiri's DocumentFragment.parse to skip closing/reformatting unclosed HTML tags (which are harmless anyway) and treat them as regular text content.

flavorjones commented 1 year ago

Again, the behavior you're seeing is from libxml2's HTML4 parser, so there's very little we can do in Nokogiri or Loofah to change that behavior if we wanted to.

The key point is that you're passing ill-formed HTML to Loofah and Nokogiri. We could debate a bit about the "right" way to autocorrect ill-formed markup, since the HTML4 spec doesn't specify how it should be done, but ...

Notably, though, Nokogiri's HTML5 parser (libgumbo), which follows the HTML5 spec, behaves similarly. The HTML5 spec is very specific about how ill-formed markup should be recovered, and we wouldn't want to change that behavior.

My point, I guess, is that Loofah is supposed to make the content safe for rendering in a browser. It's doing that in this case. It's not doing it in the way you wish it did, but then again I'm not sure how what you're proposing could be made safe.

flavorjones commented 1 year ago

The other option you should consider, if what you're saying is that the string should be considered as "text" and not HTML, is to just call CGI.escapeHTML on the string.
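A minimal sketch of that option with only Ruby's standard library, assuming the whole string is meant to be plain text rather than markup:

```ruby
require "cgi"

# Escape the untrusted string wholesale instead of parsing it as HTML.
CGI.escapeHTML("This is <sometext>")
# => "This is &lt;sometext&gt;"
```

No parser is involved, so nothing gets auto-closed or pruned; every < and > survives as a visible character when the browser renders it.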

flavorjones commented 1 year ago

Also please don't use the encode_special_chars option on untrusted input as then you're no longer sanitizing the output.

flavorjones commented 1 year ago

Sorry, one last comment, and I'm sorry to keep writing, but your question and response are making me nervous because I think you're trying to do something that's unsafe.

In your original description you say the desired output includes the bytes this is <sometext> without the closing tag. If you manage to do this, I think you'll see that the browser will follow the HTML5 spec and auto-close it, making it a sometext element in the DOM, and the user will just see this is. Does that make sense? If not, please go through the exercise of putting those bytes in a static HTML page and rendering it in your browser.

I don't think this is what you want. I think what you want is to emit this is &lt;sometext&gt; so that the browser renders something that the user sees as this is <sometext>.

Please don't conflate the byte stream with what you want the user to see in a browser-rendered DOM. This is why it is so dangerous to use "encode_special_chars".

anil-adepu commented 1 year ago

Thanks for the correction. Right, the expectation is to emit this is &lt;sometext&gt; and render the user-inputted content in the text editor as is (which I guess should be fine, as the escaped entities are safe), instead of auto-closing it and treating it as HTML. The issue arises because the user-inputted text (text essentially containing <some plain text> style strings) is being treated as unclosed/ill-formed HTML and parsed with libgumbo/libxml2, where the autocorrection logic kicks in. Doing this per the HTML5 spec is totally justified, but from a UX standpoint the content is likely safe from the user's perspective. This was the exact pain point for us. So we're trying to find a way to skip the unclosed tags and safely treat them as plain text, while the rest of the logic (scrubbing off unsafe/well-formed HTML tags) is followed as is.

Also, regarding our usage of "encode_special_chars": it looks like it acts on the already-scrubbed content, where essentially all of the unsafe content has been stripped off, so the result should be safe, cmiiw; so I'm trying to understand how doing this would be "dangerous" as you previously mentioned.

flavorjones commented 1 year ago

If the user content is supposed to be text, then I would treat it as text and use CGI.escapeHTML instead of an HTML sanitizer. An HTML sanitizer is only really needed if the untrusted content is actually intended to be interpreted as a DOM structure by the browser. Does that make sense?

With respect to encode_special_chars: false, here's the scenario:

  1. Untrusted user input is the string "hello &lt;script&gt;alert(1);&lt;/script&gt; there"
  2. You scrub it:

     Loofah.fragment("hello &lt;script&gt;alert(1);&lt;/script&gt; there").scrub!(:strip)
     # =>
     # #(DocumentFragment:0xdf84 {
     #   name = "#document-fragment",
     #   children = [ #(Text "hello <script>alert(1);</script> there")]
     #   })
  3. You then serialize it, choosing not to encode special character entities:

     Loofah.fragment("hello &lt;script&gt;alert(1);&lt;/script&gt; there").scrub!(:strip).text(encode_special_chars: false)
     # => "hello <script>alert(1);</script> there"
  4. When you stick that string into your webpage, the JavaScript runs in the browser.

Instead, you should just serialize it with the defaults:

Loofah.fragment("hello &lt;script&gt;alert(1);&lt;/script&gt; there").scrub!(:strip).text
# => "hello &lt;script&gt;alert(1);&lt;/script&gt; there"

which, when it arrives in the browser, will be rendered to the user's eyeballs as "hello <script>alert(1);</script> there" and not executed as JavaScript.
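The danger can be sketched with just the standard library: serializing with encode_special_chars: false has roughly the same effect as undoing the escaping yourself (CGI.unescapeHTML here stands in for that decoding step):

```ruby
require "cgi"

safe_bytes = "hello &lt;script&gt;alert(1);&lt;/script&gt; there"

# Decoding the entities reintroduces live <script> tag bytes, which is
# effectively what serializing with encode_special_chars: false emits.
CGI.unescapeHTML(safe_bytes)
# => "hello <script>alert(1);</script> there"
```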

Make sense?

anil-adepu commented 1 year ago

Thanks, got it. But the use case we discussed is to remove the unsafe content while rendering the rest of the user-inputted string as is. Let's take a sample string like the one below to understand more:

"hello there, <script>alert(1); some unsafe script injected</script> logging revenue: <million dollars for this Q."

Now, the requirement here is to strip/prune the unsafe script tag and render everything else as is. Expected (with :prune): "hello there, logging revenue: <million dollars for this Q."

  1. Treating the content as text and calling CGI.escapeHTML on it would give: "hello there, &lt;script&gt;alert(1); some unsafe script injected&lt;/script&gt; logging revenue: &lt;million dollars for this Q."

     Doing the above would render the content as is (same as the input) to the user, but the script tag isn't pruned off, since the input is escaped as HTML content.

  2. Whereas, treating the input as HTML content and pruning it with the HTML sanitizer:

     Loofah.fragment("hello there, <script>alert(1); some unsafe script injected</script> logging revenue: <million dollars for this Q.").scrub!(:prune).text

     gives "hello there, logging revenue: ", but the user-inputted substring "<million dollars for this Q." is pruned off too! (which looks like data loss from a user's perspective)

So we're looking for a way to achieve the expected output: using the HTML parser alone auto-closes the ...<million... input too, and it eventually gets pruned off along with the unsafe script tags, whereas using CGI.escapeHTML retains the required ...<million... substring as expected but also shows the unsafe content, the script tag.

Hope the issue is clearer now?

flavorjones commented 1 year ago

So looking for a way to achieve the expected output as ...

Let me be crystal clear: you're asking for something that no HTML parser does. Nokogiri and Loofah will not do this for you.

Sanitizers' job is to make unsafe content safe, that's it. You're asking for something else, something aesthetic, which is to treat < as a start-of-tag character if it's starting a well-formed tag, but to treat it as a plaintext character if it's not. No HTML parser I know of will do this for you, and I've been trying to communicate to you the why behind that. I'm sorry if I've not done that well.

You need to choose which behavior is an acceptable compromise. And I'm begging you not to choose the option that makes your site insecure.

flavorjones commented 1 year ago

And I'll say it one more time just for the record: please do not use encode_special_chars: false if you're rendering for a web browser. If you do that, you will be vulnerable to XSS attacks. This option is intended for rendering in textual contexts like SMS messages, not HTML contexts like web browsers.

anil-adepu commented 1 year ago

Makes sense now, understood. Thanks a ton for bearing with this!