How to bypass characters like less than character when sanitising the data

piyush-ally commented 3 years ago

Loofah.scrub_fragment("<hello message", :strip).to_text returns empty string

When there are no closing tags why is <hello getting removed. This seems like an incorrect behaviour. Is there a way to bypass this and return <hello message in such cases?

Let me know if any more information is required from my side.

flavorjones commented 3 years ago

Hi! Thanks for asking this question.

This code snippet uses #scrub_fragment which does two things:

parse the fragment into a Nokogiri::HTML::DocumentFragment
sanitize that DocumentFragment

Let's separate these two operations to see what's going on ...

Loofah.fragment("<hello message").children
# => [#<Nokogiri::XML::Element:0x2bc name="hello" attributes=[#<Nokogiri::XML::Attr:0x2d0 name="message">]>]

Interesting: Nokogiri parses that fragment into a <hello></hello> element. Why is that? Nokogiri (actually, libxml2) treats this as a "markup error" and tries to fix it:

Loofah.fragment("<hello message").errors
# => 
# [#<Nokogiri::XML::SyntaxError: 1:27: ERROR: Tag hello invalid>,
#  #<Nokogiri::XML::SyntaxError: 1:27: ERROR: Couldn't find end of Start Tag hello>]

If your intention is to have this string interpreted as a "text node" that equals <hello message you should be aware that a bare < in an HTML text node is considered malformed, and you should use < instead. You may want to consider HTML-escaping anything that's a text node before passing it to Loofah:

CGI.escapeHTML("<hello message")
# => "&lt;hello message"
Loofah.fragment(CGI.escapeHTML("<hello message"))
# => #(DocumentFragment:0x3d4 { name = "#document-fragment", children = [ #(Text "<hello message")] })
Loofah.fragment(CGI.escapeHTML("<hello message")).to_html
# => "&lt;hello message"

The <hello> element is being removed by the Strip scrubber. The documentation says:

+:strip+ removes unknown/unsafe tags

Is <hello></hello> a known and safe tag? Let's look at the code:

https://github.com/flavorjones/loofah/blob/369a54f97ce8effd797176cdaade7c2ac58e738b/lib/loofah/scrubbers.rb#L96-L102

which calls html5lib_sanitize:

https://github.com/flavorjones/loofah/blob/369a54f97ce8effd797176cdaade7c2ac58e738b/lib/loofah/scrubber.rb#L103-L106

which calls allowed_element?:

https://github.com/flavorjones/loofah/blob/369a54f97ce8effd797176cdaade7c2ac58e738b/lib/loofah/html5/scrub.rb#L16-L18

Which uses ALLOWED_ELEMENTS_WITH_LIBXML2 -- basically this allowlist which hello is not a member of:

https://github.com/flavorjones/loofah/blob/369a54f97ce8effd797176cdaade7c2ac58e738b/lib/loofah/html5/safelist.rb#L49-L144

If we use something in the list instead, like audio, we see Loofah keeps it around:

Loofah.fragment("<audio message").scrub!(:strip).to_html
# => "<audio></audio>"

I hope that makes sense!

piyush-ally commented 3 years ago

Thank you @flavorjones for an amazing explanation of underlying code.

flavorjones / loofah

How to bypass characters like less than character when sanitising the data #218