Closed piyush-ally closed 3 years ago
Hi! Thanks for asking this question.
This code snippet uses #scrub_fragment, which does two things:
1. parses the string into a Nokogiri::HTML::DocumentFragment
2. scrubs that DocumentFragment
Let's separate these two operations to see what's going on ...
Loofah.fragment("<hello message").children
# => [#<Nokogiri::XML::Element:0x2bc name="hello" attributes=[#<Nokogiri::XML::Attr:0x2d0 name="message">]>]
Interesting: Nokogiri parses that fragment into a <hello></hello> element. Why is that? Nokogiri (actually, libxml2) treats this as a "markup error" and tries to fix it:
Loofah.fragment("<hello message").errors
# =>
# [#<Nokogiri::XML::SyntaxError: 1:27: ERROR: Tag hello invalid>,
# #<Nokogiri::XML::SyntaxError: 1:27: ERROR: Couldn't find end of Start Tag hello>]
If your intention is to have this string interpreted as a "text node" that equals <hello message, you should be aware that a bare < in an HTML text node is considered malformed, and you should use &lt; instead. You may want to consider HTML-escaping anything that's a text node before passing it to Loofah:
CGI.escapeHTML("<hello message")
# => "&lt;hello message"
Loofah.fragment(CGI.escapeHTML("<hello message"))
# => #(DocumentFragment:0x3d4 { name = "#document-fragment", children = [ #(Text "<hello message")] })
Loofah.fragment(CGI.escapeHTML("<hello message")).to_html
# => "&lt;hello message"
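As a rough sketch of the escaping step on its own, here is a version that uses only Ruby's standard library and makes no Loofah calls. The helper name `escape_text_for_html` is hypothetical and exists only for illustration; `CGI.escapeHTML` and `CGI.unescapeHTML` are the real standard-library methods.

```ruby
require "cgi"

# Hypothetical helper (not part of Loofah): treat untrusted input as
# plain text rather than markup by HTML-escaping it.
def escape_text_for_html(text)
  CGI.escapeHTML(text)
end

escaped = escape_text_for_html("<hello message")
puts escaped                    # the bare "<" has become the entity "&lt;"
puts CGI.unescapeHTML(escaped)  # unescaping round-trips to the original string
```

If the escaped string is later rendered as HTML, a browser will display the original `<hello message` text, so there is usually no need to unescape it by hand.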
The <hello> element is being removed by the Strip scrubber. The documentation says:
+:strip+ removes unknown/unsafe tags
Is <hello></hello> a known and safe tag? Let's look at the code. The Strip scrubber calls html5lib_sanitize, which calls allowed_element?, which checks ALLOWED_ELEMENTS_WITH_LIBXML2 -- basically an allowlist of element names, which hello is not a member of.
If we use something in the list instead, like audio, we see Loofah keeps it around:
Loofah.fragment("<audio message").scrub!(:strip).to_html
# => "<audio></audio>"
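To make the allowlist idea concrete, here is a simplified stand-in written in plain Ruby. It is not Loofah's implementation: `SKETCH_ALLOWED_ELEMENTS` and `sketch_allowed_element?` are hypothetical names, and the set below is a tiny, hand-picked subset rather than the real list.

```ruby
require "set"

# Hypothetical, heavily abbreviated stand-in for an element allowlist.
# The real list in Loofah is much longer.
SKETCH_ALLOWED_ELEMENTS = Set.new(%w[a audio b div em p span strong]).freeze

# Stand-in for the membership check: true only if the element name
# appears on the allowlist.
def sketch_allowed_element?(name)
  SKETCH_ALLOWED_ELEMENTS.include?(name.downcase)
end

%w[hello audio].each do |name|
  verdict = sketch_allowed_element?(name) ? "kept" : "stripped"
  puts "#{name}: #{verdict}"
end
```

The design point is that the check is an allowlist, not a denylist: any element name that is not explicitly listed, including a made-up one like hello, is treated as unknown and stripped.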
I hope that makes sense!
Thank you @flavorjones for an amazing explanation of underlying code.
When there is no closing tag, why is <hello getting removed? This seems like incorrect behaviour. Is there a way to bypass this and return <hello message in such cases? Let me know if any more information is required from my side.