Closed wizardofosmium closed 2 years ago
Hi! Unfortunately, this isn't behavior that Loofah directly controls; it's how libxml2 parses:
>> x = Nokogiri::HTML4::DocumentFragment.parse("  != ")
=>
#(DocumentFragment:0xc300 {
...
>> x.to_html
=> "  != "
>> x
=>
#(DocumentFragment:0xc300 {
name = "#document-fragment",
children = [ #(Text " != ")]
})
although the gumbo parser used by Nokogiri::HTML5 is better:
>> x = Nokogiri::HTML5::DocumentFragment.parse("  != ")
=>
#(DocumentFragment:0xe894 {
...
>> x.to_html
=> "  != "
>> x
=>
#(DocumentFragment:0xe894 {
name = "#document-fragment",
children = [ #(Text " != ")]
})
Because this behavior is inherited from libxml2, there's nothing we can easily do in Nokogiri or Loofah to change it.
Note that we're planning to update Loofah to use Nokogiri::HTML5 when it's available (https://github.com/flavorjones/loofah/pull/239); that work is blocked on Nokogiri v1.14.0 being released (soon!).
Thanks for the explanation @flavorjones 👍
It looks like I'll have to hack around it with something like:
string = "  != or  "
protected_string = string.gsub(/&nbsp;/, "PROTECTEDNBSP").gsub(/&#160;/, "PROTECTED160")
Loofah.fragment(protected_string).to_s.gsub(/PROTECTEDNBSP/, "&nbsp;").gsub(/PROTECTED160/, "&#160;")
(Just have to hope that the input doesn't contain PROTECTEDNBSP or PROTECTED160 😬)
Any other suggestions would be welcome. Cheers!
Hello everyone, thanks for the answers!
Loofah removes not only &nbsp;. It also removes &mdash;, &laquo;, &raquo;, and others.
For my case it looks like this:
string = "&nbsp; &#160; &mdash; &laquo; &raquo;"
protected_string = string.gsub(/&(.+?);/, '_PROTECTED\1_')
Loofah.fragment(protected_string).to_s.gsub(/_PROTECTED(.+?)_/, '&\1;')
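Both workarounds above depend on a fixed placeholder that the input might already contain. A minimal sketch of one way to reduce that risk is to generate a random token per call and restrict the match to well-formed entity references. The helper name `with_protected_entities` is hypothetical and not part of Loofah or Nokogiri; the sanitizer is passed in as a block, so the helper itself makes no assumptions about Loofah's API.

```ruby
require "securerandom"

# Hypothetical helper (not part of Loofah or Nokogiri): hides entity
# references behind a random token, yields the protected string to the
# caller's sanitizer block, then restores the original references.
def with_protected_entities(html)
  # Pick a random token and retry in the unlikely case the input contains it.
  token = "PROTECTED#{SecureRandom.hex(8)}"
  token = "PROTECTED#{SecureRandom.hex(8)}" while html.include?(token)

  # Match named (&nbsp;), decimal (&#160;), and hex (&#xa0;) references only.
  protected_html = html.gsub(/&(#?[A-Za-z0-9]+);/) { "#{token}-#{$1}-" }

  # The block is expected to return the sanitized string.
  sanitized = yield(protected_html)

  # Turn each placeholder back into the entity reference it replaced.
  sanitized.gsub(/#{token}-(#?[A-Za-z0-9]+)-/) { "&#{$1};" }
end
```

With Loofah, the block would presumably be the same call used in the snippets above, for example `with_protected_entities(string) { |s| Loofah.fragment(s).to_s }`. This still assumes the sanitizer passes the placeholder text through unchanged.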
There are times when &nbsp; is actually needed. Unfortunately, Loofah removes it. Could you either make Loofah not remove them at all, or