flavorjones / loofah

Ruby library for HTML/XML transformation and sanitization
MIT License

Loofah removes &nbsp; #240

Closed: wizardofosmium closed this issue 2 years ago

wizardofosmium commented 2 years ago

There are times when &nbsp; is actually needed. Unfortunately, Loofah removes it.

> Loofah.fragment("&nbsp; != &#160;").to_s
=> "  !=  "

Could you either make:

flavorjones commented 2 years ago

Hi! Unfortunately, this is not behavior that Loofah directly controls; it's how libxml2 parses the input:

>> x = Nokogiri::HTML4::DocumentFragment.parse("&nbsp; != &#160;")
=>
#(DocumentFragment:0xc300 {
...
>> x.to_html
=> "  !=  "
>> x
=>
#(DocumentFragment:0xc300 {
  name = "#document-fragment",
  children = [ #(Text "  !=  ")]
  })

although the gumbo parser used by Nokogiri::HTML5 is better:

>> x = Nokogiri::HTML5::DocumentFragment.parse("&nbsp; != &#160;")
=>
#(DocumentFragment:0xe894 {
...
>> x.to_html
=> "&nbsp; != &nbsp;"
>> x
=>
#(DocumentFragment:0xe894 {
  name = "#document-fragment",
  children = [ #(Text "  !=  ")]
  })

Because this behavior is inherited from libxml2, there's nothing we can easily do in Nokogiri or Loofah to change it.

Note that we're planning to update Loofah to use Nokogiri::HTML5 when it's available: https://github.com/flavorjones/loofah/pull/239 which is blocked on Nokogiri v1.14.0 being released (soon!).
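One way to live with the libxml2 behavior described above is to post-process the sanitized string. The sketch below assumes the entities come back from the parser as literal U+00A0 characters, as the Text node in the inspect output suggests. The helper name `reencode_nbsp` is hypothetical and is not part of Loofah or Nokogiri.

```ruby
# Hypothetical helper (not part of Loofah or Nokogiri): turn literal
# U+00A0 characters back into the named entity after sanitization.
NBSP = "\u00A0"

def reencode_nbsp(html)
  html.gsub(NBSP, "&nbsp;")
end

# Stand-in for sanitizer output that contains literal no-break spaces.
sanitized = "#{NBSP} != #{NBSP}"
puts reencode_nbsp(sanitized)  # prints "&nbsp; != &nbsp;"
```

This is lossy in one respect: it cannot tell whether the input used &nbsp;, &#160;, or a literal character, so every no-break space comes out as &nbsp;.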

wizardofosmium commented 2 years ago

Thanks for the explanation @flavorjones 👍

It looks like I'll have to hack around it with something like:

string = "&nbsp; != &#160; or  "
protected_string = string.gsub(/&nbsp;/, "PROTECTEDNBSP").gsub(/&#160;/, "PROTECTED160")
Loofah.fragment(protected_string).to_s.gsub(/PROTECTEDNBSP/, "&nbsp;").gsub(/PROTECTED160/, "&#160;")

(Just have to hope that the input doesn't contain PROTECTEDNBSP or PROTECTED160 😬)

Any other suggestions would be welcome. Cheers!
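The collision worry can be reduced by generating the placeholder at runtime instead of hard-coding it. Here is a minimal sketch of that idea, assuming the sanitizer leaves plain alphanumeric text untouched. `protect_nbsp` is a hypothetical helper, not a Loofah API, and the sanitizer is passed in as a block so the example runs without Loofah installed.

```ruby
require "securerandom"

# Hypothetical helper, not part of Loofah. Yields the protected string to a
# block and restores the entities in whatever the block returns.
def protect_nbsp(html)
  # Pick a random alphanumeric token; retry if the input already contains it.
  token = "NBSP#{SecureRandom.hex(8)}"
  token = "NBSP#{SecureRandom.hex(8)}" while html.include?(token)

  # One placeholder per entity form so each is restored as it was written.
  placeholders = { "&nbsp;" => "#{token}N", "&#160;" => "#{token}D" }
  protected_html = html.gsub(/&nbsp;|&#160;/, placeholders)
  yield(protected_html).gsub(/#{token}[ND]/, placeholders.invert)
end

# Stand-in sanitizer for illustration; with Loofah installed the block
# could be { |s| Loofah.fragment(s).to_s } instead.
result = protect_nbsp("&nbsp; != &#160;") { |s| s.gsub(/<[^>]*>/, "") }
puts result  # prints "&nbsp; != &#160;"
```

The token is checked against the input before use, so a collision would need the sanitizer itself to produce the random string.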

Yegorov commented 2 years ago

Hello everyone, thanks for the answers! Loofah removes not only &nbsp; but also &mdash;, &laquo;, &raquo;, and others. For my case the workaround looks like this:

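string = "&nbsp; &#160;   &mdash; &laquo; &raquo;"
# Match only well-formed entity names or numeric references, not arbitrary text between "&" and ";".
protected_string = string.gsub(/&(#?[0-9A-Za-z]+);/, '_PROTECTED\1_')
Loofah.fragment(protected_string).to_s.gsub(/_PROTECTED(#?[0-9A-Za-z]+)_/, '&\1;')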
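A pattern that protects every entity may also hide entities the sanitizer would otherwise decode and inspect, for example inside attribute values. A more cautious variant is to protect only an explicit allowlist. The following is a sketch under that assumption; `SAFE_ENTITIES`, `protect_entities`, and `restore_entities` are hypothetical names, and the list should be adapted to the content at hand.

```ruby
# Hypothetical allowlist; extend it with the entities your content needs.
SAFE_ENTITIES = %w[nbsp #160 mdash laquo raquo].freeze
SAFE_NAMES = Regexp.union(SAFE_ENTITIES)

# Replace allowlisted entities with placeholders before sanitizing.
def protect_entities(html)
  html.gsub(/&(#{SAFE_NAMES});/) { "_PROTECTED#{Regexp.last_match(1)}_" }
end

# Turn the placeholders back into entities after sanitizing.
def restore_entities(html)
  html.gsub(/_PROTECTED(#{SAFE_NAMES})_/) { "&#{Regexp.last_match(1)};" }
end

# &#115; is not on the allowlist, so it stays visible to the sanitizer.
puts protect_entities("&laquo;hi&raquo; &#115;")
# prints "_PROTECTEDlaquo_hi_PROTECTEDraquo_ &#115;"
```

The sanitizer call, such as Loofah.fragment(...).to_s, would sit between the two helpers.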