flavorjones / loofah

Ruby library for HTML/XML transformation and sanitization
MIT License

Ruby 1.9.1 and loofah 0.2.2 Encoding error. Ruby 1.8.7 is OK. #7

Closed gferguson closed 15 years ago

gferguson commented 15 years ago

Just got this under Ruby 1.9.1 while parsing http://www.fd.nl/nieuws/laatstenieuws/?view=RSS

    /home/greg/.rvm/gems/ruby/1.9.1/gems/nokogiri-1.3.3/lib/nokogiri/xml/fragment_handler.rb:42:in `characters': incompatible encoding regexp match (UTF-8 regexp with ASCII-8BIT string) (Encoding::CompatibilityError)
        from /home/greg/.rvm/gems/ruby/1.9.1/gems/nokogiri-1.3.3/lib/nokogiri/html/sax/parser.rb:34:in `native_parse_memory'
        from /home/greg/.rvm/gems/ruby/1.9.1/gems/nokogiri-1.3.3/lib/nokogiri/html/sax/parser.rb:34:in `parse_memory'
        from /home/greg/.rvm/gems/ruby/1.9.1/gems/nokogiri-1.3.3/lib/nokogiri/xml/sax/parser.rb:83:in `parse'
        from /home/greg/.rvm/gems/ruby/1.9.1/gems/nokogiri-1.3.3/lib/nokogiri/xml/document_fragment.rb:11:in `initialize'
        from /home/greg/.rvm/gems/ruby/1.9.1/gems/loofah-0.2.2/lib/loofah/html/document_fragment.rb:18:in `new'
        from /home/greg/.rvm/gems/ruby/1.9.1/gems/loofah-0.2.2/lib/loofah/html/document_fragment.rb:18:in `parse'
        from /home/greg/.rvm/gems/ruby/1.9.1/gems/loofah-0.2.2/lib/loofah.rb:179:in `fragment'
        from /home/greg/.rvm/gems/ruby/1.9.1/gems/loofah-0.2.2/lib/loofah.rb:184:in `scrub_fragment'
        from /home/greg/code/development/agents/rss/rewrite/feeds.rb:1362:in `strip_html'
        from /home/greg/code/development/agents/rss/rewrite/feeds.rb:442:in `get_feed_summary'
        from /home/greg/code/development/agents/rss/rewrite/feeds.rb:888:in `update_rss_feed'
        from /home/greg/code/development/agents/rss/rewrite/feeds.rb:1130:in `update_feed'

The strip_html() method looks like this...

    class String
      def strip_html
        html = Nokogiri::HTML.fragment(self.dup)
        (html/:br).each {|_br| _br.swap(' ') }
        (html/:p).each {|_p| _p.swap(_p.content + ' ') }
        Loofah.scrub_fragment(html.content, :strip).text
      end
    end

The text it's working on is:

fd.nl - De nieuwsbron van ondernemend Nederland. Financieel-economisch nieuws, achtergronden en analyses. Artikelen over beleggen, carrière en ondernemen. Uitgebreide koersen, rentestanden en bedrijfsinformatie. Met veel aandacht voor aandelenmarkten en beleggingsfondsen.

gferguson commented 15 years ago

    <description>fd.nl - De nieuwsbron van ondernemend Nederland. Financieel-economisch nieuws, achtergronden en analyses. Artikelen over beleggen, carri&#xE8;re en ondernemen. Uitgebreide koersen, rentestanden en bedrijfsinformatie. Met veel aandacht voor aandelenmarkten en beleggingsfondsen.</description>

flavorjones commented 15 years ago

Looking into this now. Sorry for the delay.

flavorjones commented 15 years ago

Hi! This is actually a Nokogiri bug, and it has already been fixed in master on GitHub. The commit that fixes it is ed9f8424da77631a149c38de55fa2100a4cf95f1.

You can grab the Nokogiri nightly builds by running:

$ sudo gem install nokogiri -s http://tenderlovemaking.com

This version of Nokogiri should be released in the next few weeks.
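
For anyone who lands here with the same exception: the error class in the trace above can be reproduced in plain Ruby (1.9 or later) without Nokogiri or Loofah. The sketch below is only an illustration of the general failure mode, not the code in fragment_handler.rb and not the change made in the fixing commit. The sample string and the `try_match` helper are hypothetical, chosen to mirror the "carrière" in the feed text.

```ruby
# Hypothetical stand-in for the feed text: "carrière" as UTF-8 bytes,
# then the same bytes relabelled as binary (ASCII-8BIT).
utf8_text   = "carri\u00E8re"
binary_text = utf8_text.dup.force_encoding(Encoding::ASCII_8BIT)

# A regexp literal containing a \u escape has a fixed UTF-8 encoding.
pattern = /\u00E8/

# Hypothetical helper: return the match index, or the exception if the
# regexp and string encodings are incompatible.
def try_match(pattern, string)
  pattern =~ string
rescue Encoding::CompatibilityError => e
  e
end

# UTF-8 regexp against a binary string holding non-ASCII bytes: raises.
binary_result = try_match(pattern, binary_text)

# Same bytes relabelled as UTF-8 (no transcoding happens): matches.
utf8_result = try_match(pattern, binary_text.dup.force_encoding(Encoding::UTF_8))

puts binary_result.class   # Encoding::CompatibilityError
puts utf8_result.inspect   # 5
```

Note that `force_encoding` only changes the label on the bytes, so relabelling is safe only when the bytes really are valid UTF-8. Whether that is an appropriate workaround in `strip_html` depends on where the ASCII-8BIT string originates, which this thread does not establish.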

flavorjones commented 15 years ago

Closing