Closed gferguson closed 15 years ago
<description>fd.nl - De nieuwsbron van ondernemend Nederland. Financieel-economisch nieuws, achtergronden en analyses. Artikelen over beleggen, carrière en ondernemen. Uitgebreide koersen, rentestanden en bedrijfsinformatie. Met veel aandacht voor aandelenmarkten en beleggingsfondsen.</description>
Looking into this now. Sorry for the delay.
Hi! This is actually a Nokogiri bug that is fixed in master on github. The commit fixing this is ed9f8424da77631a149c38de55fa2100a4cf95f1.
You can grab the Nokogiri nightly builds by running:
$ sudo gem install nokogiri -s http://tenderlovemaking.com
This version of Nokogiri should be released in the next few weeks.
Closing
Just got this under Ruby 1.9.1 while parsing http://www.fd.nl/nieuws/laatstenieuws/?view=RSS
/home/greg/.rvm/gems/ruby/1.9.1/gems/nokogiri-1.3.3/lib/nokogiri/xml/fragment_handler.rb:42:in `characters': incompatible encoding regexp match (UTF-8 regexp with ASCII-8BIT string) (Encoding::CompatibilityError)
    from /home/greg/.rvm/gems/ruby/1.9.1/gems/nokogiri-1.3.3/lib/nokogiri/html/sax/parser.rb:34:in `native_parse_memory'
    from /home/greg/.rvm/gems/ruby/1.9.1/gems/nokogiri-1.3.3/lib/nokogiri/html/sax/parser.rb:34:in `parse_memory'
    from /home/greg/.rvm/gems/ruby/1.9.1/gems/nokogiri-1.3.3/lib/nokogiri/xml/sax/parser.rb:83:in `parse'
    from /home/greg/.rvm/gems/ruby/1.9.1/gems/nokogiri-1.3.3/lib/nokogiri/xml/document_fragment.rb:11:in `initialize'
    from /home/greg/.rvm/gems/ruby/1.9.1/gems/loofah-0.2.2/lib/loofah/html/document_fragment.rb:18:in `new'
    from /home/greg/.rvm/gems/ruby/1.9.1/gems/loofah-0.2.2/lib/loofah/html/document_fragment.rb:18:in `parse'
    from /home/greg/.rvm/gems/ruby/1.9.1/gems/loofah-0.2.2/lib/loofah.rb:179:in `fragment'
    from /home/greg/.rvm/gems/ruby/1.9.1/gems/loofah-0.2.2/lib/loofah.rb:184:in `scrub_fragment'
    from /home/greg/code/development/agents/rss/rewrite/feeds.rb:1362:in `strip_html'
    from /home/greg/code/development/agents/rss/rewrite/feeds.rb:442:in `get_feed_summary'
    from /home/greg/code/development/agents/rss/rewrite/feeds.rb:888:in `update_rss_feed'
    from /home/greg/code/development/agents/rss/rewrite/feeds.rb:1130:in `update_feed'

The strip_html() method looks like this...
class String
  def strip_html
    html = Nokogiri::HTML.fragment(self.dup)
    (html/:br).each {|_br| _br.swap(' ') }
    (html/:p).each {|_p| _p.swap(_p.content + ' ') }
    Loofah.scrub_fragment(html.content, :strip).text
  end
end
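For readers who hit the same Encoding::CompatibilityError, here is a minimal sketch in plain Ruby (no Nokogiri or Loofah involved) of the general class of error named in the trace: a regexp fixed to UTF-8 matched against a string tagged ASCII-8BIT that contains non-ASCII bytes. The sample word, the pattern, and the variable names are illustrative assumptions, not the code from fragment_handler.rb. The re-tagging step at the end is a possible stopgap that assumes the feed bytes really are UTF-8; the fix described in this thread is the Nokogiri upgrade.

```ruby
# Illustrative input: "carrière" as raw bytes, tagged ASCII-8BIT (binary),
# which is how text read from a socket often arrives.
raw = "carri\xC3\xA8re".b

# A regexp containing a \u escape is fixed to UTF-8.
pattern = /\u00e8/

# Matching a fixed-UTF-8 regexp against a binary string that holds
# non-ASCII bytes raises Encoding::CompatibilityError.
error = begin
  raw =~ pattern
  nil
rescue Encoding::CompatibilityError => e
  e
end

# Assumption: the bytes are valid UTF-8, so re-tagging them is safe.
# force_encoding changes only the encoding label, not the bytes.
fixed = raw.dup.force_encoding(Encoding::UTF_8)
matched = !(fixed =~ pattern).nil?

puts error.class
puts matched
```

Under that assumption, calling something like `text.dup.force_encoding(Encoding::UTF_8)` before handing the string to strip_html may avoid the error on the affected Nokogiri version, though this has not been verified against nokogiri-1.3.3 itself.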
The text it's working on is: