cantino / ruby-readability

Port of arc90's readability project to Ruby
Apache License 2.0

UTF-8 characters again #6

Closed: kmatrah closed this issue 12 years ago

kmatrah commented 13 years ago

It seems that pages that contain UTF-8 characters still cannot be processed.

For example, using /bin/readability on a popular French website:

readability http://www.developpez.com/actu/35379/Novell-cede-Mono-a-Xamarin-une-mise-a-jour-de-la-plateforme-est-annoncee-pour-l-automne/

It crashes at line 216:

lib/readability.rb:216:in `=~': invalid byte sequence in UTF-8 (ArgumentError)
from /Users/kimious/.rvm/gems/ruby-1.9.2-p180/gems/ruby-readability-0.2.3/lib/readability.rb:216:in `!~'

I can reproduce the bug on many French webpages.
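For anyone who wants to see the failure without fetching a page, here is a minimal sketch that should trigger the same class of error. It assumes the underlying cause is ISO-8859-1 bytes being treated as UTF-8; the sample string and variable names are made up for illustration and are not taken from readability.rb.

```ruby
# "é" in ISO-8859-1 is the single byte 0xE9, which is not a valid UTF-8 sequence on its own.
latin1_bytes = "caf\xE9".b # raw bytes, tagged as binary

# Mislabel the bytes as UTF-8, which is roughly what happens when a Latin-1 page is assumed to be UTF-8.
mislabeled = latin1_bytes.dup.force_encoding("UTF-8")
mislabeled_is_valid = mislabeled.valid_encoding?

# Matching a regex against the mislabeled string raises ArgumentError.
raised = nil
begin
  mislabeled =~ /caf/
rescue ArgumentError => e
  raised = e
end

# Declaring the real source encoding and transcoding makes the match safe.
fixed = latin1_bytes.dup.force_encoding("ISO-8859-1").encode("UTF-8")
match_index = (fixed =~ /caf/)
```

The point of the sketch is that the bytes themselves are fine; only the encoding label attached to them is wrong.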

cantino commented 13 years ago

That page is not UTF-8; it's ISO-8859-1. What we need is a method to guess the encoding of a site. I'm working on this and will add it shortly.
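As a rough illustration of what guessing the encoding could look like, here is a minimal sketch that reads a charset declaration from a meta tag and transcodes to UTF-8. The helper names sketch_guess_charset and sketch_to_utf8 are hypothetical, and this is not necessarily how the eventual fix works; a real implementation would presumably also consult the HTTP Content-Type header.

```ruby
# Hypothetical helper: find a charset declaration in the raw bytes of a page.
def sketch_guess_charset(raw_html, fallback = "UTF-8")
  bytes = raw_html.b # binary copy, so the regex never sees an invalid UTF-8 string
  match = bytes.match(/<meta[^>]*charset\s*=\s*["']?\s*([A-Za-z0-9_\-]+)/i)
  Encoding.find(match ? match[1] : fallback)
rescue ArgumentError
  # Encoding.find raises ArgumentError for names Ruby does not recognize.
  Encoding.find(fallback)
end

# Hypothetical helper: relabel the bytes with the guessed encoding, then transcode to UTF-8.
def sketch_to_utf8(raw_html)
  source = sketch_guess_charset(raw_html)
  raw_html.dup.force_encoding(source)
          .encode("UTF-8", invalid: :replace, undef: :replace)
          .scrub
end

# Usage with a made-up ISO-8859-1 page (0xE9 is "é" in that encoding).
page = "<html><head><meta http-equiv=\"Content-Type\" " \
       "content=\"text/html; charset=ISO-8859-1\"></head>" \
       "<body>c\xE9d\xE9</body></html>".b
utf8_page = sketch_to_utf8(page)
```

Once the string is valid UTF-8, regex matches like the one at line 216 no longer raise.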

ghost commented 12 years ago

Could you check if this is fixed in the prerelease version of 0.5.0?

gem install ruby-readability --pre

and see https://github.com/iterationlabs/ruby-readability/tree/new_gem

kmatrah commented 12 years ago

It works fine now with 0.5.0.pre on MRI 1.9, thank you!

On JRuby 1.6.5, the encoding seems to work too, but it sometimes crashes for another reason:

readability http://mashable.com/2011/10/26/tango-windows-phone-7-5/

null:-1:in `renameNode': org.w3c.dom.DOMException: NAMESPACE_ERR: An attempt is made to create or change an object in a way which is incorrect with regard to namespaces.

Thanks for extracting and sharing guess_html_encoding!!

ghost commented 12 years ago

Great to hear, thanks for testing it!

ghost commented 12 years ago

0.5.0 has been released!