kmatrah closed this issue 12 years ago
That page is not UTF-8; it's ISO-8859-1. What we need is a method to guess the encoding of a page before parsing it. I'm working on this and will add it shortly.
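For anyone following along, here is a rough sketch of what such a guess could look like. It is only an illustration, not the code that ended up in the gem: `guess_encoding` and `to_utf8` are hypothetical names, and the approach (check the HTTP `Content-Type` header, then a `<meta>` charset declaration, then fall back to UTF-8) is an assumption about one reasonable strategy.

```ruby
# Hypothetical helpers for illustration only; not the API of any released gem.

# Returns the declared charset name (upcased), or nil if none is found.
def guess_encoding(html, content_type = nil)
  # 1. Prefer a charset declared in the HTTP Content-Type header.
  if content_type && content_type =~ /charset=["']?([\w-]+)/i
    return $1.upcase
  end

  # 2. Otherwise look for a <meta> charset declaration near the top of the page.
  #    Scan the raw bytes so that invalid UTF-8 sequences cannot raise here.
  head = html.dup.force_encoding("BINARY")[0, 4096]
  if head =~ /<meta[^>]+charset=["']?([\w-]+)/i
    return $1.upcase
  end

  nil
end

# Relabels the bytes with the guessed encoding and transcodes them to UTF-8.
# Assumes UTF-8 when nothing is declared; an unknown charset name would raise.
def to_utf8(html, content_type = nil)
  encoding = guess_encoding(html, content_type) || "UTF-8"
  html.dup.force_encoding(encoding).encode("UTF-8", :invalid => :replace, :undef => :replace)
end
```

With something like this, the fetched body would go through `to_utf8` before any regular expression is run on it, which is where the `invalid byte sequence in UTF-8` error comes from.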
Could you check if this is fixed in the prerelease version of 0.5.0?
gem install ruby-readability --pre
and see https://github.com/iterationlabs/ruby-readability/tree/new_gem
It works fine now with 0.5.0.pre on MRI 1.9, thank you!
On JRuby 1.6.5, the encoding seems to work too, but it sometimes crashes for another reason:
readability http://mashable.com/2011/10/26/tango-windows-phone-7-5/
null:- 1:in `renameNode': org.w3c.dom.DOMException: NAMESPACE_ERR: An attempt is made to create or change an object in a way which is incorrect with regard to namespaces.
Thanks for extracting and sharing guess_html_encoding!!
Great to hear, thanks for testing it!
0.5.0 has been released!
It seems that pages that contain UTF-8 characters still cannot be processed.
For example, using /bin/readability on a popular French website:
readability http://www.developpez.com/actu/35379/Novell-cede-Mono-a-Xamarin-une-mise-a-jour-de-la-plateforme-est-annoncee-pour-l-automne/
It crashes at line 216:
lib/readability.rb:216:in `=~': invalid byte sequence in UTF-8 (ArgumentError)
from /Users/kimious/.rvm/gems/ruby-1.9.2-p180/gems/ruby-readability-0.2.3/lib/readability.rb:216:in `!~'
I can reproduce the bug for a lot of French webpages.