cantino / ruby-readability

Port of arc90's readability project to Ruby
Apache License 2.0
919 stars 170 forks source link

Can't handle headers as a hash when guessing encoding #60

Closed boncey closed 10 years ago

boncey commented 10 years ago

I'm using readability with various HTML responses returned from Faraday.

I've just discovered the awesomeness of the 'guess_html_encoding' gem and am trying to get it to work with readability.

This code snippet works fine: require 'readability' require 'guess_html_encoding' require 'faraday'

url = "http://inews.mingpao.com/htm/INews/20130319/gb21451k.htm"
response = Faraday.get(url)
headers = response.headers
html = response.body
guess = GuessHtmlEncoding.guess(html, headers)
puts "Encoding is #{guess}"
doc = Readability::Document.new(html, :encoding => guess, :remove_empty_nodes => true)

However, shortcutting by passing the headers directly to readability fails: require 'readability' require 'faraday'

url = "http://inews.mingpao.com/htm/INews/20130319/gb21451k.htm"
response = Faraday.get(url)
headers = response.headers
html = response.body
doc = Readability::Document.new(html, :html_headers => headers, :remove_empty_nodes => true)

I see this error: NoMethodError: undefined method 'gsub' for #<Faraday::Utils::Headers:0x007fb2fc155de0> Faraday::Utils::Headers subclasses Hash by the way.

It looks like readability is assuming the headers are a string - the README is unclear on what type it should be but as guess_html_encoding takes a hash then I think readability should too.

Thanks, Darren.

cantino commented 10 years ago

When you pass in options[:html_headers] to Readability, it just gets passed on to GuessHtmlEncoding.encode, which appears to only accept a string. See line 45 of guess_html_encoding.rb: https://github.com/cantino/guess_html_encoding/blob/master/lib/guess_html_encoding.rb#L45

So I see two options. One would be for you to pass in a string of headers, joined by \n characters, like:

headers = headers.map {|k, v| "#{k}: #{v}" }.join("\n")

Alternatively, you could send a pull request to GuessHtmlEncoding that changes line 45 to just:

encoding = guess(html_copy, headers || '')

and then also changes line 12 to:

headers = headers.dup.force_encoding("ASCII-8BIT").gsub(/[\r\n]+/, "\n")
boncey commented 10 years ago

Cool, thanks. I have it working as described in my working example above at the moment - will re-visit when time allows.