cantino / ruby-readability

Port of arc90's readability project to Ruby
Apache License 2.0
925 stars 170 forks source link

Select_best_candidate fails on Lifehacker page #23

Closed pagojo closed 12 years ago

pagojo commented 12 years ago

Sorry @iterationlabs this is not my week it seems :-/

I just observed the following while retrieving a Lifehacker page:

require 'open-uri'
require 'readability'
page = open("http://lifehacker.com/5866449/lifehacker-faceoff-the-best-digital-digests-on-ipad-and-iphone").read
rbody = Readability::Document.new(page).content

as well as when I do

rbody = Readability::Document.new(page, :do_not_guess_encoding => true).content

The result is the following exception due to a nil element.

#<NoMethodError: undefined method `name' for nil:NilClass>
NoMethodError: undefined method `name' for nil:NilClass
    from /home/user/.rvm/gems/ruby-1.9.2-p290@classic/gems/ruby-readability-0.5.2/lib/readability.rb:194:in `select_best_candidate'
    from /home/user/.rvm/gems/ruby-1.9.2-p290@classic/gems/ruby-readability-0.5.2/lib/readability.rb:44:in `prepare_candidates'
    from /home/user/.rvm/gems/ruby-1.9.2-p290@classic/gems/ruby-readability-0.5.2/lib/readability.rb:130:in `content'

Any ideas if this is a bug or am I doing something wrong?

cantino commented 12 years ago

@pagojo have you had any luck figuring this out?

pagojo commented 12 years ago

I haven't looked at it yet.

Perhaps I should hmmm...

pagojo commented 12 years ago

On first look it appears that when parsing this page score_paragraphs(min_text_length) returns an empty hash, so subsequent calls to select_best_candidate fail.

Irrespective of whether score_paragraphs is right or wrong select_best_candidate should not die like this. In fact there is a check in select_best_candidate for whether a best candidate exists which also fails because @html.css("body").first is nil for the Lifehacker page :-/

best_candidate = sorted_candidates.first || { :elem => @html.css("body").first, :content_score => 0 }

It seems that the body is removed in:

def remove_unlikely_candidates!
      @html.css("*").each do |elem|
        str = "#{elem[:class]}#{elem[:id]}"
        if str =~ REGEXES[:unlikelyCandidatesRe] && str !~ REGEXES[:okMaybeItsACandidateRe] && elem.name.downcase != 'body'
          debug("Removing unlikely candidate - #{str}")
          elem.remove
        end
      end
    end

because perhaps the page has a class and id assigned to the <body> <body id="lifehacker" class="page-ajax_post">

pagojo commented 12 years ago

I believe I've nailed it, patch coming later today.

pagojo commented 12 years ago

committed