Closed by pagojo 12 years ago
@pagojo have you had any luck figuring this out?
I haven't looked at it yet.
Perhaps I should hmmm...
On first look it appears that when parsing this page, score_paragraphs(min_text_length) returns an empty hash, so subsequent calls to select_best_candidate fail.
Irrespective of whether score_paragraphs is right or wrong, select_best_candidate should not die like this.
In fact there is a check in select_best_candidate for whether a best candidate exists, which also fails because @html.css("body").first is nil for the Lifehacker page :-/
best_candidate = sorted_candidates.first || { :elem => @html.css("body").first, :content_score => 0 }
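To make the failure mode concrete, here is a minimal sketch of a nil-safe fallback. It uses plain hashes rather than Nokogiri nodes, and pick_best_candidate is a hypothetical stand-in, not the gem's select_best_candidate; returning nil when there is no body is an assumption about what the caller would want.

```ruby
# Hypothetical stand-in for select_best_candidate, not the gem's actual code.
# candidates: hash of { key => { :elem => ..., :content_score => ... } }
# body: whatever @html.css("body").first returned (may be nil)
def pick_best_candidate(candidates, body)
  # Highest content score first.
  sorted = candidates.values.sort_by { |c| -c[:content_score] }
  best = sorted.first || { :elem => body, :content_score => 0 }
  # With no candidates and no <body>, :elem is nil; bail out here
  # instead of letting a later method call on nil raise.
  return nil if best[:elem].nil?
  best
end
```

A caller would then check for nil and return empty content instead of raising.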
It seems that the body is removed in:
def remove_unlikely_candidates!
  @html.css("*").each do |elem|
    str = "#{elem[:class]}#{elem[:id]}"
    if str =~ REGEXES[:unlikelyCandidatesRe] && str !~ REGEXES[:okMaybeItsACandidateRe] && elem.name.downcase != 'body'
      debug("Removing unlikely candidate - #{str}")
      elem.remove
    end
  end
end
because perhaps the page has a class and id assigned to the <body>:
<body id="lifehacker" class="page-ajax_post">
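A rough sketch of how a class/id check like this can catch the <body> itself. The UNLIKELY pattern below is made up for illustration (the gem's real REGEXES are not shown in this thread), and Elem is a hypothetical stand-in for a Nokogiri node. The protect_body flag models the elem.name.downcase != 'body' condition in the method above.

```ruby
# Hypothetical pattern for illustration only; NOT the gem's real regex.
UNLIKELY = /ajax|sidebar|comment/i

# Hypothetical stand-in for a parsed element.
Elem = Struct.new(:name, :klass, :id)

def unlikely?(elem, protect_body: true)
  # Same concatenation as in remove_unlikely_candidates!: class, then id.
  str = "#{elem.klass}#{elem.id}"
  # Never treat <body> as removable when the guard is on.
  return false if protect_body && elem.name.downcase == 'body'
  !!(str =~ UNLIKELY)
end

body = Elem.new('body', 'page-ajax_post', 'lifehacker')
unlikely?(body, protect_body: false) # true under the made-up pattern
unlikely?(body)                      # false, the guard skips <body>
```

Under these assumptions, any pattern that happens to match the body's class or id would remove the whole document unless the element name is checked first.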
I believe I've nailed it, patch coming later today.
Sorry @iterationlabs this is not my week it seems :-/
I just observed the following while retrieving a Lifehacker page:
as well as when I do
The result is the following exception due to a nil element.
Any ideas if this is a bug or am I doing something wrong?