Closed YavorIvanov closed 10 years ago
First clue I have around this is that it's probably a Nokogiri issue.
The error is raised when you call
obj.content
It does seem like a Nokogiri issue. Do you have any idea how to fix it?
Not really. Still working on it.
My guess is that the Nokogiri parsing should be done with the fix option on. I'm still looking at the code, but I probably won't have the energy to finish this tonight.
From what I've toyed around with, it seems that readability should be using Nokogiri's DocumentFragment.parse:
Nokogiri::HTML::DocumentFragment.parse
This seems to be better suited for working with fragments, such as the example where the content has no parent tag. I still haven't figured out the readability internals well enough to make a meaningful change, though.
Interesting. Would you mind submitting a pull request?
Sure. Let me figure things out and I'll make a pull request.
Hi, I'm not sure how to fix the code. I've narrowed it down to these lines, but I can't change them so that the tests pass. I'm probably missing something.
([node] + node.css("*")).each do |el|
  # If element is in whitelist, delete all its attributes
  if whitelist[el.node_name]
    el.attributes.each { |a, x| el.delete(a) unless @options[:attributes] && @options[:attributes].include?(a.to_s) }
  # Otherwise, replace the element with its contents
  else
    if replace_with_whitespace[el.node_name]
      el.swap(Nokogiri::XML::Text.new(' ' << el.text << ' ', el.document))
    else
      el.swap(Nokogiri::XML::Text.new(el.text, el.document))
    end
  end
end
Do you know which line from that code is failing with your error case?
It may be that you should wrap your input in a <div></div>, but I'm not sure. If you send a pull request with a failing spec, I'll take a crack at it.
I ran into this issue too. It can be easily reproduced with the following code:
require 'readability'
page = '<div>test</div>'
doc = Readability::Document.new(page, :tags => [])
puts doc.content
The problem is in the else-clause of the code cited by @YavorIvanov, and I don't think it is a Nokogiri issue:
# Otherwise, replace the element with its contents
else
  if replace_with_whitespace[el.node_name]
    el.swap(Nokogiri::XML::Text.new(' ' << el.text << ' ', el.document))
  else
    el.swap(Nokogiri::XML::Text.new(el.text, el.document))
  end
end
The problem is that el might be a root node, which doesn't have a parent, so swap() throws a "Could not reparent node" error. I added some logic:
if el.parent.nil?
  node = Nokogiri::XML::Text.new(el.text, el.document)
  break
else
  if replace_with_whitespace[el.node_name]
    el.swap(Nokogiri::XML::Text.new(' ' << el.text << ' ', el.document))
  else
    el.swap(Nokogiri::XML::Text.new(el.text, el.document))
  end
end
This could fix the issue. Do you have any concerns about this change?
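To show the idea behind the guard in isolation, here is a small sketch that does not use Nokogiri at all. SketchNode and flatten_to_text are hypothetical stand-ins invented for this illustration; they only mimic the behaviour described above (swapping needs a parent) and are not the library's real classes:

```ruby
# Hypothetical stand-in for a parsed node; this is NOT Nokogiri's API.
class SketchNode
  attr_reader :name, :text, :children
  attr_accessor :parent

  def initialize(name, text = '', children = [])
    @name = name
    @text = text
    @children = children
    children.each { |child| child.parent = self }
  end

  # Replace this node inside its parent; a parentless node cannot be swapped.
  def swap(replacement)
    raise ArgumentError, 'Could not reparent node' if parent.nil?

    index = parent.children.index(self)
    parent.children[index] = replacement
    replacement.parent = parent
    replacement
  end
end

# Turn an element into a text node, guarding the parentless (root) case.
def flatten_to_text(el)
  text = SketchNode.new('text', el.text)
  return text if el.parent.nil? # root: return the text node instead of swapping

  el.swap(text)
end

root = SketchNode.new('div', 'test')
puts flatten_to_text(root).name
```

The design point is the same as in the proposed change: check for a missing parent first, and only call swap when there is something to swap within.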
Thanks for working on this @magic003! I think that's a reasonable solution. Can you send a pull request with a spec?
Pull request has been sent.
Thanks to @magic003 for fixing this!
Hi,
I stumbled upon an interesting problem. I think this shouldn't really happen.
Running with DIV tags produces the expected result though.
In case the source of the page changes, I'm giving a preview:
Specs: ruby-readability (0.5.7) rails (4.0.0.rc1) ruby 2.0.0p0 (2013-02-24 revision 39474) [x86_64-darwin12.3.0]