cantino / ruby-readability

Port of arc90's readability project to Ruby
Apache License 2.0

Extra div added (is it expected?) #66

Open thbar opened 10 years ago

thbar commented 10 years ago

First, thanks for your work on readability :-)

Just some quick feedback (I'm not a heavy user myself): while upgrading an old setup today, I noticed that raw content is now wrapped in two levels of divs:

1.9.3-p484 :003 > Readability::Document.new("My content").content
 => "<div><div><p>My content</p></div></div>" 

while previously (with a two-year-old version) it was returned as:

 => "<div><p>My content</p></div>" 

Is this expected? I understand that this specific test case is a bit unrealistic (no tags at all), but I wondered whether there could be similar issues with properly formatted HTML.

cantino commented 10 years ago

Good question. I wonder which changes caused that. I'm not actively working on Readability these days, but am always willing to vet pull requests from interested contributors.

cqcn1991 commented 8 years ago

Yes, I'm having the same problem. Is there anything I can do to fix this?

borama commented 8 years ago

I am not sure whether this is expected or whether there is anything to fix, but I found the following:

The two <div> tags are added in the get_article method. The method always first wraps the found article in a <div> (here). Then it copies all child tags of the found article, and if the article itself is a tag other than <p> or <div>, it changes that tag to <div> (here). Because your article node, i.e. the parent node of the single paragraph in your input HTML, is the <body> tag, it is changed to a <div> tag, which results in two <div>s in the output.
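
To make the two steps above concrete, here is a toy model in plain Ruby. It is only a sketch of the behaviour as described in this comment, not the gem's actual get_article code: `Node`, `render`, and `wrap_article` are hypothetical names invented for the illustration, and the real method works on parsed HTML nodes rather than on a Struct.

```ruby
# Toy model of the wrapping described above; NOT the gem's real implementation.
# Node, render and wrap_article are hypothetical helpers for illustration only.
Node = Struct.new(:name, :children)

# Serialize a Node tree (or a plain text String) back to markup.
def render(node)
  return node if node.is_a?(String)
  "<#{node.name}>#{node.children.map { |child| render(child) }.join}</#{node.name}>"
end

# Step 1: always wrap the article in a new <div>.
# Step 2: keep the article's children, and rename the article node itself
#         to <div> unless it is already a <p> or a <div>.
def wrap_article(article)
  renamed = %w[p div].include?(article.name) ? article.name : "div"
  Node.new("div", [Node.new(renamed, article.children)])
end

# Tag-less input such as "My content" ends up as <body><p>My content</p></body>
# after parsing, so <body> is the article node here.
body = Node.new("body", [Node.new("p", ["My content"])])
puts render(wrap_article(body))
# prints <div><div><p>My content</p></div></div>
```

The outer <div> comes from the unconditional wrap and the inner one from renaming <body>, which matches the output reported at the top of this issue.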

cantino commented 8 years ago

Thanks @borama, I'm open to a PR with a fix if you're diving into it.