cantino / ruby-readability

Port of arc90's readability project to Ruby
Apache License 2.0

Help troubleshooting what's stripped out #96

Open avk opened 6 months ago

avk commented 6 months ago

Thanks for your work on this neat gem.

After running readability on a page's HTML, I expected more of the markup to remain than readability leaves intact.




In the screenshot above, the following content is stripped out:

  1. the red "Submit" heading:

    <h1 class="titles">
    <a href="" rel="bookmark" title="SubmitPermanent Link to ">Submit</a>
    </h1>
  2. the red "Submissions are now open through January 9, 2024" and "Submit!" headings and links:

    <h2 style="text-align: center;"><a href="">Submissions are now open through January 9, 2024!</a></h2>
    <h2 style="text-align: center;"><a href="">Submit!</a></h2>

Turning on debug: true doesn't seem to explain why these items are missing:

% readability -d
/Users/avk/.rvm/gems/ruby-2.7.8@wbm/gems/ruby-readability-0.7.0/bin/readability:31: warning: calling via Kernel#open is deprecated, call directly or use URI#open
Removing unlikely candidate - magnific_popup-css
Removing unlikely candidate - nav superfishmenu-100-word-story-menu
Removing unlikely candidate - menu-item menu-item-type-post_type menu-item-object-page menu-item-73menu-item-73
Removing unlikely candidate - menu-item menu-item-type-post_type menu-item-object-page current-menu-item page_item page-item-6 current_page_item menu-item-72menu-item-72
Removing unlikely candidate - menu-item menu-item-type-post_type menu-item-object-page menu-item-83menu-item-83
Removing unlikely candidate - menu-item menu-item-type-post_type menu-item-object-page menu-item-189menu-item-189
Removing unlikely candidate - menu-item menu-item-type-post_type menu-item-object-page menu-item-70menu-item-70
Removing unlikely candidate - header
Removing unlikely candidate - comments
Removing unlikely candidate - commentlist clearfix
Removing unlikely candidate - comment even thread-even depth-1 parentcomment-65
Removing unlikely candidate - comment-author vcard
Removing unlikely candidate - comment-meta commentmetadata
Removing unlikely candidate - comment byuser comment-author-100words bypostauthor odd alt depth-2comment-66
Removing unlikely candidate - comment-author vcard
Removing unlikely candidate - comment-meta commentmetadata
Removing unlikely candidate - comment byuser comment-author-100words bypostauthor even thread-odd thread-alt depth-1comment-57
Removing unlikely candidate - comment-author vcard
Removing unlikely candidate - comment-meta commentmetadata
Removing unlikely candidate - comment odd alt thread-even depth-1comment-56
Removing unlikely candidate - comment-author vcard
Removing unlikely candidate - comment-meta commentmetadata
Removing unlikely candidate - comment even thread-odd thread-alt depth-1comment-52
Removing unlikely candidate - comment-author vcard
Removing unlikely candidate - comment-meta commentmetadata
Removing unlikely candidate - sidebar-wrapper
Removing unlikely candidate - sidebar
Removing unlikely candidate - sidebar-box widget_blockblock-3
Removing unlikely candidate - widget_text sidebar-box widget_custom_htmlcustom_html-2
Removing unlikely candidate - sidebar-box widget_texttext-3
Removing unlikely candidate - sidebar-box widget_texttext-4
Removing unlikely candidate - sidebar-box widget_texttext-7
Removing unlikely candidate - sidebar-box widget_linkslinkcat-10
Removing unlikely candidate - footer
Altering div(#pages.) to p
Altering div(#.) to p
Altering div(#.) to p
Altering div(#.) to p
Altering div(#.) to p
Altering div(#.) to p
Altering div(#.) to p
Altering div(#.) to p
Top 5 candidates:
Candidate with score 51.935052531041066
Candidate div#left-div. with score 16.71186440677966
Best candidate with score 51.935052531041066
Conditionally cleaned div#.addtoany_share_save_container addtoany_content addtoany_content_bottom with weight 25 and content score 0 because it has too short a content length without a single image.
Conditionally cleaned div#.a2a_kit a2a_kit_size_24 addtoany_list with weight 0 and content score 0 because it has too short a content length without a single image.
Conditionally cleaned div#.recentposts with weight 25 and content score 0 because it has too short a content length without a single image.

                    <p>100 words for your story … no more or no less. Tell a story, pen a slice of your memoir, or try your hand at an essay.</p>
<p>You get 100 words—exactly 100 words—which is both the pain and the pleasure here. It’s short, you tell yourself. You could write 100 words at a bus stop, on your lunch break, in your sleep. But with 100 words you must tell the whole story in its entirety, so it holds together like a perfect little doll house. (Your title is not part of the 100 words.)</p>
<p>Please include a short bio (25 words, max!) with your submission. Also, did we say exactly 100 words? We weren’t kidding! We count words according to Microsoft Word’s word-count tally. Also, make friends with your spell-check, or have a friend proofread your story.</p>
<p>We currently charge a $2 submission fee, the minimum in order to cover the costs of the submission system.</p>
<p> </p>
<p> </p>


Any ideas on how to broaden the extraction so that it includes this content?

avk commented 6 months ago

I think I've traced this particular effect to this part of #sanitize:

      node.css("h1, h2, h3, h4, h5, h6").each do |header|
        header.remove if class_weight(header) < 0 || get_link_density(header) > 0.33
      end

What's the thinking behind link density, especially applied to headings? Is there a way to customize or tune this?
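
For reference, here is how I understand the check. This is only a rough sketch: I'm assuming link density means the length of the text inside `<a>` tags divided by the length of all text in the element. The `strip_tags` and `link_density` helpers below are my own stand-ins, not the gem's code, and they use a regex rather than a real HTML parser to stay dependency-free.

```ruby
# Dependency-free approximation of a link-density check.
# Assumption: density = (characters of text inside <a> tags) / (all text characters).
# A regex stands in for a real HTML parser, so this only suits simple snippets.

# Remove anything that looks like a tag, leaving only text.
def strip_tags(html)
  html.gsub(/<[^>]*>/, "")
end

# Ratio of link text length to total text length (0.0 for empty elements).
def link_density(html)
  text_length = strip_tags(html).length
  return 0.0 if text_length.zero?

  link_length = html.scan(%r{<a\b[^>]*>(.*?)</a>}m).flatten.sum { |inner| strip_tags(inner).length }
  link_length / text_length.to_f
end

linked_heading = '<h2 style="text-align: center;"><a href="">Submit!</a></h2>'
plain_heading  = '<h2>Submit!</h2>'

puts link_density(linked_heading) # 1.0: all of the heading text is link text
puts link_density(plain_heading)  # 0.0: no link text at all
```

If that reading is right, any heading whose text is entirely wrapped in a link scores 1.0, well above the 0.33 cutoff, which would explain why all three headings above disappear.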

cantino commented 2 months ago

I think this is the same code that was in the original readability.js, from which this gem was ported. You could parameterize the threshold if you want to make it more flexible.
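
A rough sketch of what that parameterization could look like. Everything here is hypothetical: the `max_header_link_density` option, the `keep_headers` method, and the `Header` struct (a stand-in for a parsed heading node) do not exist in the gem, and the `class_weight` half of the condition is left out for brevity.

```ruby
# Hypothetical sketch: none of these names exist in ruby-readability.
# Header is a stand-in for a parsed heading node.
Header = Struct.new(:text, :link_text) do
  # Share of the heading's text that sits inside links.
  def link_density
    return 0.0 if text.empty?

    link_text.length / text.length.to_f
  end
end

# The threshold that is hard-coded in the snippet quoted above.
DEFAULT_MAX_HEADER_LINK_DENSITY = 0.33

# Keep headers whose link density does not exceed the configurable threshold.
def keep_headers(headers, options = {})
  max_density = options.fetch(:max_header_link_density, DEFAULT_MAX_HEADER_LINK_DENSITY)
  headers.reject { |header| header.link_density > max_density }
end

headers = [
  Header.new("Submit!", "Submit!"),       # fully linked heading, density 1.0
  Header.new("Submission guidelines", "") # plain heading, density 0.0
]

puts keep_headers(headers).map(&:text).inspect
puts keep_headers(headers, max_header_link_density: 1.0).map(&:text).inspect
```

With the default threshold only the plain heading survives; raising the hypothetical option to 1.0 keeps the fully linked heading as well, since the comparison is strictly greater-than.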