cantino / ruby-readability

Port of arc90's readability project to Ruby
Apache License 2.0

Small improvements to improve debugging and flexibility #97

Closed tuzz closed 3 months ago

tuzz commented 4 months ago

Hello, this PR includes a few small improvements that should help users with debugging and give them more flexibility over which attributes are preserved and which nodes are removed by clean_conditionally.

1) Allow whitelisting all attributes by setting attributes: ["*"]
2) Allow setting options[:debug] to a function, e.g. so that you can send messages to the Rails log
3) Fix sibling content not being stripped when checking its length
4) Allow setting options[:clean_conditionally] to a function so that you can override the default decision
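To make items 1) and 2) concrete, here is a minimal sketch in plain Ruby. The helpers `filter_attributes` and `debug` are hypothetical names invented for this illustration and are not the gem's internals; they only model the behaviour described above, assuming a "*" entry means "keep everything" and that a callable debug option receives the message string.

```ruby
# Hypothetical helpers; they model items 1) and 2), not the gem's internals.

# Keep only whitelisted attributes; a "*" entry keeps everything.
def filter_attributes(attrs, whitelist)
  return attrs.dup if whitelist.include?("*")
  attrs.select { |name, _value| whitelist.include?(name) }
end

# Pass the message to a callable if one was given; otherwise print it
# when debug is simply truthy. Do nothing when debug is off.
def debug(options, message)
  handler = options[:debug]
  return unless handler
  if handler.respond_to?(:call)
    handler.call(message)
  else
    puts message
  end
end

attrs = { "href" => "/a", "class" => "nav", "data-id" => "7" }
HREF_ONLY  = filter_attributes(attrs, ["href"])
EVERYTHING = filter_attributes(attrs, ["*"])

# A lambda that collects messages; in a Rails app it could forward
# them to the application logger instead.
LOGGED = []
debug({ debug: ->(msg) { LOGGED << msg } }, "Removed div.sidebar")
```

Accepting anything that responds to `call` means a lambda, a proc, or a method object can all serve as the debug sink.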

The above changes won't change Readability's behaviour except for the small bug fix in 3).
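As a rough illustration of why the fix in 3) matters: if text is measured before stripping, surrounding whitespace can push a nearly empty sibling over a length threshold. The threshold value and the `long_enough?` helper below are assumptions made up for this sketch, not values taken from the gem.

```ruby
# Hypothetical threshold and helper, for illustration only.
MIN_SIBLING_LENGTH = 10

# Compare the text length against the threshold, optionally stripping
# leading and trailing whitespace first.
def long_enough?(text, strip:)
  content = strip ? text.strip : text
  content.length > MIN_SIBLING_LENGTH
end

# Indentation and newlines from the HTML source surround a short word.
PADDED = "\n      Share\n      "

UNSTRIPPED_RESULT = long_enough?(PADDED, strip: false) # whitespace inflates the length
STRIPPED_RESULT   = long_enough?(PADDED, strip: true)  # only "Share" is measured
```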

Thanks for your consideration.

cantino commented 3 months ago

Thanks @tuzz, these changes look reasonable!

cantino commented 2 months ago

Released in 0.7.2

avk commented 2 months ago

I appreciate the test case for options[:clean_conditionally], but I still don't quite understand how to use it. Do you have any other examples @tuzz?

Would it be worth elaborating in the README?

tuzz commented 2 months ago

@avk I can try to explain. I can add something to the README if it would be helpful.

Basically, readability tries to extract the "useful" content from the page. It does this by scoring elements and then extracting the one with the highest score. Within that element, there might be sub-elements that aren't particularly useful. For example, a news article might have an advert banner in the middle of it. Because of this, readability does a second pass called "clean conditionally" where it tries to remove those sorts of elements based on some hardcoded rules. If you switch on debug mode it makes it easier to understand which elements have been "cleaned conditionally".
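The two passes described above can be sketched roughly as follows. This is a toy model: the `Node` struct, the scores, and the "negative score" rule are all invented for illustration, and readability's real scoring and cleaning rules are more involved.

```ruby
# Hypothetical model: Node, the scores, and the rule are made up for illustration.
Node = Struct.new(:name, :score, :text, :children)

# Pass 1: pick the candidate with the highest score.
def best_candidate(candidates)
  candidates.max_by(&:score)
end

# Pass 2: drop children that a (made-up) rule considers unhelpful,
# here anything with a negative score.
def clean_conditionally(node)
  kept = node.children.reject { |child| child.score < 0 }
  Node.new(node.name, node.score, node.text, kept)
end

advert  = Node.new("div.ad", -5, "Buy now!", [])
para    = Node.new("p", 12, "The actual story text.", [])
article = Node.new("div.article", 40, "", [para, advert])
sidebar = Node.new("div.sidebar", 3, "", [])

CLEANED = clean_conditionally(best_candidate([sidebar, article]))
```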

In some cases, however, it might remove elements that you don't want it to (or include elements that it shouldn't). The change I introduced allows you to intervene and override readability's decision using your own lambda. The lambda is provided with some context that includes the HTML element, the element's score, the decision that readability made about whether to remove it, etc. The return value of the lambda should be whether to remove the element or not. For example, if you set clean_conditionally to the following lambda, you'd invert all the decisions readability made about whether to remove the element:

# Braces keep the block attached to the lambda even if the enclosing call omits parentheses.
clean_conditionally: ->(context) { !context[:remove] }

Perhaps a more useful lambda would be one where you force readability to always remove a specific piece of content:

clean_conditionally: lambda do |context|
  if context[:el].text.include?("Visit our blog")
    true # Always remove elements that contain 'Visit our blog'
  else
    context[:remove] # Otherwise, remove the element according to readability's default rules.
  end
end
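To show how such an override slots in, here is a self-contained sketch that does not need the gem. `Element` and `decide_removal` are hypothetical stand-ins for the HTML node and for the point where readability consults the option; only the `:el` and `:remove` context keys mentioned above are modelled.

```ruby
# Hypothetical stand-ins: Element and decide_removal are not part of the gem.
Element = Struct.new(:text)

# Start from the default decision, then let an optional callable override it.
def decide_removal(el, default_remove, options = {})
  override = options[:clean_conditionally]
  return default_remove unless override
  override.call({ el: el, remove: default_remove })
end

# Always remove elements mentioning the blog; otherwise keep the default decision.
remove_blog_links = lambda do |context|
  context[:el].text.include?("Visit our blog") || context[:remove]
end

promo = Element.new("Visit our blog for more")
story = Element.new("The actual story text.")

PROMO_REMOVED = decide_removal(promo, false, clean_conditionally: remove_blog_links)
STORY_REMOVED = decide_removal(story, false, clean_conditionally: remove_blog_links)
DEFAULT_KEPT  = decide_removal(promo, false)
```

With the override supplied, the promo element is removed even though the default decision was to keep it; without the override, the default decision stands.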

Hopefully that helps!

avk commented 2 months ago

@tuzz thank you; fantastic and thorough! Excited to experiment with this.