Closed tuzz closed 3 months ago
Thanks @tuzz, these changes look reasonable!
Released in 0.7.2
I appreciate the test case for options[:clean_conditionally], but I still don't quite understand how to use it. Do you have any other examples @tuzz?
Would it be worth elaborating in the README?
@avk I can try to explain. I can add something to the README if it would be helpful.
Basically, readability tries to extract the "useful" content from the page. It does this by scoring elements and then extracting the one with the highest score. Within that element, there might be sub-elements that aren't particularly useful. For example, a news article might have an advert banner in the middle of it. Because of this, readability does a second pass called "clean conditionally" where it tries to remove those sorts of elements based on some hardcoded rules. If you switch on debug mode, it is easier to see which elements have been "cleaned conditionally".
In some cases, however, it might remove elements that you don't want it to (or keep elements that it shouldn't). The change I introduced allows you to intervene and override readability's decision using your own lambda. The lambda is provided with some context that includes the HTML element, the element's score, the decision that readability made about whether to remove it, etc. The return value of the lambda should be whether to remove the element or not. For example, if you set clean_conditionally to the following lambda you'd invert all the decisions readability made about whether to remove the element:
clean_conditionally: lambda do |context|
  !context[:remove]
end
Perhaps a more useful lambda would be one where you force readability to always remove a specific piece of content:
clean_conditionally: lambda do |context|
  if context[:el].text.include?("Visit our blog")
    true # Always remove elements that contain 'Visit our blog'
  else
    context[:remove] # Otherwise, remove the element according to readability's default rules.
  end
end
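To see how a lambda like this behaves without fetching a real page, you can call it directly with hand-built contexts. This is only a sketch: FakeElement is a hypothetical stand-in for the HTML element (only its text method is used), and the contexts carry just the :el and :remove keys described above.

```ruby
# Hypothetical stand-in for the HTML element readability passes in.
# Only #text is needed by the lambda below.
FakeElement = Struct.new(:text)

# Always remove promotional blocks; otherwise defer to readability's decision.
clean_conditionally = lambda do |context|
  if context[:el].text.include?("Visit our blog")
    true
  else
    context[:remove]
  end
end

# Hand-built contexts (assumption: real contexts contain more keys than these two).
promo   = { el: FakeElement.new("Visit our blog for more!"), remove: false }
article = { el: FakeElement.new("The main story text."),     remove: false }
advert  = { el: FakeElement.new("Buy now!"),                 remove: true }

decisions = [promo, article, advert].map { |ctx| clean_conditionally.call(ctx) }
p decisions # => [true, false, true]
```

The promo block is removed even though readability would have kept it, while the other two follow the default decision. In real use the same lambda would be passed as the clean_conditionally option.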
Hopefully that helps!
@tuzz thank you; fantastic and thorough! Excited to experiment with this.
Hello, this PR includes a few small improvements that should help users with debugging and give them more flexibility over which attributes are preserved and which nodes are removed by clean_conditionally.
1) Allow whitelisting all attributes by setting attributes: ["*"]
2) Allow setting options[:debug] to a function, e.g. so that you can add messages to Rails logging
3) Fix sibling content not being stripped when checking its length
4) Allow setting options[:clean_conditionally] to a function so that you can override the default decision
The above changes won't change Readability's behaviour except for the small bug fix in 3).
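As a rough illustration of 1) and 2), an options hash might look like the sketch below. It assumes the debug function is called with a single message string, which this description does not spell out. The logger here is a plain stdlib Logger standing in for Rails.logger, and the sample message is made up for the demonstration.

```ruby
require "logger"
require "stringio"

# Stand-in for Rails.logger so the sketch runs outside a Rails app.
log_output = StringIO.new
logger = Logger.new(log_output)

options = {
  # 1) Preserve every attribute instead of listing them one by one.
  attributes: ["*"],
  # 2) Route debug output to a logger.
  #    Assumption: the function receives one message string.
  debug: lambda { |message| logger.debug("[readability] #{message}") }
}

# Simulate a debug call with a made-up message to show where it ends up.
options[:debug].call("example debug message")
puts log_output.string
```

In a Rails app the lambda body would presumably call Rails.logger.debug instead of the local logger.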
Thanks for your consideration.