cantino / ruby-readability

Port of arc90's readability project to Ruby
Apache License 2.0

Ignore redundant nesting when checking for related siblings #98

Closed. tuzz closed this 2 months ago

tuzz commented 3 months ago

If the best candidate is the only child of its parent, then we should probably look at its nearest ancestor that has siblings when deciding whether to append siblings that meet the score threshold.

In the example below, the second paragraph would now be included, whereas previously it would not.

<div>
  <div>
    <p>This is the best candidate.</p>
  </div>
</div>
<p>This paragraph meets the score threshold.</p>

Note that this changes behaviour. We could put this change behind an option if preferred. I think this will improve the extraction for most use cases, though, and none of the existing test cases fail.
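
To make the idea concrete, here is a minimal sketch of the ancestor walk, not the code from this pull request. It assumes a hypothetical `Node` struct in place of real DOM nodes, and the helper names `node` and `nearest_ancestor_with_siblings` are made up for illustration. Scoring is left out entirely; the sketch only shows which element's siblings would be considered.

```ruby
# Hypothetical minimal node type standing in for a real DOM node.
# It is not part of ruby-readability.
Node = Struct.new(:name, :text, :children, :parent) do
  # All other children of this node's parent (empty for the root).
  def siblings
    parent ? parent.children.reject { |c| c.equal?(self) } : []
  end
end

# Hypothetical helper: build a node and wire up the parent links.
def node(name, text = nil, children = [])
  n = Node.new(name, text, children, nil)
  children.each { |c| c.parent = n }
  n
end

# Climb from the best candidate while it is the only child of its parent.
# Stops at the first ancestor that has siblings, or at the root.
def nearest_ancestor_with_siblings(candidate)
  current = candidate
  current = current.parent while current.parent && current.siblings.empty?
  current
end

# The structure from the example above, wrapped in an assumed <body>.
best  = node("p", "This is the best candidate.")
inner = node("div", nil, [best])
outer = node("div", nil, [inner])
other = node("p", "This paragraph meets the score threshold.")
node("body", nil, [outer, other])

start = nearest_ancestor_with_siblings(best)
puts start.siblings.map(&:text).inspect
# prints ["This paragraph meets the score threshold."]
```

Without the walk, the sibling check would start from the best candidate itself, which has no siblings, so the second paragraph would never be examined.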

cantino commented 3 months ago

Hi @tuzz! Does the Mozilla JS version do this? For consistency we should probably make this an option.

tuzz commented 2 months ago

Hi @cantino, apologies for the slow reply.

No, the Mozilla JS version doesn't have this feature. I've just pushed a commit to hide it behind an option. Hopefully the feature is useful enough to be considered for inclusion. We've found it really helps for some DOM structures.

Thanks

cantino commented 2 months ago

Thanks @tuzz!

cantino commented 2 months ago

Released in 0.7.2