buriy / python-readability

fast python port of arc90's readability tool, updated to match latest readability.js!
https://github.com/buriy/python-readability
Apache License 2.0

Wrong length check in summary when using xpath=True results in wrong summaries #146

Open yeus opened 4 years ago

yeus commented 4 years ago

When using xpath=True in summary to extract the main content, the result is wrong for several webpages; without it, the result is correct.

The reason is that the length check in the summary function is done on the HTML including the xpath attributes, which should not be the case. As a result, using xpath gives different results from not using it, and it implicitly defines a different length threshold for selecting the summary.

https://github.com/buriy/python-readability/blob/e4a699bbb03e50f45468de228f549d2c32fc1034/readability/readability.py#L254
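
To illustrate the effect described above, here is a minimal sketch using only the standard library, not the readability code itself. The helper names, the `x` attribute, the simplified path format, and the threshold of 250 are assumptions chosen for the demonstration. It shows that tagging every element with a path attribute inflates the serialized length enough to flip a length check.

```python
import copy
import xml.etree.ElementTree as ET

RETRY_LENGTH = 250  # hypothetical threshold, for illustration only


def annotate_with_paths(root, attr="x"):
    """Store a simple positional path on every element (stand-in for xpath tagging)."""
    def walk(elem, path):
        elem.set(attr, path)
        counts = {}
        for child in elem:
            counts[child.tag] = counts.get(child.tag, 0) + 1
            walk(child, f"{path}/{child.tag}[{counts[child.tag]}]")
    walk(root, f"/{root.tag}")
    return root


def serialized_length(root):
    """Length of the element serialized as a string, attributes included."""
    return len(ET.tostring(root, encoding="unicode"))


def length_without_paths(root, attr="x"):
    """Length measured on a copy with the path attribute stripped."""
    clean = copy.deepcopy(root)
    for elem in clean.iter():
        elem.attrib.pop(attr, None)
    return serialized_length(clean)


html = "<div><p>" + "word " * 40 + "</p><p>short</p></div>"
plain = ET.fromstring(html)
tagged = annotate_with_paths(ET.fromstring(html))

plain_len = serialized_length(plain)
tagged_len = serialized_length(tagged)

# Same content, different verdicts: only the tagged tree passes the threshold.
print(plain_len >= RETRY_LENGTH, tagged_len >= RETRY_LENGTH)
# Measuring on a stripped copy restores the untagged length.
print(length_without_paths(tagged) == plain_len)
```

In this sketch the same document is judged too short without the attributes and long enough with them, which is the kind of inconsistency reported here.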

One idea might be to add the xpath attributes to the HTML at the end, after all calculations have been done, rather than at the beginning:

https://github.com/buriy/python-readability/blob/e4a699bbb03e50f45468de228f549d2c32fc1034/readability/readability.py#L150
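
The reordering suggested above could look roughly like the following sketch. It is written against the standard library only. `summarize`, `add_path_attributes`, and the `retry_length` default are hypothetical names and values, not the library's API. It also assumes the tree is not restructured between the length check and the tagging, which may not hold in the real code.

```python
import xml.etree.ElementTree as ET


def add_path_attributes(root, attr="x"):
    """Hypothetical helper: tag each element with a simple positional path."""
    stack = [(root, f"/{root.tag}")]
    while stack:
        elem, path = stack.pop()
        elem.set(attr, path)
        counts = {}
        for child in elem:
            counts[child.tag] = counts.get(child.tag, 0) + 1
            stack.append((child, f"{path}/{child.tag}[{counts[child.tag]}]"))


def summarize(root, retry_length=250, xpath=False):
    """Return (serialized_html, acceptable), checking the length before tagging."""
    # The length check sees the tree without any path attributes.
    acceptable = len(ET.tostring(root, encoding="unicode")) >= retry_length
    # Attributes are added only after every length-based decision is made.
    if xpath:
        add_path_attributes(root)
    return ET.tostring(root, encoding="unicode"), acceptable


doc = "<div><p>" + "word " * 40 + "</p><p>short</p></div>"
_, ok_plain = summarize(ET.fromstring(doc), xpath=False)
tagged_html, ok_tagged = summarize(ET.fromstring(doc), xpath=True)

# The verdict no longer depends on whether xpath tagging was requested.
print(ok_plain == ok_tagged)
```

With this ordering the length verdict is the same with and without xpath, while the returned HTML still carries the path attributes.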

best, Thomas