Wrong length check in summary with xpath=True produces wrong summaries

When using xpath=True in summary to extract the main content, the result is wrong for several webpages. Without xpath=True, the result is correct.

The reason is that the length check in the summary function is performed on the HTML including the xpath attributes, which should not be the case. As a result, the output differs depending on whether xpath is used, and the extra attribute text implicitly changes the effective length threshold for selecting the summary.

https://github.com/buriy/python-readability/blob/e4a699bbb03e50f45468de228f549d2c32fc1034/readability/readability.py#L254
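
To illustrate the effect, here is a minimal sketch that uses only the standard library rather than the library's own code. The helper `annotate_with_paths`, the attribute name `x`, and the threshold value `RETRY_LENGTH = 80` are assumptions chosen for the demonstration; the real threshold and annotation code live in readability.py.

```python
import xml.etree.ElementTree as ET

RETRY_LENGTH = 80  # assumed threshold, for illustration only


def annotate_with_paths(root, attr="x"):
    """Hypothetical helper: store a simple positional path on every element."""
    def walk(elem, path):
        elem.set(attr, path)
        counts = {}
        for child in elem:
            counts[child.tag] = counts.get(child.tag, 0) + 1
            walk(child, f"{path}/{child.tag}[{counts[child.tag]}]")
    walk(root, "/" + root.tag)


def serialized_length(root):
    """Length of the serialized markup, attributes included."""
    return len(ET.tostring(root, encoding="unicode"))


html = "<div><p>Short paragraph one.</p><p>Short paragraph two.</p></div>"
plain = ET.fromstring(html)
annotated = ET.fromstring(html)
annotate_with_paths(annotated)

plain_len = serialized_length(plain)
annotated_len = serialized_length(annotated)

# Same visible content, but only the annotated version clears the threshold.
print(plain_len >= RETRY_LENGTH, annotated_len >= RETRY_LENGTH)
```

The text content is identical in both trees, yet the length check gives a different answer once the path attributes are part of the serialized markup.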

One idea might be to add the xpath attributes to the HTML at the end, after all calculations have been done, rather than at the beginning:

https://github.com/buriy/python-readability/blob/e4a699bbb03e50f45468de228f549d2c32fc1034/readability/readability.py#L150
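
A related variation, sketched here with the standard library only and not taken from readability.py, would be to keep the annotation where it is but run the length check on a copy with the attributes stripped. The function name `length_ignoring_attr` and the attribute name `x` are assumptions for illustration.

```python
import copy
import xml.etree.ElementTree as ET


def length_ignoring_attr(root, attr="x"):
    """Serialized length of the tree with the given attribute removed.

    Works on a deep copy, so the caller's tree keeps its annotations.
    """
    clone = copy.deepcopy(root)
    for elem in clone.iter():
        elem.attrib.pop(attr, None)
    return len(ET.tostring(clone, encoding="unicode"))


annotated = ET.fromstring('<div x="/div"><p x="/div/p[1]">Some text.</p></div>')
plain = ET.fromstring("<div><p>Some text.</p></div>")

# The measured length no longer depends on whether the tree was annotated.
same_length = length_ignoring_attr(annotated) == len(ET.tostring(plain, encoding="unicode"))
print(same_length)
```

Either way, the goal is the same: the threshold comparison should see the same length whether or not xpath=True is set.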

best, Thomas

Wrong length check in summary with xpath=True produces wrong summaries #146