When using xpath=True in summary to extract the main content, you get the wrong result for several webpages; without it, the result is correct.
The reason is that the length check in the summary function is done on the HTML including the xpath attributes. This should not be the case: it gives different results when using xpath vs. not using it, and it implicitly defines a different length threshold for selecting the summary.
https://github.com/buriy/python-readability/blob/e4a699bbb03e50f45468de228f549d2c32fc1034/readability/readability.py#L254
One idea might be to add the xpath attributes to the HTML at the end, after all calculations have been done, rather than at the beginning:
https://github.com/buriy/python-readability/blob/e4a699bbb03e50f45468de228f549d2c32fc1034/readability/readability.py#L150
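To illustrate the effect, here is a minimal sketch using only the standard library, not the actual readability code. The attribute name `x` and the helpers `annotate_with_paths` and `length_without_paths` are assumptions made up for this example. It shows that the serialized length grows once path attributes are attached, and that measuring the length on a copy without those attributes gives the same number as the unannotated document.

```python
import copy
import xml.etree.ElementTree as ET

XPATH_ATTR = "x"  # assumed attribute name, for illustration only


def annotate_with_paths(root):
    """Hypothetical helper: store a simple positional path on every element."""
    def walk(elem, path):
        elem.set(XPATH_ATTR, path)
        counts = {}
        for child in elem:
            # Number siblings per tag, e.g. /div/p[1], /div/p[2]
            counts[child.tag] = counts.get(child.tag, 0) + 1
            walk(child, f"{path}/{child.tag}[{counts[child.tag]}]")
    walk(root, f"/{root.tag}")


def length_without_paths(root):
    """Hypothetical helper: serialized length ignoring the path attributes."""
    clean = copy.deepcopy(root)  # leave the annotated tree untouched
    for elem in clean.iter():
        elem.attrib.pop(XPATH_ATTR, None)
    return len(ET.tostring(clean, encoding="unicode"))


plain = ET.fromstring("<div><p>Short article text.</p><p>More text.</p></div>")
annotated = copy.deepcopy(plain)
annotate_with_paths(annotated)

plain_len = len(ET.tostring(plain, encoding="unicode"))
annotated_len = len(ET.tostring(annotated, encoding="unicode"))

# The annotated markup is longer, so the same threshold behaves differently.
print(plain_len, annotated_len, length_without_paths(annotated))
```

Either approach (annotating only at the end, or excluding the attributes when measuring) would keep the length threshold the same with and without xpath.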
best, Thomas