buriy / python-readability

fast python port of arc90's readability tool, updated to match latest readability.js!
https://github.com/buriy/python-readability
Apache License 2.0

REGEXES["divToPElementsRe"] logical error #160

Closed: luoqishuai closed this issue 3 years ago

luoqishuai commented 3 years ago

In readability, transform_misused_divs_into_paragraphs contains:

for elem in self.tags(self.html, "div"):
    if not REGEXES["divToPElementsRe"].search(str_(b"".join(map(tostring_, list(elem))))):

Because the serialized elem always contains "<div", re.search always matches, so the "if not" condition is never true.

demo

import re
from readability.readability import *

doc = Document('<div></div>')
print(tostring_(doc._html()))
# Serialize every <div> element itself, including its own tag
node_list = list(doc.tags(doc.html, 'div'))
search_str = ''.join(tostring_(node).decode() for node in node_list)
print(re.search('<(a|blockquote|dl|div|img|ol|p|pre|table|ul)', search_str))

output

b'<html><body><div/></body></html>'
<_sre.SRE_Match object; span=(0, 4), match='<div'>

Please let me know if I've got this wrong.

buriy commented 3 years ago

tostring_ gets HTML and text inside the elements.

luoqishuai commented 3 years ago

compat/__init__.py:

from lxml.etree import tostring
def tostring_(s):
    return tostring(s, encoding='utf-8')

I ran tostring(node_list[0])

output: b'<div>a</div>'

It looks like tostring(node) also includes the node's own tag.
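The difference between serializing a node and serializing only its children can be checked in isolation. The sketch below is not the library's code: it assumes the standard library's xml.etree.ElementTree as a stand-in for lxml.etree (both include the element's own tag in tostring, and both yield only child elements from list(elem)), and the helper names serialize_self and serialize_children are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree (assumption)

# Same alternation as the pattern used in the demo in this thread
BLOCK_TAG_RE = re.compile(r"<(a|blockquote|dl|div|img|ol|p|pre|table|ul)")

root = ET.fromstring("<body><div>a</div><div>b<p>c</p></div></body>")
text_only_div, div_with_p = root.findall(".//div")


def serialize_self(elem):
    # Hypothetical helper: serializes the element itself,
    # so its own "<div" always appears in the result
    return ET.tostring(elem, encoding="unicode")


def serialize_children(elem):
    # Hypothetical helper: list(elem) yields only child elements,
    # so the element's own tag and its direct text are excluded
    return "".join(ET.tostring(child, encoding="unicode") for child in list(elem))


print(serialize_self(text_only_div))      # <div>a</div>
print(serialize_children(div_with_p))     # <p>c</p>
print(BLOCK_TAG_RE.search(serialize_self(text_only_div)) is not None)  # True
print(BLOCK_TAG_RE.search(serialize_children(text_only_div)))          # None
```

Under these assumptions, the pattern matches any div that is serialized together with its own tag, while a text-only div serialized through its children produces an empty string and no match.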

buriy commented 3 years ago

OK, thanks, I'll fix that. This was supposed to replace a div with a p if it contains only text and no tags inside.
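For illustration, the intended behaviour could look roughly like the sketch below. It is only one possible reading, not the fix that went into the library: it again assumes xml.etree.ElementTree as a stand-in for lxml.etree, the function divs_to_paragraphs is hypothetical, and it compares tag names against the list from the regex in this issue rather than searching serialized markup.

```python
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree (assumption)

# Tag names taken from the regex quoted in this issue
BLOCK_TAGS = {"a", "blockquote", "dl", "div", "img", "ol", "p", "pre", "table", "ul"}


def divs_to_paragraphs(root):
    """Rename each <div> with no descendant in BLOCK_TAGS to <p>.

    Hypothetical helper, not the library's function.
    """
    # Materialize the list first so renaming does not disturb iteration
    for div in list(root.iter("div")):
        # div.iter() also yields the div itself, so skip it explicitly
        has_block = any(d is not div and d.tag in BLOCK_TAGS for d in div.iter())
        if not has_block:
            div.tag = "p"
    return root


root = ET.fromstring("<body><div>text <b>only</b></div><div><p>x</p></div></body>")
divs_to_paragraphs(root)
print(ET.tostring(root, encoding="unicode"))
# <body><p>text <b>only</b></p><div><p>x</p></div></body>
```

Comparing tag names directly sidesteps the question of whether the element's own tag ends up in the searched string, and it also avoids prefix matches such as "<a" matching "<abbr".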