TeamHG-Memex / html-text

Extract text from HTML
MIT License

guess_layout does not work on XHTML elements #25

Open keturn opened 4 years ago

keturn commented 4 years ago

After the failure of extract_text in #24, I tried etree_to_text.

That ran without raising an exception, but guess_layout has no effect: no newlines are added after block-level tags such as <p> and <li>.

I think it's because element.tag includes the tag's XML namespace, so it doesn't match the namespaceless NEWLINE_TAGS and DOUBLE_NEWLINE_TAGS.
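To illustrate the suspected cause, here is a minimal sketch using the standard library's xml.etree.ElementTree, which reports namespaced tags in the same {namespace}name form as lxml. The newline_tags set is a small illustrative stand-in, not the library's actual NEWLINE_TAGS.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
doc = ET.fromstring(
    '<html xmlns="%s"><body><p>text</p></body></html>' % XHTML_NS
)

# Illustrative stand-in for a namespaceless tag set (an assumption,
# not the real contents of html_text.NEWLINE_TAGS).
newline_tags = {"p", "div", "li"}

# Each tag is reported with its namespace in curly braces.
tags = [el.tag for el in doc.iter()]

# A plain membership test therefore never matches.
matches = [tag in newline_tags for tag in tags]

print(tags)
print(matches)
```

If the tag comparison in the library is a plain membership test like the one above, that would explain why no layout newlines are emitted for an XML-parsed XHTML tree.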

Test:


from lxml import etree

from html_text import etree_to_text, extract_text


def test_guess_layout():
    xhtml = (u'<html xmlns="http://www.w3.org/1999/xhtml">'
             '<head><title>  title  </title></head>'
             '<body><div>text_1.<p>text_2 text_3</p>'
             '<p id="demo"></p><ul><li>text_4</li><li>text_5</li></ul>'
             '<p>text_6<em>text_7</em>text_8</p>text_9</div>'
             '<script>document.getElementById("demo").innerHTML = '
             '"This should be skipped";</script> <p>...text_10</p>'
             '</body></html>')

    text = ('title\n\ntext_1.\n\ntext_2 text_3\n\ntext_4\ntext_5'
            '\n\ntext_6 text_7 text_8\n\ntext_9\n\n...text_10')
    # String input: extract_text parses the markup itself.
    assert extract_text(xhtml, guess_punct_space=False, guess_layout=True) == text

    # Tree parsed with XMLParser: element tags carry the XHTML namespace,
    # and this is the assertion that fails as described in this report.
    tree = etree.fromstring(xhtml, parser=etree.XMLParser())
    assert etree_to_text(tree, guess_layout=True) == text

keturn commented 4 years ago

This could be handled either by altering traverse_text_fragments to compare the tag's local name (using etree.QName), or by adding to NEWLINE_TAGS and DOUBLE_NEWLINE_TAGS a duplicate of each tag with {http://www.w3.org/1999/xhtml} prepended.
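
A rough sketch of both options, not a patch against the library. The helper names local_name and with_namespace are hypothetical, and the tag set is an illustrative stand-in. The first helper strips the namespace with plain string handling; with lxml, etree.QName(tag).localname should give the same result for string tags.

```python
XHTML_NS = "http://www.w3.org/1999/xhtml"


def local_name(tag):
    """Option 1 (hypothetical helper): drop a leading '{namespace}' prefix."""
    if not isinstance(tag, str):
        # Comments and processing instructions in lxml have non-string tags.
        return tag
    return tag.rpartition("}")[2]


def with_namespace(tags, namespace=XHTML_NS):
    """Option 2 (hypothetical helper): add a namespaced copy of every tag."""
    namespaced = frozenset("{%s}%s" % (namespace, t) for t in tags)
    return frozenset(tags) | namespaced


# Illustrative stand-in for a namespaceless tag set.
newline_tags = frozenset({"p", "li", "div"})
xhtml_p = "{%s}p" % XHTML_NS

# Option 1: normalise the element tag before the lookup.
found_by_local_name = local_name(xhtml_p) in newline_tags

# Option 2: widen the set once, leaving the lookup unchanged.
found_in_widened_set = xhtml_p in with_namespace(newline_tags)
```

The first option handles any namespace at the cost of an extra call per element; the second keeps the per-element lookup untouched but only covers the XHTML namespace.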