Alir3z4 / html2text

Convert HTML to Markdown-formatted text.
alir3z4.github.io/html2text/
GNU General Public License v3.0

WRAP_LIST_ITEMS setting is not respected #352

Open SebCorbin opened 3 years ago

SebCorbin commented 3 years ago
import html2text

body = """
1. Error exercitationem debitis magni tenetur dolorum inventore ex. Voluptatibus possimus voluptas quibusdam vel facere eaque sit. Et et hic totam aliquam et ut numquam. Omnis qui consectetur reiciendis. Deserunt qui aut mollitia qui. Dolores omnis aut facere sint et rerum.
2. Modi excepturi velit ab fuga dignissimos qui. Et dolorem ut quam consequatur. Quia repellat deleniti et aut quae in. Cum quidem maiores sint suscipit nobis ipsam.
3. Et tenetur sapiente velit. Neque culpa perspiciatis et molestias voluptatem officia rem. Dolorem reprehenderit recusandae nostrum voluptatem nihil et modi neque. Libero et tempore odit. Saepe quo dolorum voluptas. Aliquam illo nam eos qui eum.
"""
h = html2text.HTML2Text()
h.body_width = 80
h.wrap_list_items = True
print(h.handle(body))

This should normally render as:

 1. Error exercitationem debitis magni tenetur dolorum inventore ex.
Voluptatibus possimus voluptas quibusdam vel facere eaque sit. Et et hic totam
aliquam et ut numquam. Omnis qui consectetur reiciendis. Deserunt qui aut
mollitia qui. Dolores omnis aut facere sint et rerum.

2. Modi excepturi velit ab fuga dignissimos qui. Et dolorem ut quam
consequatur. Quia repellat deleniti et aut quae in. Cum quidem maiores sint
suscipit nobis ipsam.

3. Et tenetur sapiente velit. Neque culpa perspiciatis et molestias voluptatem
officia rem. Dolorem reprehenderit recusandae nostrum voluptatem nihil et modi
neque. Libero et tempore odit. Saepe quo dolorum voluptas. Aliquam illo nam eos
qui eum.

But instead it returns unwrapped text.
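A quick way to confirm the behaviour without eyeballing the output is to check for lines longer than body_width. The helper below is only a sketch; overlong_lines is a hypothetical name and not part of html2text.

```python
from typing import List


def overlong_lines(rendered: str, width: int) -> List[str]:
    """Return every line of `rendered` that is longer than `width` columns."""
    return [line for line in rendered.splitlines() if len(line) > width]


# Assumed usage with the snippet above (requires html2text to be installed):
#   overlong_lines(h.handle(body), h.body_width)
# An empty list would mean every line fits within body_width.
```

Since none of the words in the sample text is longer than 80 characters, a non-empty result here would point at list items not being wrapped.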

I suggest changing the end of the skipwrap() function to:

    # If the text begins with a single -, *, or +, followed by a space,
    # or an integer, followed by a ., followed by a space (in either
    # case optionally preceded by whitespace), it's a list; don't wrap,
    # unless explicitly specified.
    return bool(
        config.RE_ORDERED_LIST_MATCHER.match(stripped)
        or config.RE_UNORDERED_LIST_MATCHER.match(stripped)
    ) and not wrap_list_items

TB-effective commented 4 months ago

This bug is still present: numbered lists don't respect the wrap_list_items and body_width settings. Unordered lists are wrapped correctly; only ordered ones stay unwrapped. Looking at skipwrap(), there seem to be two places where it reacts to lists. One is the part shown in the description, which uses the RE_ORDERED_LIST_MATCHER and RE_UNORDERED_LIST_MATCHER regexes. The other comes before that and matches on literal list item characters:

    # I'm not sure what this is for; I thought it was to detect lists,
    # but there's a <br>-inside-<span> case in one of the tests that
    # also depends upon it.
    if stripped[0:1] in ("-", "*") and not stripped[0:2] == "**":
        return not wrap_list_items

This earlier check already takes wrap_list_items into account, so I would assume this is why wrapping works correctly for unordered lists.

So @SebCorbin's fix looks like it should do the trick.
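To make the reasoning concrete, here is a rough standalone model of just the list-related tail of skipwrap(), before and after the suggested change. It is a sketch under assumptions: skipwrap_tail and its patched flag are hypothetical, and the two regexes are stand-ins for config.RE_ORDERED_LIST_MATCHER and config.RE_UNORDERED_LIST_MATCHER, whose real patterns may differ.

```python
import re

# Assumed stand-ins for the html2text config regexes; the real ones may differ.
RE_ORDERED = re.compile(r"\d+\.\s")
RE_UNORDERED = re.compile(r"[-*+]\s")


def skipwrap_tail(para: str, wrap_list_items: bool, patched: bool) -> bool:
    """Model the list handling at the end of skipwrap().

    Returns True when the paragraph should NOT be wrapped.
    """
    stripped = para.lstrip()

    # Literal-character check: this branch already honours wrap_list_items.
    if stripped[0:1] in ("-", "*") and not stripped[0:2] == "**":
        return not wrap_list_items

    is_list = bool(RE_ORDERED.match(stripped) or RE_UNORDERED.match(stripped))
    if patched:
        # Suggested fix: skip wrapping only when wrap_list_items is off.
        return is_list and not wrap_list_items
    # Current behaviour as described in this issue: list items are never wrapped.
    return is_list


# Ordered item with wrap_list_items=True: skipped today, wrapped with the fix.
print(skipwrap_tail("1. Error exercitationem", True, patched=False))
print(skipwrap_tail("1. Error exercitationem", True, patched=True))
```

In this model an item starting with "- " never reaches the regex branch, which would match the observation that only ordered lists are affected.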