Alir3z4 / html2text

Convert HTML to Markdown-formatted text.
alir3z4.github.io/html2text/
GNU General Public License v3.0

IndexError when padding nested tables #370

Open pigmonkey opened 3 years ago

pigmonkey commented 3 years ago

I use html2text with the --pad-tables flag in my mailcap to read HTML email. Occasionally, html2text will fail when attempting to process an unholy nested table mess.

For example, the following HTML demonstrates the problem:

<html>
    <table>
        <tr>
            <td>
                <table>
                    <tr>
                        <td>
                            foo
                        </td>
                    </tr>
                </table>
            </td>
        </tr>
    </table>
</html>

Attempting to process this with the --pad-tables flag results in an IndexError:

Traceback (most recent call last):
  File "/usr/bin/html2text", line 33, in <module>
    sys.exit(load_entry_point('html2text==2020.1.16', 'console_scripts', 'html2text')())
  File "/usr/lib/python3.9/site-packages/html2text/cli.py", line 306, in main
    sys.stdout.write(h.handle(html))
  File "/usr/lib/python3.9/site-packages/html2text/__init__.py", line 146, in handle
    return pad_tables_in_text(markdown)
  File "/usr/lib/python3.9/site-packages/html2text/utils.py", line 273, in pad_tables_in_text
    table = reformat_table(table_buffer, right_margin)
  File "/usr/lib/python3.9/site-packages/html2text/utils.py", line 223, in reformat_table
    max_width = [len(x.rstrip()) + right_margin for x in lines[0].split("|")]
IndexError: list index out of range
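
The last frame indexes `lines[0]`, so the IndexError implies that `reformat_table` received an empty list of lines for one of the table buffers. Below is a minimal sketch of a guarded version of that width computation. `column_widths` is a hypothetical helper, not part of html2text, and returning no widths for an empty buffer is an assumption about what the right behavior would be, not a confirmed fix.

```python
from typing import List


def column_widths(lines: List[str], right_margin: int) -> List[int]:
    """Hypothetical helper mirroring the expression shown in the traceback."""
    if not lines:
        # Assumption: an empty table buffer has no columns to measure,
        # so report no widths instead of indexing lines[0].
        return []
    # Same computation as in the traceback: width of each cell in the
    # first row, with trailing whitespace stripped, plus the right margin.
    return [len(x.rstrip()) + right_margin for x in lines[0].split("|")]
```

With a non-empty buffer the result matches the original expression; with an empty one it returns an empty list rather than raising.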

It works fine without --pad-tables. If html2text cannot work out the padding, I would prefer that it fall back to rendering as if --pad-tables had not been given.
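
Until something like that exists in html2text itself, the fallback can be approximated from the caller's side. This is a rough sketch, not a fix for the underlying bug: `convert_with_fallback` and `html2text_convert` are hypothetical names, and it assumes the failure always surfaces as an IndexError, as in the traceback above.

```python
from typing import Callable


def convert_with_fallback(html: str, convert: Callable[[str, bool], str]) -> str:
    """Try padded conversion first; retry without padding on IndexError."""
    try:
        return convert(html, True)
    except IndexError:
        # Assumption: the padding step is what failed, so render again
        # as if --pad-tables had not been given.
        return convert(html, False)


def html2text_convert(html: str, pad_tables: bool) -> str:
    """Convert with html2text, toggling the pad_tables option."""
    import html2text  # imported lazily; the wrapper above does not need it

    converter = html2text.HTML2Text()
    converter.pad_tables = pad_tables
    return converter.handle(html)
```

Calling `convert_with_fallback(html, html2text_convert)` from a small script could stand in for invoking the CLI with --pad-tables directly. The cost is that a failing document is converted twice.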

THuangIAN commented 3 years ago

I ran into a similar problem and have not found a solution either. How did you deal with it?

pigmonkey commented 3 years ago

I have not found any solution.