Alir3z4 / html2text

Convert HTML to Markdown-formatted text.
alir3z4.github.io/html2text/
GNU General Public License v3.0
1.74k stars 266 forks

Fix tables with empty first cell #380

Open joouha opened 2 years ago

joouha commented 2 years ago

Hello

This PR fixes an issue where an empty first cell in a table results in Markdown tables that do not parse properly.

For example:

<table>
  <tr><th></th><th>b</th></tr>
  <tr><td>c</td><td>d</td></tr>
</table>

results in the following output:

| b
---|---
c | d

which is not a valid markdown table:

b
c d

With this fix, the output is:

| | b
---|---
c | d

which renders correctly:

b
c d

joouha commented 2 years ago

If anyone else is experiencing the same issue, the following regex substitution can be used as a workaround:

import re
from html2text import HTML2Text

data = "<table><tr><td></td><td>a</td></tr></table>"

result = HTML2Text().handle(data)

print(result)

# | a
# ---|---

result = re.sub(
    r"^((\|\s+[^\|]+\s+)((\|\s+[^\|]+\s+|:?-+:?\|)(\|\s+[^\|]+\s+|:?-+:?\|))*:?-+:?\|:?-+:?\s*$)",
    r"|  \1",
    result,
    count=0,
    flags=re.MULTILINE,
)

print(result)

# |  | a
# ---|---

It adds an empty cell at the start of the table if the number of header cells does not match the number of columns in the table.
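
For readers who find the regex hard to follow, the same idea can be sketched without one. This is only an illustration under stated assumptions: `pad_table_header` is a hypothetical helper, not part of html2text, and it assumes the header row sits directly above the separator row, starts with `|`, and is exactly one cell short.

```python
def pad_table_header(markdown: str) -> str:
    """Prepend an empty cell to a header row that is one cell short.

    Hypothetical helper for illustration only; not part of html2text.
    """
    lines = markdown.splitlines()
    for i in range(1, len(lines)):
        separator = lines[i].strip()
        # A separator row contains only dashes, colons, pipes and spaces,
        # and must include at least one dash and one pipe.
        if not separator or not set(separator) <= set("-:| "):
            continue
        if "-" not in separator or "|" not in separator:
            continue
        header = lines[i - 1].strip()
        # Only touch headers that begin with a pipe, as in the broken output.
        if not header.startswith("|"):
            continue
        # Count non-empty cells on both rows.
        n_columns = len([c for c in separator.split("|") if c.strip()])
        n_header_cells = len([c for c in header.split("|") if c.strip()])
        if n_header_cells == n_columns - 1:
            lines[i - 1] = "| " + lines[i - 1]
    return "\n".join(lines)


print(pad_table_header("| b\n---|---\nc | d"))
# | | b
# ---|---
# c | d
```

Because it counts only non-empty cells, this sketch would miscount a header with an empty cell in the middle, so it is not a general replacement for the fix in this PR.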

Alir3z4 commented 5 months ago

Can you please update the code with tests?

Alir3z4 commented 5 months ago

@joouha Thanks for adding the tests and the patch. The code looks OK, but the CI run failed.

You can run the tests locally with tox and check the results yourself.