Open joouha opened 2 years ago
If anyone else is experiencing the same issue, the following regex substitution can be used as a workaround:
import re
from html2text import HTML2Text
data = "<table><tr><td></td><td>a</td></tr></table>"
result = HTML2Text().handle(data)
print(result)
# | a
# ---|---
result = re.sub(
    r"^((\|\s+[^\|]+\s+)((\|\s+[^\|]+\s+|:?-+:?\|)(\|\s+[^\|]+\s+|:?-+:?\|))*:?-+:?\|:?-+:?\s*$)",
    r"| \1",
    result,
    count=0,
    flags=re.MULTILINE,
)
print(result)
# | | a
# ---|---
It adds an empty cell at the start of the table if the number of header cells does not match the number of columns in the table.
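For anyone who wants to reuse the workaround, here is a minimal sketch that wraps the same substitution in a helper and runs it on literal markdown strings, so it can be tried without html2text installed. The function name `pad_empty_first_cell` and the sample strings are hypothetical and not part of html2text; the sample input is only assumed to resemble what html2text emits for the table above.

```python
import re

# Same pattern as in the workaround above, compiled once.
# It only matches a header row that starts with "|" and is followed
# by a separator row such as "---|---".
_HEADER_MISSING_CELL = re.compile(
    r"^((\|\s+[^\|]+\s+)((\|\s+[^\|]+\s+|:?-+:?\|)(\|\s+[^\|]+\s+|:?-+:?\|))*:?-+:?\|:?-+:?\s*$)",
    re.MULTILINE,
)


def pad_empty_first_cell(markdown: str) -> str:
    """Prepend an empty cell to table headers that start with a bare "|".

    Hypothetical helper: it applies the regex workaround from this
    thread and leaves every other part of the text unchanged.
    """
    return _HEADER_MISSING_CELL.sub(r"| \1", markdown)


if __name__ == "__main__":
    # Assumed shape of the broken output (empty first cell dropped).
    broken = "| a  \n---|---  \n"
    print(pad_empty_first_cell(broken))

    # A header that does not start with "|" is left alone.
    intact = "a | b  \n---|---  \n"
    print(pad_empty_first_cell(intact))
```

Compiling the pattern once avoids recompiling it on every call, which may matter if the helper is applied to many converted documents.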
Can you please update the code with tests?
@joouha Thanks for adding the tests and the patch. The code looks OK, but the CI run failed.
You can run the tests locally with tox and check the results yourself as well.
Hello
This PR fixes an issue where an empty first cell in a table results in markdown tables which do not parse properly.
For example:
results in the following output:
which is not a valid markdown table:
With this fix, the output is:
which renders correctly: