Alir3z4 / html2text

Convert HTML to Markdown-formatted text.
alir3z4.github.io/html2text/
GNU General Public License v3.0

Two backslashes gets converted to 3 backslashes #404

Open tomgoddard opened 11 months ago

tomgoddard commented 11 months ago

In the current PyPI release of html2text, converting a single backslash in HTML produces a single backslash in plain text. That seems right. But converting two backslashes in HTML produces 3 backslashes in plain text. It seems like two backslashes in HTML should produce two in plain text. Where I am seeing this is in HTML that shows two backslashes in some Windows file paths to indicate the backslash is escaped. When we convert it to plain text in our ChimeraX application for bug reporting, it then appears as 3 backslashes in the file names (https://www.rbvi.ucsf.edu/trac/ChimeraX/ticket/10252).

Note that in the Python strings in the test script below, two backslashes in a string literal mean just one backslash, since "\\" is an escape sequence denoting a single-character string containing one backslash.
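For readers less familiar with Python escaping, here is a minimal sketch of the point above, using only the standard language. The helper name `count_backslashes` is made up for illustration and is not part of html2text.

```python
def count_backslashes(text: str) -> int:
    """Return the number of actual backslash characters in text."""
    return text.count("\\")


one = "\\"       # literal with two backslashes: a 1-character string
two = "\\\\"     # literal with four backslashes: a 2-character string
raw_two = r"\\"  # raw literal: what you type is what you get, 2 characters

print(len(one), count_backslashes(one))
print(len(two), count_backslashes(two))
print(two == raw_two)

# repr() doubles every backslash again, which is why interactive output
# shows twice as many backslashes as the string really contains.
print(repr(one))
```

So when reading the session below, halve the number of backslashes shown in each quoted string to get the real count.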

>>> import html2text
>>> h = html2text.HTML2Text()
>>> h.handle('<p>\\</p>')      # one backslash in; one backslash out, seems right
'\\\n\n'
>>> h.handle('<p>\\\\</p>')    # two backslashes in; 3 backslashes out, seems wrong
'\n\n\\\\\\\n\n'
>>> html2text.__version__
(2020, 1, 16)
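One possible explanation, offered as a guess and not confirmed against html2text's source: a Markdown-escaping step that puts an extra backslash in front of any backslash followed by a Markdown-special character, where the backslash itself counts as special. The sketch below is a standalone illustration of that hypothesis. The names `SPECIAL_CHARS` and `escape_backslashes` and the exact character set are assumptions, not html2text's API.

```python
import re

# Hypothetical set of characters treated as Markdown-special (assumption).
SPECIAL_CHARS = "\\`*_{}[]()#+-.!"

# Match a backslash only when the next character is in SPECIAL_CHARS.
BACKSLASH_BEFORE_SPECIAL = re.compile(r"(\\)(?=[" + re.escape(SPECIAL_CHARS) + "])")


def escape_backslashes(text: str) -> str:
    """Prefix an extra backslash to each backslash that precedes a special char."""
    return BACKSLASH_BEFORE_SPECIAL.sub(r"\\\1", text)


for sample in ("\\", "\\\\"):
    escaped = escape_backslashes(sample)
    print(sample.count("\\"), "->", escaped.count("\\"))
```

Under this rule a lone backslash is left alone because nothing special follows it, while in a pair only the first backslash is followed by a special character and gets doubled, giving three. That matches the counts reported above, but it does not prove this is what the library does.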