Alir3z4 / html2text

Convert HTML to Markdown-formatted text.
alir3z4.github.io/html2text/
GNU General Public License v3.0
1.79k stars, 273 forks

bypass_tables cuts off rowspan/colspan #307

Open themikesam opened 4 years ago

themikesam commented 4 years ago

PYTHON SCRIPT

import html2text

h = html2text.HTML2Text()
h.bypass_tables = True  # emit tables as raw HTML instead of Markdown
ori = '<table border="1" cellspacing="0" cellpadding="0" width="0"><tbody><tr><td colspan="5">TITLE A</td></tr><tr><td>ROW 1 COL 1</td><td>ROW 1 COL 2</td><td>ROW 1 COL 3</td><td>ROW 1 COL 4</td><td>ROW 1 COL 5</td></tr><tr><td>ROW 2 COL 1</td><td>ROW 2 COL 2</td><td>ROW 2 COL 3</td><td>ROW 2 COL 4</td><td>ROW 2 COL 5</td></tr><tr><td>ROW 3 COL 1</td><td>ROW 3 COL 2</td><td>ROW 3 COL 3</td><td>ROW 3 COL 4</td><td>ROW 3 COL 5</td></tr><tr><td>ROW 4 COL 1</td><td>ROW 4 COL 2</td><td>ROW 4 COL 3</td><td>ROW 4 COL 4</td><td>ROW 4 COL 5</td></tr><tr><td>ROW 5 COL 1</td><td>ROW 5 COL 2</td><td>ROW 5 COL 3</td><td>ROW 5 COL 4</td><td>ROW 5 COL 5</td></tr><tr><td>ROW 6 COL 1</td><td>ROW 6 COL 2</td><td>ROW 6 COL 3</td><td>ROW 6 COL 4</td><td>ROW 6 COL 5</td></tr><tr><td colspan="5">TITLE B</td></tr><tr><td>ROW 1 COL 1</td><td>ROW 1 COL 2</td><td>ROW 1 COL 3</td><td>ROW 1 COL 4</td><td>ROW 1 COL 5</td></tr><tr><td>ROW 2 COL 1</td><td>ROW 2 COL 2</td><td>ROW 2 COL 3</td><td>ROW 2 COL 4</td><td>ROW 2 COL 5</td></tr><tr><td>ROW 3 COL 1</td><td>ROW 3 COL 2</td><td>ROW 3 COL 3</td><td>ROW 3 COL 4</td><td>ROW 3 COL 5</td></tr><tr><td>ROW 4 COL 1</td><td>ROW 4 COL 2</td><td>ROW 4 COL 3</td><td>ROW 4 COL 4</td><td>ROW 4 COL 5</td></tr><tr><td>ROW 5 COL 1</td><td>ROW 5 COL 2</td><td>ROW 5 COL 3</td><td>ROW 5 COL 4</td><td>ROW 5 COL 5</td></tr><tr><td>ROW 6 COL 1</td><td>ROW 6 COL 2</td><td>ROW 6 COL 3</td><td>ROW 6 COL 4</td><td>ROW 6 COL 5</td></tr></tbody></table>'
print(h.handle(ori)) # SEE OUTPUT IMAGE BELOW.

HTML INPUT

<table border="1" cellspacing="0" cellpadding="0" width="0">
<tbody>
    <tr>
        <td colspan="5">TITLE A</td>
    </tr>
    <tr>
        <td>ROW 1 COL 1</td>
        <td>ROW 1 COL 2</td>
        <td>ROW 1 COL 3</td>
        <td>ROW 1 COL 4</td>
        <td>ROW 1 COL 5</td>
    </tr>
    <tr>
        <td>ROW 2 COL 1</td>
        <td>ROW 2 COL 2</td>
        <td>ROW 2 COL 3</td>
        <td>ROW 2 COL 4</td>
        <td>ROW 2 COL 5</td>
    </tr>
    <tr>
        <td>ROW 3 COL 1</td>
        <td>ROW 3 COL 2</td>
        <td>ROW 3 COL 3</td>
        <td>ROW 3 COL 4</td>
        <td>ROW 3 COL 5</td>
    </tr>
    <tr>
        <td>ROW 4 COL 1</td>
        <td>ROW 4 COL 2</td>
        <td>ROW 4 COL 3</td>
        <td>ROW 4 COL 4</td>
        <td>ROW 4 COL 5</td>
    </tr>
    <tr>
        <td>ROW 5 COL 1</td>
        <td>ROW 5 COL 2</td>
        <td>ROW 5 COL 3</td>
        <td>ROW 5 COL 4</td>
        <td>ROW 5 COL 5</td>
    </tr>
    <tr>
        <td>ROW 6 COL 1</td>
        <td>ROW 6 COL 2</td>
        <td>ROW 6 COL 3</td>
        <td>ROW 6 COL 4</td>
        <td>ROW 6 COL 5</td>
    </tr>
    <tr>
        <td colspan="5">TITLE B</td>
    </tr>
    <tr>
        <td>ROW 1 COL 1</td>
        <td>ROW 1 COL 2</td>
        <td>ROW 1 COL 3</td>
        <td>ROW 1 COL 4</td>
        <td>ROW 1 COL 5</td>
    </tr>
    <tr>
        <td>ROW 2 COL 1</td>
        <td>ROW 2 COL 2</td>
        <td>ROW 2 COL 3</td>
        <td>ROW 2 COL 4</td>
        <td>ROW 2 COL 5</td>
    </tr>
    <tr>
        <td>ROW 3 COL 1</td>
        <td>ROW 3 COL 2</td>
        <td>ROW 3 COL 3</td>
        <td>ROW 3 COL 4</td>
        <td>ROW 3 COL 5</td>
    </tr>
    <tr>
        <td>ROW 4 COL 1</td>
        <td>ROW 4 COL 2</td>
        <td>ROW 4 COL 3</td>
        <td>ROW 4 COL 4</td>
        <td>ROW 4 COL 5</td>
    </tr>
    <tr>
        <td>ROW 5 COL 1</td>
        <td>ROW 5 COL 2</td>
        <td>ROW 5 COL 3</td>
        <td>ROW 5 COL 4</td>
        <td>ROW 5 COL 5</td>
    </tr>
    <tr>
        <td>ROW 6 COL 1</td>
        <td>ROW 6 COL 2</td>
        <td>ROW 6 COL 3</td>
        <td>ROW 6 COL 4</td>
        <td>ROW 6 COL 5</td>
    </tr>
</tbody>
</table>

HTML OUTPUT

<table>  
<tr>
<td>

TITLE A

</td></tr>
<tr>
<td>

ROW 1 COL 1

</td>
<td>

ROW 1 COL 2

</td>
<td>

ROW 1 COL 3

</td>
<td>

ROW 1 COL 4

</td>
<td>

ROW 1 COL 5

</td></tr>
<tr>
<td>

ROW 2 COL 1

</td>
<td>

ROW 2 COL 2

</td>
<td>

ROW 2 COL 3

</td>
<td>

ROW 2 COL 4

</td>
<td>

ROW 2 COL 5

</td></tr>
<tr>
<td>

ROW 3 COL 1

</td>
<td>

ROW 3 COL 2

</td>
<td>

ROW 3 COL 3

</td>
<td>

ROW 3 COL 4

</td>
<td>

ROW 3 COL 5

</td></tr>
<tr>
<td>

ROW 4 COL 1

</td>
<td>

ROW 4 COL 2

</td>
<td>

ROW 4 COL 3

</td>
<td>

ROW 4 COL 4

</td>
<td>

ROW 4 COL 5

</td></tr>
<tr>
<td>

ROW 5 COL 1

</td>
<td>

ROW 5 COL 2

</td>
<td>

ROW 5 COL 3

</td>
<td>

ROW 5 COL 4

</td>
<td>

ROW 5 COL 5

</td></tr>
<tr>
<td>

ROW 6 COL 1

</td>
<td>

ROW 6 COL 2

</td>
<td>

ROW 6 COL 3

</td>
<td>

ROW 6 COL 4

</td>
<td>

ROW 6 COL 5

</td></tr>
<tr>
<td>

TITLE B

</td></tr>
<tr>
<td>

ROW 1 COL 1

</td>
<td>

ROW 1 COL 2

</td>
<td>

ROW 1 COL 3

</td>
<td>

ROW 1 COL 4

</td>
<td>

ROW 1 COL 5

</td></tr>
<tr>
<td>

ROW 2 COL 1

</td>
<td>

ROW 2 COL 2

</td>
<td>

ROW 2 COL 3

</td>
<td>

ROW 2 COL 4

</td>
<td>

ROW 2 COL 5

</td></tr>
<tr>
<td>

ROW 3 COL 1

</td>
<td>

ROW 3 COL 2

</td>
<td>

ROW 3 COL 3

</td>
<td>

ROW 3 COL 4

</td>
<td>

ROW 3 COL 5

</td></tr>
<tr>
<td>

ROW 4 COL 1

</td>
<td>

ROW 4 COL 2

</td>
<td>

ROW 4 COL 3

</td>
<td>

ROW 4 COL 4

</td>
<td>

ROW 4 COL 5

</td></tr>
<tr>
<td>

ROW 5 COL 1

</td>
<td>

ROW 5 COL 2

</td>
<td>

ROW 5 COL 3

</td>
<td>

ROW 5 COL 4

</td>
<td>

ROW 5 COL 5

</td></tr>
<tr>
<td>

ROW 6 COL 1

</td>
<td>

ROW 6 COL 2

</td>
<td>

ROW 6 COL 3

</td>
<td>

ROW 6 COL 4

</td>
<td>

ROW 6 COL 5

</td></tr></table>
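
The difference between the input and the output can be checked mechanically. The following is a minimal sketch that uses only the standard library's `html.parser`; the names `SpanAttrCollector` and `collect_spans` are hypothetical helpers made up for this illustration, and the two strings are shortened versions of the input and the reported output shown in this issue.

```python
from html.parser import HTMLParser


class SpanAttrCollector(HTMLParser):
    """Collect colspan/rowspan attributes found on <td>/<th> start tags."""

    def __init__(self):
        super().__init__()
        self.spans = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names are lowercased.
        if tag in ("td", "th"):
            for name, value in attrs:
                if name in ("colspan", "rowspan"):
                    self.spans.append((name, value))


def collect_spans(html):
    """Return every (attribute, value) span pair present in the markup."""
    parser = SpanAttrCollector()
    parser.feed(html)
    parser.close()
    return parser.spans


# Shortened copies of the HTML input and the reported output.
html_input = (
    '<table border="1"><tbody><tr><td colspan="5">TITLE A</td></tr>'
    "<tr><td>ROW 1 COL 1</td></tr></tbody></table>"
)
reported_output = (
    "<table>\n<tr>\n<td>\n\nTITLE A\n\n</td></tr>\n"
    "<tr>\n<td>\n\nROW 1 COL 1\n\n</td></tr></table>"
)

print(collect_spans(html_input))       # [('colspan', '5')]
print(collect_spans(reported_output))  # []
```

A check of this shape could also serve as the core of a regression test: the span attributes collected from the output should equal those collected from the input.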

image

jdufresne commented 4 years ago

Thanks for the report. Would you like to submit a PR to fix it with a test?

themikesam commented 4 years ago

> Thanks for the report. Would you like to submit a PR to fix it with a test?

Sure. Since I'm not familiar with the codebase, could you point me to the code that handles table processing? I'll take a look and give it a try.
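
As a possible direction for a fix, the reported output suggests that table tags are re-emitted by name only, with their attributes discarded. One way to keep them is to rebuild each start tag from the parsed attribute list. The sketch below illustrates that idea with the standard library's `html.parser` only; `format_start_tag`, `TablePassthrough`, and `passthrough` are hypothetical names and are not part of html2text, and where such logic would hook into html2text is not confirmed here.

```python
from html import escape
from html.parser import HTMLParser


def format_start_tag(tag, attrs):
    """Rebuild a start tag, keeping every attribute (hypothetical helper)."""
    parts = [tag]
    for name, value in attrs:
        if value is None:
            # Bare attribute without a value, e.g. <td nowrap>.
            parts.append(name)
        else:
            parts.append('{}="{}"'.format(name, escape(value, quote=True)))
    return "<" + " ".join(parts) + ">"


class TablePassthrough(HTMLParser):
    """Echo table markup back out with its attributes intact."""

    TABLE_TAGS = {"table", "thead", "tbody", "tfoot", "tr", "th", "td"}

    def __init__(self):
        super().__init__()
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in self.TABLE_TAGS:
            self.out.append(format_start_tag(tag, attrs))

    def handle_endtag(self, tag):
        if tag in self.TABLE_TAGS:
            self.out.append("</{}>".format(tag))

    def handle_data(self, data):
        self.out.append(data)


def passthrough(html):
    """Run the markup through TablePassthrough and return the result."""
    parser = TablePassthrough()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


sample = '<table border="1"><tr><td colspan="5">TITLE A</td></tr></table>'
print(passthrough(sample))
# <table border="1"><tr><td colspan="5">TITLE A</td></tr></table>
```

Attribute values are escaped with `html.escape` so that a value containing a double quote does not break the rebuilt tag.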