Account for empty cells in table extraction (xml)

fortyfourforty commented 6 days ago

Hi,

Another inaccuracy issue in XML extraction for tables.

If a table contains one or more empty cells, the XML output simply ignores them. For example, a row that should have 3 cells ends up with only 2:

<table>
<row span="3">
<cell>a</cell>
<cell>b</cell>
</row>
<row span="3">
<cell>f</cell>
<cell>s</cell>
<cell>s</cell>
</row>
<row>
<cell>g</cell>
<cell>b</cell>
</row>
</table>

It would be better to extract empty cells as an empty string or None so that the layout stays correct:

<table>
<row span="3">
<cell>a</cell>
<cell></cell>
<cell>b</cell>
</row>
<row span="3">
<cell>f</cell>
<cell>s</cell>
<cell>s</cell>
</row>
<row>
<cell>g</cell>
<cell>b</cell>
<cell>None</cell>
</row>
</table>
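
Until something like this is supported, a rough post-processing workaround could pad short rows in the XML output. The sketch below is only an illustration, not trafilatura's behaviour: `pad_rows` is a hypothetical helper, it assumes the `<table>`/`<row>`/`<cell>` layout and `span` attribute shown above, and it can only append empty cells at the end of a row, because the original position of a dropped cell cannot be recovered from the output.

```python
import xml.etree.ElementTree as ET


def pad_rows(table_xml: str) -> str:
    """Pad every <row> with empty <cell> elements up to the table width.

    Hypothetical helper: the width is taken as the largest cell count or
    declared span found in the table. Empty cells are appended at the end
    of the row, since their real position is unknown.
    """
    table = ET.fromstring(table_xml)
    rows = table.findall("row")

    # Expected width: the widest row or the largest declared span.
    width = 0
    for row in rows:
        width = max(width, len(row.findall("cell")), int(row.get("span", 0)))

    # Append empty cells to rows that are shorter than the expected width.
    for row in rows:
        for _ in range(width - len(row.findall("cell"))):
            ET.SubElement(row, "cell")

    return ET.tostring(table, encoding="unicode")


sample = (
    '<table>'
    '<row span="3"><cell>a</cell><cell>b</cell></row>'
    '<row span="3"><cell>f</cell><cell>s</cell><cell>s</cell></row>'
    '<row><cell>g</cell><cell>b</cell></row>'
    '</table>'
)
padded = pad_rows(sample)
print(padded)
```

After padding, every row in the sample has three cells, so the table can be read column by column, although the empty cell in the first row lands at the end rather than in the middle.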

adbar commented 6 days ago

It's not a bug in itself, but I agree things could be improved. Do you want to work on a PR?

fortyfourforty commented 6 days ago

I wish I could but my little, self-taught knowledge of Python and GitHub does not allow me to get my hands on PRs. 😞

adbar / trafilatura

Account for empty cells in table extraction (xml) #633