adbar / trafilatura

Python & command-line tool to gather text on the Web: Crawling & scraping, content extraction, metadata. TXT, Markdown, CSV & XML output.
https://trafilatura.readthedocs.io
Apache License 2.0

Account for empty cells in table extraction (xml) #633

Open fortyfourforty opened 6 days ago

fortyfourforty commented 6 days ago

Hi,

Another inaccuracy issue in XML extraction for tables.

If the table contains one or more empty cells, the XML output simply ignores them. For example, a row with 3 cells ends up with only 2:

<table>
<row span="3">
<cell>a</cell>
<cell>b</cell>
</row>
<row span="3">
<cell>f</cell>
<cell>s</cell>
<cell>s</cell>
</row>
<row>
<cell>g</cell>
<cell>b</cell>
</row>
</table>

It would be better to extract empty cells as an empty string or None so the layout stays correct:

<table>
<row span="3">
<cell>a</cell>
<cell></cell>
<cell>b</cell>
</row>
<row span="3">
<cell>f</cell>
<cell>s</cell>
<cell>s</cell>
</row>
<row>
<cell>g</cell>
<cell>b</cell>
<cell>None</cell>
</row>
</table>
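
For illustration, here is a minimal sketch of the behaviour requested above. It is not trafilatura's actual code: it uses only the standard library, assumes a well-formed `<table>` snippet with no nested tables, and the helper names `convert_table` and `as_lists` are hypothetical. Every `<td>`/`<th>` becomes a `<cell>`, empty ones included, and short rows are padded with empty cells up to the widest row. The `span` attribute is left out on purpose.

```python
import xml.etree.ElementTree as ET


def convert_table(html_table: str) -> ET.Element:
    """Build <table>/<row>/<cell> elements from an HTML table, keeping empty cells."""
    source = ET.fromstring(html_table)
    # Collect the direct <td>/<th> children of every <tr>, empty ones included.
    rows = [[c for c in tr if c.tag in ("td", "th")] for tr in source.iter("tr")]
    width = max((len(cells) for cells in rows), default=0)

    table = ET.Element("table")
    for cells in rows:
        row = ET.SubElement(table, "row")
        for source_cell in cells:
            cell = ET.SubElement(row, "cell")
            # An empty source cell yields an empty string instead of being dropped.
            cell.text = "".join(source_cell.itertext()).strip()
        # Pad rows that are shorter than the widest row (assumption: pad on the right).
        for _ in range(width - len(cells)):
            ET.SubElement(row, "cell").text = ""
    return table


def as_lists(table: ET.Element) -> list:
    """Return the cell texts as a list of rows, for easy inspection."""
    return [[cell.text or "" for cell in row] for row in table]


SAMPLE = """<table>
<tr><td>a</td><td></td><td>b</td></tr>
<tr><td>f</td><td>s</td><td>s</td></tr>
<tr><td>g</td><td>b</td></tr>
</table>"""

if __name__ == "__main__":
    result = convert_table(SAMPLE)
    # short_empty_elements=False writes <cell></cell> rather than <cell />.
    print(ET.tostring(result, encoding="unicode", short_empty_elements=False))
```

With this approach every row keeps the same number of cells, so column positions survive the conversion. Whether a missing trailing cell should be padded at all is a design choice; the sketch pads with an empty string rather than the literal text None.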

adbar commented 6 days ago

It's not a bug in itself, but I agree things could be improved. Do you want to work on a PR?

fortyfourforty commented 6 days ago

I wish I could, but my limited, self-taught knowledge of Python and GitHub is not enough for me to work on PRs. 😞