adbar / trafilatura

Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
https://trafilatura.readthedocs.io
Apache License 2.0

Doesn't extract links in table #523

Open obeone opened 5 months ago

obeone commented 5 months ago

I'm trying to convert this page to XML: https://pve.proxmox.com/pve-docs/

In particular, this part:


<table class="tableblock frame-all grid-all" style="width:100%;">
  <col style="width:50%;">
  <col style="width:50%;">
  <thead>
    <tr>
      <th class="tableblock halign-left valign-top">Format </th>
      <th class="tableblock halign-left valign-top">Link</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td class="tableblock halign-left valign-top">
        <p class="tableblock">Printable version</p>
      </td>
      <td class="tableblock halign-left valign-top">
        <p class="tableblock">
          <a href="pve-admin-guide.pdf">pve-admin-guide.pdf</a>
        </p>
      </td>
    </tr>
    <tr>
      <td class="tableblock halign-left valign-top">
        <p class="tableblock">Online HTML version</p>
      </td>
      <td class="tableblock halign-left valign-top">
        <p class="tableblock">
          <a href="pve-admin-guide.html">pve-admin-guide.html</a>
        </p>
      </td>
    </tr>
    <tr>
      <td class="tableblock halign-left valign-top">
        <p class="tableblock">E-Book version</p>
      </td>
      <td class="tableblock halign-left valign-top">
        <p class="tableblock">
          <a href="pve-admin-guide.epub">pve-admin-guide.epub</a>
        </p>
      </td>
    </tr>
  </tbody>
</table>

Tables are returned without links, and the cells that contained them are missing entirely:

    <table>
      <row>
        <cell role="head">Format</cell>
        <cell role="head">Link</cell>
      </row>
      <row>
        <cell>
          <p>Printable version</p>
        </cell>
      </row>
      <row>
        <cell>
          <p>Online HTML version</p>
        </cell>
      </row>
      <row>
        <cell>
          <p>E-Book version</p>
        </cell>
      </row>
    </table>

Here are my extraction parameters (using trafilatura 1.7.0):

    trafilatura.extract(downloaded, output_format='xml', include_formatting=True, include_links=True, include_tables=True)
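
To make the problem easier to check automatically, here is a minimal, stdlib-only sketch that compares the links found in the HTML tables with the links present in the XML output. `TableLinkCollector` and `missing_table_links` are hypothetical helpers written for this report, not part of trafilatura. The sketch assumes that links are serialized as `<ref target="...">` elements in the XML output, and it uses shortened versions of the input and output quoted above.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class TableLinkCollector(HTMLParser):
    """Collect href values of <a> tags that occur inside a <table>."""

    def __init__(self):
        super().__init__()
        self.table_depth = 0  # > 0 while inside (possibly nested) tables
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.table_depth += 1
        elif tag == "a" and self.table_depth > 0:
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)

    def handle_endtag(self, tag):
        if tag == "table" and self.table_depth > 0:
            self.table_depth -= 1


def missing_table_links(html, xml_output):
    """Return table hrefs from the HTML with no matching <ref> in the XML."""
    collector = TableLinkCollector()
    collector.feed(html)
    root = ET.fromstring(xml_output)
    # Assumption: links appear as <ref target="..."> inside <table> elements.
    targets = [
        ref.get("target", "")
        for table in root.iter("table")
        for ref in table.iter("ref")
    ]
    # Targets may have been made absolute, so compare on the URL suffix.
    return [
        href for href in collector.hrefs
        if not any(target.endswith(href) for target in targets)
    ]


# Shortened versions of the input and output quoted in this report.
html = """<table><tbody><tr>
<td><p>Printable version</p></td>
<td><p><a href="pve-admin-guide.pdf">pve-admin-guide.pdf</a></p></td>
</tr></tbody></table>"""

xml_output = """<doc><main><table>
<row><cell><p>Printable version</p></cell></row>
</table></main></doc>"""

print(missing_table_links(html, xml_output))  # ['pve-admin-guide.pdf']
```

Passing the real page source and the string returned by the `trafilatura.extract` call above should list every table link that was dropped; an empty list would mean the links were preserved.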

adbar commented 5 months ago

Hi @obeone, indeed. The links were not my original focus and there are a few problems with link extraction.