adbar / trafilatura

Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
https://trafilatura.readthedocs.io
Apache License 2.0
3.65k stars, 261 forks

some extraction duplicated in xml #634

Open fortyfourforty opened 4 months ago

fortyfourforty commented 4 months ago

hi,

I was setting up a test site to play with trafilatura and found a weird bug.

Site URL: https://milkfriends.s1-tastewp.com/2024/06/27/ok-this/ (this test site is only available for 2 days, so I have also attached the simple Gutenberg block code below so that you can replicate the issue)

Command:

import trafilatura
url = "https://milkfriends.s1-tastewp.com/2024/06/27/ok-this/"
html = trafilatura.fetch_url(url, no_ssl=True)
ts = trafilatura.extract(html, output_format='xml', include_comments=False)

The WordPress Gutenberg HTML is below:

<!-- wp:paragraph -->
<p>this is sample intro</p>
<!-- /wp:paragraph -->

<!-- wp:heading {"level":3} -->
<h3 class="wp-block-heading">intro 2</h3>
<!-- /wp:heading -->

<!-- wp:paragraph -->
<p>table below</p>
<!-- /wp:paragraph -->

<!-- wp:table -->
<figure class="wp-block-table"><table><tbody><tr><td>a</td><td>b</td><td></td></tr><tr><td>f</td><td>s</td><td>s</td></tr><tr><td>g</td><td></td><td>b</td></tr></tbody></table></figure>
<!-- /wp:table -->

<!-- wp:paragraph -->
<p>header table below</p>
<!-- /wp:paragraph -->

<!-- wp:table -->
<figure class="wp-block-table"><table><thead><tr><th>b</th><th>s</th><th>h</th></tr></thead><tbody><tr><td>a</td><td>b</td><td></td></tr><tr><td>f</td><td>s</td><td>s</td></tr><tr><td>g</td><td></td><td>b</td></tr></tbody></table></figure>
<!-- /wp:table -->

<!-- wp:paragraph -->
<p>list below</p>
<!-- /wp:paragraph -->

<!-- wp:list -->
<ul><!-- wp:list-item -->
<li>this is 1</li>
<!-- /wp:list-item -->

<!-- wp:list-item -->
<li>this is 2</li>
<!-- /wp:list-item -->

<!-- wp:list-item -->
<li>this is 3</li>
<!-- /wp:list-item --></ul>
<!-- /wp:list -->

<!-- wp:paragraph -->
<p>numbered list below</p>
<!-- /wp:paragraph -->

<!-- wp:list {"ordered":true} -->
<ol><!-- wp:list-item -->
<li>this is 1</li>
<!-- /wp:list-item -->

<!-- wp:list-item -->
<li>this is 2</li>
<!-- /wp:list-item -->

<!-- wp:list-item -->
<li>this is 3</li>
<!-- /wp:list-item --></ol>
<!-- /wp:list -->

It is a very simple extraction, but I find that some elements are extracted twice. Elements below "this is sample intro" appear twice, but not all of them: some of the list elements only show up once.

See the extraction below:

<doc sitename="milkfriends.s1-tastewp.com" title="ok this" author="Admin" date="2024-06-27" url="https://milkfriends.s1-tastewp.com/2024/06/27/ok-this/" hostname="s1-tastewp.com" fingerprint="f69d7033beefe32d">
  <main>
    <p>this is sample intro</p>
    <head rend="h3">intro 2</head>
    <p>table below</p>
    <table>
      <row span="3">
        <cell>a</cell>
        <cell>b</cell>
      </row>
      <row span="3">
        <cell>f</cell>
        <cell>s</cell>
        <cell>s</cell>
      </row>
      <row>
        <cell>g</cell>
        <cell>b</cell>
      </row>
    </table>
    <p>header table below</p>
    <table>
      <row span="3">
        <cell role="head">b</cell>
        <cell role="head">s</cell>
        <cell role="head">h</cell>
      </row>
      <row span="3">
        <cell>a</cell>
        <cell>b</cell>
      </row>
      <row span="3">
        <cell>f</cell>
        <cell>s</cell>
        <cell>s</cell>
      </row>
      <row>
        <cell>g</cell>
        <cell>b</cell>
      </row>
    </table>
    <p>list below</p>
    <list rend="ul">
      <item>this is 1</item>
      <item>this is 2</item>
      <item>this is 3</item>
    </list>
    <p>numbered list below</p>
    <list rend="ol">
      <item>this is 1</item>
      <item>this is 2</item>
      <item>this is 3</item>
    </list>
    <p>this is sample intro</p>
    <p>table below</p>
    <table>
      <row span="3">
        <cell>a</cell>
        <cell>b</cell>
      </row>
      <row span="3">
        <cell>f</cell>
        <cell>s</cell>
        <cell>s</cell>
      </row>
      <row>
        <cell>g</cell>
        <cell>b</cell>
      </row>
    </table>
    <p>header table below</p>
    <table>
      <row span="3">
        <cell role="head">b</cell>
        <cell role="head">s</cell>
        <cell role="head">h</cell>
      </row>
      <row span="3">
        <cell>a</cell>
        <cell>b</cell>
      </row>
      <row span="3">
        <cell>f</cell>
        <cell>s</cell>
        <cell>s</cell>
      </row>
      <row>
        <cell>g</cell>
        <cell>b</cell>
      </row>
    </table>
    <p>list below</p>
    <p>numbered list below</p>
  </main>
</doc>
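
To see exactly which blocks are repeated in an output like the one above, a small check along these lines may help. It is only a sketch using the Python standard library, not part of trafilatura: the helper name `find_repeated_blocks` and the shortened `sample` string are made up for illustration, and it assumes the blocks of interest are direct children of `<main>`.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def find_repeated_blocks(xml_string):
    """Hypothetical helper: count direct children of <main> that occur more than once."""
    root = ET.fromstring(xml_string)
    main = root.find("main")
    if main is None:
        return {}
    keys = []
    for child in main:
        # Whitespace-normalised text of the block, including nested cells or items.
        text = " ".join("".join(child.itertext()).split())
        # Attributes are part of the key so that rend="ul" and rend="ol" stay distinct.
        attrs = tuple(sorted(child.attrib.items()))
        keys.append((child.tag, attrs, text))
    counts = Counter(keys)
    return {key: n for key, n in counts.items() if n > 1}


# Shortened stand-in for the extraction output, not the full document.
sample = """<doc>
  <main>
    <p>this is sample intro</p>
    <head rend="h3">intro 2</head>
    <p>list below</p>
    <list rend="ul">
      <item>this is 1</item>
      <item>this is 2</item>
    </list>
    <list rend="ol">
      <item>this is 1</item>
      <item>this is 2</item>
    </list>
    <p>this is sample intro</p>
    <p>list below</p>
  </main>
</doc>"""

for (tag, attrs, text), n in find_repeated_blocks(sample).items():
    print(f"{n}x <{tag}> {text}")
```

On the shortened sample only the two paragraphs are reported. The two lists share the same items but differ in their `rend` attribute, so they are counted as distinct blocks.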

adbar commented 4 months ago

I'm not sure what happens here, but this is odd indeed. Note that you can use a web archive so that the errors can still be reproduced later.

In general, duplicated elements can be easily tackled by using the integrated deduplication filters and setting the right threshold.
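
As a rough illustration of what a threshold-based filter does, the idea can be sketched in plain Python. This is not trafilatura's actual implementation: the function name `drop_repeated_segments` and the `max_occurrences` parameter are hypothetical, and the sketch simply assumes that each text segment is kept until it has been seen a given number of times.

```python
from collections import Counter


def drop_repeated_segments(segments, max_occurrences=1):
    """Hypothetical filter: keep each segment at most max_occurrences times."""
    seen = Counter()
    kept = []
    for segment in segments:
        key = segment.strip()
        # Keep the segment only while it is still under the threshold.
        if seen[key] < max_occurrences:
            kept.append(segment)
        seen[key] += 1
    return kept


segments = [
    "this is sample intro",
    "table below",
    "this is sample intro",
    "list below",
    "table below",
]

strict = drop_repeated_segments(segments, max_occurrences=1)
lenient = drop_repeated_segments(segments, max_occurrences=2)
```

With `max_occurrences=1` every second copy is dropped, while `max_occurrences=2` leaves this input untouched. The threshold is a trade-off: a low value removes spurious repeats, but it also removes text that legitimately occurs more than once on a page.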

fortyfourforty commented 4 months ago

Sorry, I forgot about archive.is. Noted.

I don't think using deduplicate=True is a valid workaround, as some pages legitimately contain the exact same text segments more than once.

adbar commented 3 months ago

@fortyfourforty The integrated deduplication does prevent identical text segments on the same page.