adbar / trafilatura

Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
https://trafilatura.readthedocs.io
Apache License 2.0

anchor issue #147

Closed pieterhartel closed 6 months ago

pieterhartel commented 2 years ago

It seems that a link without an href is sometimes ignored. Consider the sample HTML below:

$ cat anchor.html 
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"><head>
</head>
<body>

<h1>FOO.</h1>
<p><strong>FOO!</strong></p>
<p>BE AWARE OF SCAMMERS WHO COPY OUR SITE! This is our one and ONLY site - <a href="http://peyueomdqxfjxtpg.onion">http://peyueomdqxfjxtpg.onion</a> Please bookmark us.</p>

    <h1>The quick brown fox jumps over the lazy dog  1</h1>
    <a>The quick brown fox jumps over the lazy dog  2</a>
    <h1><a>The quick brown fox jumps over the lazy dog  3</a></h1>
    The quick brown fox jumps over the lazy dog  4
Lorem ipsum
</body></html>

I would expect four occurrences of "The quick brown fox jumps over the lazy dog", numbered 1, 2, 3 and 4. But #3 is missing:

$ trafilatura --json --links <anchor.html 
{"title": "FOO.", "author": null, "hostname": null, "date": null, "categories": "", "tags": "",
"fingerprint": "O0AByIFzTc/NCqx2cgJPXyjnK3s=", "id": null, "license": null,
"raw-text": "FOO. FOO! BE AWARE OF SCAMMERS WHO COPY OUR SITE! This is our one and ONLY site - http://peyueomdqxfjxtpg.onion Please bookmark us. The quick brown fox jumps over the lazy dog 1 The quick brown fox jumps over the lazy dog 2 The quick brown fox jumps over the lazy dog 4 Lorem ipsum",
"source": null, "source-hostname": null, "excerpt": null,
"text": "FOO.\nFOO!\nBE AWARE OF SCAMMERS WHO COPY OUR SITE! This is our one and ONLY site - http://peyueomdqxfjxtpg.onion Please bookmark us.\nThe quick brown fox jumps over the lazy dog 1\nThe quick brown fox jumps over the lazy dog 2\nThe quick brown fox jumps over the lazy dog 4\nLorem ipsum",
"comments": ""}

adbar commented 2 years ago

@pieterhartel There was a small issue here, which I fixed. The rest can be explained by the orphan text at the bottom. If you write <p>The quick brown fox jumps over the lazy dog 4</p> then you will see it in the output.

The reason is that trailing titles at the bottom of articles are discarded during extraction, which improves extraction quality overall. In this particular case it does not work, but in general it is not a bug IMO.
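
The suggestion above, wrapping the orphan text in <p>, can also be applied programmatically before extraction. The following is only a minimal sketch using the standard library: wrap_orphan_text is a hypothetical helper, not part of Trafilatura, and xml.etree.ElementTree only accepts well-formed XHTML such as the sample in this issue, so real-world pages would need a more tolerant parser. Whether the rewritten markup changes the extraction result is not verified here.

```python
import xml.etree.ElementTree as ET


def wrap_orphan_text(body):
    """Hypothetical helper: move untagged text that follows a direct child
    of `body` into a new <p> element placed right after that child."""
    # Walk backwards so that inserting elements does not shift
    # the indexes of children that are still to be visited.
    for index, child in reversed(list(enumerate(body))):
        tail = (child.tail or "").strip()
        if tail:
            paragraph = ET.Element("p")
            paragraph.text = tail
            paragraph.tail = "\n"
            child.tail = "\n"
            body.insert(index + 1, paragraph)
    return body


body = ET.fromstring("<body><h1><a>Title 3</a></h1> Orphan text 4 </body>")
wrap_orphan_text(body)
# The markup now ends with a <p> element instead of untagged text.
print(ET.tostring(body, encoding="unicode"))
```

The serialized result could then be passed to the extractor in place of the original markup.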

pieterhartel commented 2 years ago

@adbar wrote "The reason is that trailing titles at the bottom of articles are discarded during extraction". I don't think that this is the case. There is text following the <h1> element in the example. I installed the latest version of trafilatura and in the example added more text after the <h1> and the title still does not show up.

adbar commented 2 years ago

I get your point, but the last title in your example is followed by orphan text without a tag, so the last tag seen by the parser is <h1>.
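
To illustrate why the <h1> counts as the last tag, here is a small sketch with the standard library's xml.etree.ElementTree, which uses the same text/tail model as lxml. The shortened sample is an assumption made for readability, and the snippet only shows how the tree is shaped, not how Trafilatura decides what to discard.

```python
import xml.etree.ElementTree as ET

# Shortened version of the sample body from this issue.
SAMPLE = """<body>
<h1>The quick brown fox 1</h1>
<a>The quick brown fox 2</a>
<h1><a>The quick brown fox 3</a></h1>
    The quick brown fox 4
Lorem ipsum
</body>"""

body = ET.fromstring(SAMPLE)
last_element = list(body)[-1]

# The final child element is the <h1>; the untagged text is not an element.
print(last_element.tag)
# The orphan text is attached to that <h1> as its tail.
print(repr(last_element.tail))
# The heading's own text sits on the nested <a>, not on the <h1> itself.
print(last_element.text, last_element[0].text)
```

In this tree there is no element after the second <h1>, which is consistent with the heading being treated as a trailing title.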

chakravir commented 1 year ago

I don't know how common this is, but there are definitely pages where there is valuable text in h1 blocks lower in the page. An example I ran into is https://mywellself.ca/about-us.

Is there any workaround to extract the info in these multiple h1 blocks and include it?

adbar commented 1 year ago

@chakravir Trafilatura tries to work in a generic way, so there is little room for customization.
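
One possible workaround outside Trafilatura is to collect the headings in a separate pass and merge them with the extracted text afterwards. The sketch below relies only on the standard library's html.parser; HeadingCollector and collect_headings are hypothetical names, and this is not a feature of Trafilatura itself.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Hypothetical helper: gather the text of every element with a given tag."""

    def __init__(self, tag="h1"):
        super().__init__()
        self.tag = tag
        self.depth = 0      # > 0 while inside a matching element
        self.buffer = []    # text fragments of the current heading
        self.headings = []  # finished headings, in document order

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1
            if self.depth == 1:
                self.buffer = []

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                # Collapse runs of whitespace into single spaces.
                text = " ".join("".join(self.buffer).split())
                if text:
                    self.headings.append(text)

    def handle_data(self, data):
        if self.depth > 0:
            self.buffer.append(data)


def collect_headings(html, tag="h1"):
    collector = HeadingCollector(tag)
    collector.feed(html)
    collector.close()
    return collector.headings


sample = "<h1>FOO.</h1><p>Body text</p><h1><a>The quick  brown fox 3</a></h1> orphan text"
print(collect_headings(sample))
```

Text nested inside inline elements such as <a> is included, while text outside the headings is ignored.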