adbar / trafilatura

Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
https://trafilatura.readthedocs.io
Apache License 2.0

Title repeated in the body #220

Open dmoklaf opened 2 years ago

dmoklaf commented 2 years ago

In some cases, Trafilatura repeats the document title in the body.

E.g. with this URL:

https://sebastianraschka.com/blog/2022/confidence-intervals-for-ml.html

and this code:

extraction = trafilatura.core.bare_extraction(content_tree, include_comments=False)
title = extraction["title"]
body = extraction["text"]

Trafilatura sends back this title:

Creating Confidence Intervals for Machine Learning Classifiers

And this body:

Creating Confidence Intervals for Machine Learning Classifiers Developing good predictive models hinges upon accurate...

As a workaround, I am currently doing this in my code to get rid of the repetition:

if body is not None and title is not None and body.startswith(title):
    body = body[len(title) :].lstrip()

This should be done within the Trafilatura library (unless I misunderstood what the "text" field means).

adbar commented 2 years ago

Whether the title is part of a text or not is an open question...

In this example the extractor goes for <div class="post"> (with <h1>) and not for <article class="post-content"> (without).

Regardless of the extraction itself, it is doable to check whether the metadata already contains a title of length l that is equal to body[:l].

dmoklaf commented 2 years ago

Yes, that's what I am already doing (removing the title from the beginning of the text).
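
The check discussed in this thread (compare the metadata title with the first l characters of the body) can be sketched as a small post-processing helper. This is only an illustration of the workaround, not Trafilatura's own behaviour: the function name `strip_leading_title` is hypothetical, and it assumes the title and body are plain strings already taken from the extraction result.

```python
from typing import Optional


def strip_leading_title(body: Optional[str], title: Optional[str]) -> Optional[str]:
    """Return the body without a leading copy of the title.

    Hypothetical helper, not part of Trafilatura. The body is returned
    unchanged when either value is missing or empty, or when the body
    does not start with the title.
    """
    if not body or not title:
        return body
    title_len = len(title)
    # Same test as suggested above: the title of length l equals body[:l].
    if body[:title_len] == title:
        # Drop the title and any whitespace that separated it from the text.
        return body[title_len:].lstrip()
    return body


title = "Creating Confidence Intervals for Machine Learning Classifiers"
body = (
    "Creating Confidence Intervals for Machine Learning Classifiers "
    "Developing good predictive models hinges upon accurate..."
)
cleaned = strip_leading_title(body, title)
print(cleaned)
```

The comparison is exact, so a title that differs from the heading in case or whitespace is left in place; a more tolerant match would be a separate design choice.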