dmoklaf opened this issue 2 years ago (status: Open)
Whether the title is part of a text or not is an open question...
In this example the extractor goes for `<div class="post">` (with `<h1>`) and not for `<article class="post-content">` (without).
Regardless of the extraction itself, it's doable to check whether the metadata already contains a title of length `l` that is equal to `body[:l]`.
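For illustration, that prefix check could be sketched roughly as follows. This is only a sketch, not Trafilatura's code: `strip_leading_title` is a hypothetical helper, and it assumes the title and the body are already available as plain strings. The extra word-boundary test is my own assumption, added so that a short title does not eat the start of a longer first word.

```python
from typing import Optional


def strip_leading_title(title: Optional[str], body: str) -> str:
    """Hypothetical helper: drop a leading copy of `title` from `body`."""
    if not title:
        return body
    title = title.strip()
    l = len(title)
    if l == 0:
        return body
    # Ignore leading whitespace in the body before comparing.
    stripped = body.lstrip()
    if stripped[:l] != title:
        return body
    remainder = stripped[l:]
    # Assumption: only treat it as a repeated title if the match ends
    # at a whitespace boundary (or at the end of the body).
    if remainder and not remainder[0].isspace():
        return body
    return remainder.lstrip()


# Made-up values, not taken from the URL in this issue.
print(strip_leading_title("My Title", "My Title\nFirst paragraph."))
```

An exact comparison is deliberately conservative: if the title in the metadata differs from the heading in the body (site name suffix, different casing), the body is returned unchanged.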
Yes, that's what I am already doing (removing the title from the text beginning).
In some cases, Trafilatura repeats the document title in the body.
E.g. with this URL:
and this code:
Trafilatura sends back this title:
And this body:
As a workaround, I am currently doing this in my code to get rid of the repetition:
This should be done within the Trafilatura library itself (unless I misunderstood what the "text" field means).