Psynbiotik opened this issue 1 month ago (status: Open)
Hi @Psynbiotik, thanks for the detailed bug report.
Trafilatura is geared towards real-world cases and synthetic examples do not always work well. The problem here is that the text is near the limit of what counts as potentially successful extraction (250 chars), so another algorithm is triggered to retrieve additional content, which causes the issue.
Lowering the length threshold in the settings or removing a character from your example already fixes the problem.
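As a rough sketch of the settings workaround, assuming the 250-character limit corresponds to the `MIN_EXTRACTED_SIZE` entry in Trafilatura's default configuration and that `extract()` accepts a `config` argument: the helper names `with_lower_threshold` and `extract_with_lower_threshold` and the value 100 are made up for illustration.

```python
from configparser import ConfigParser
from copy import deepcopy


def with_lower_threshold(base_config: ConfigParser, min_size: int) -> ConfigParser:
    """Return a copy of base_config with a smaller MIN_EXTRACTED_SIZE."""
    config = deepcopy(base_config)  # leave the shared defaults untouched
    # Assumption: this key is the 250-character limit discussed above.
    config["DEFAULT"]["MIN_EXTRACTED_SIZE"] = str(min_size)
    return config


def extract_with_lower_threshold(html: str, min_size: int = 100):
    """Run trafilatura's extract() with the lowered threshold."""
    # Imported lazily so the helper above can be used without trafilatura.
    from trafilatura import extract
    from trafilatura.settings import DEFAULT_CONFIG

    return extract(html, config=with_lower_threshold(DEFAULT_CONFIG, min_size))
```

Copying the configuration before changing it keeps the override local to one call, so other extractions in the same process keep the default threshold.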
Real-world pages tend to be bloated rather than too short, so the probability of finding a page like this in the wild is extremely low and this is not a pressing concern. Attempting to fix this particular case, however, decreases accuracy for reasons I do not fully understand.
It is still a problem, though, so we can leave the thread open until it is solved.
This is actually a reduced version of a real webpage. However, in the real webpage only the squishing of words together occurs, not the doubling issue.
Then it would be interesting to isolate the problem so that I can reproduce it. In your example the two issues are linked to one another.
Here is an example that only includes the word squishing issue, where whitespace between words is sometimes removed:
def test_white_space_issue():
    from trafilatura import extract

    # Bare text nodes follow a <p> and precede a <div>/<h4>; the HTML is kept
    # at column 0 so its whitespace matches the original report.
    html_string = """<!DOCTYPE html>
<html lang="en-us">
<body>
<main>
<section>
<p>First</p>
This gets Squished
<div>
<h4>There should be a space</h4>
<p>Another sentence</p>
This also gets Squished
</div>
<div>
<h4>Where is the space</h4>
<p>This sentence has to be long enough. If it's long enough the duplication stops, but if it's not long enough then you'll get an extra first. This sentence has to be long enough. If it's long enough the duplication stops, but if it's not long enough then you'll get an extra first </p>
</div>
</section>
</main>
</body>
</html>
"""
    result = extract(html_string)
    # extract() returns None when nothing is extracted, so check that first.
    assert result is not None
    # Words from adjacent elements must not be fused together.
    assert "SquishedThere" not in result
    assert "SquishedWhere" not in result
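The two hard-coded assertions above can be generalized. Below is a minimal sketch, using only the standard library, that lists word pairs from consecutive text nodes that appear fused in the extracted output. The names `_TextChunks` and `find_fused_words` are hypothetical and are not part of Trafilatura.

```python
from html.parser import HTMLParser


class _TextChunks(HTMLParser):
    """Collect each non-empty text node of an HTML document, in order."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = " ".join(data.split())  # normalize internal whitespace
        if text:
            self.chunks.append(text)


def find_fused_words(html: str, extracted: str) -> list[str]:
    """Return fused word pairs, e.g. 'SquishedThere', found in extracted."""
    parser = _TextChunks()
    parser.feed(html)
    parser.close()
    fused = []
    for left, right in zip(parser.chunks, parser.chunks[1:]):
        # Last word of one text node glued to the first word of the next.
        candidate = left.split()[-1] + right.split()[0]
        if candidate in extracted and candidate not in fused:
            fused.append(candidate)
    return fused
```

This only compares text nodes that are directly adjacent in the source, and it can report a false positive if the fused string happens to be a legitimate word in the page.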
This example shows that data is duplicated and words are squished together even though they are distinct in the HTML.
This results in: 'First This gets SquishedThere should be a space Another sentence This also gets SquishedWhere is the space This sentence has to be long enough. First This gets SquishedAnother sentence This also gets SquishedThis sentence has to be long enough.'
You can see that "First" appears twice even though it is in the HTML only once, and the same happens to some other sentences. Several words are also squished together, with the space between them removed.