adbar / trafilatura

Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
https://trafilatura.readthedocs.io
Apache License 2.0

Faulty extraction for very short documents #660

Open Psynbiotik opened 1 month ago

Psynbiotik commented 1 month ago

This example shows that text is duplicated and words are squished together even though they are distinct in the HTML.

python:

from trafilatura import extract

html_string = """<!DOCTYPE html>
<html lang="en-us">
<body>
<main>
    <section>
        <p>First</p>
        This gets Squished
        <div>
            <h4>There should be a space</h4>
            <p>Another sentence</p>
            This also gets Squished
        </div>
        <div>
            <h4>Where is the space</h4>
            <p>This sentence has to be long enough.</p>
        </div>
    </section>
</main>
</body>
</html>
"""

print(extract(html_string))

This results in this: 'First This gets SquishedThere should be a space Another sentence This also gets SquishedWhere is the space This sentence has to be long enough. First This gets SquishedAnother sentence This also gets SquishedThis sentence has to be long enough.'

You can see that "First" appears twice even though it occurs only once in the HTML, and the same goes for some other sentences. Several words are also squished together, with the space between them removed.
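Both symptoms can be checked mechanically against the output quoted above. The following is a minimal sketch using only the standard library. The `expected` list is just the text fragments from the HTML, and the regex is an assumed heuristic (a lowercase letter directly followed by an uppercase one), so it would also flag legitimate camel-case words.

```python
import re

# Output exactly as quoted in the report above.
output = (
    "First This gets SquishedThere should be a space Another sentence "
    "This also gets SquishedWhere is the space This sentence has to be long enough. "
    "First This gets SquishedAnother sentence This also gets SquishedThis sentence "
    "has to be long enough."
)

# Text fragments from the HTML; each occurs exactly once in the source.
expected = [
    "First",
    "This gets Squished",
    "There should be a space",
    "Another sentence",
    "This also gets Squished",
    "Where is the space",
    "This sentence has to be long enough.",
]

# Fragments that show up more than once in the extracted text.
duplicated = [frag for frag in expected if output.count(frag) > 1]

# Heuristic (assumption): lowercase followed directly by uppercase
# inside one token suggests that a separating space was lost.
fused = re.findall(r"[A-Za-z]+[a-z][A-Z][a-z]+", output)

print(duplicated)
print(fused)
```

On the quoted output this flags every fragment except the two `<h4>` headings as duplicated, and reports four fused tokens, each starting with "Squished".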

adbar commented 1 month ago

Hi @Psynbiotik, thanks for the detailed bug report.

Trafilatura is geared towards real-world cases and synthetic examples do not always work well. The problem here is that the text is near the limit of what counts as potentially successful extraction (250 chars), so another algorithm is triggered to retrieve additional content, which causes the issue.

Lowering the length threshold in the settings or removing a character from your example already fixes the problem.

Real-world pages tend to be bloated rather than too short, so the probability of finding a page like this in the wild is extremely low and this is not a major concern. Attempting to fix this particular case, however, decreases accuracy for reasons I do not exactly understand.

It is still a problem though, so we can leave the thread open until it is solved.

Psynbiotik commented 1 month ago

This is actually a reduced version of a real webpage. However, in the real webpage only the squishing of words together occurs, not the doubling issue.

adbar commented 1 month ago

Then it would be interesting to isolate the problem so that I can reproduce it. In your example both problems are linked to one another.

Psynbiotik commented 1 month ago

Here is an example that only includes the word squishing issue, where whitespace between words is sometimes removed:

    def test_white_space_issue():
        from trafilatura import extract

        html_string = """<!DOCTYPE html>
        <html lang="en-us">
        <body>
        <main>
            <section>
                <p>First</p>
                This gets Squished
                <div>
                    <h4>There should be a space</h4>
                    <p>Another sentence</p>
                    This also gets Squished
                </div>
                <div>
                    <h4>Where is the space</h4>
                    <p>This sentence has to be long enough. If it's long enough the duplication stops, but if it's not long enough then you'll get an extra first. This sentence has to be long enough. If it's long enough the duplication stops, but if it's not long enough then you'll get an extra first </p>
                </div>
            </section>
        </main>
        </body>
        </html>
        """
        result = extract(html_string)
        assert "SquishedThere" not in result
        assert "SquishedWhere" not in result