adbar / trafilatura

Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
https://trafilatura.readthedocs.io
Apache License 2.0

Add option to provide XPaths for content extraction #596

Open klvbdmh opened 6 months ago

klvbdmh commented 6 months ago

I can't parse comments on Reddit pages (old.reddit.com). The issue is likely due to the comments being encapsulated within form elements, which the parser may not be handling correctly.

Steps to reproduce: Run trafilatura -u "https://old.reddit.com/r/programming/comments/1cnvy7y/how_stripe_prevents_double_payment_using/"

Expected Behavior: trafilatura should successfully parse and extract all visible Reddit comments.

Actual Behavior: Only user names, points, and number of children are extracted:

[–]SittingWave 665 points666 points667 points (73 children)
[–]barbouk 55 points56 points57 points (2 children)
[–]caltheon 15 points16 points17 points (0 children)
[–]TheSameTrain 19 points20 points21 points (0 children)
[–]WannaBeRichieRich 95 points96 points97 points (67 children)
[...]

Adding the --recall flag doesn't change anything.

Is it possible to manually specify which elements should be parsed?

adbar commented 6 months ago

You can add XPath expressions to remove elements, but you cannot explicitly add elements; that could be a useful improvement. As for Reddit, the extractor is not made for social networks; you could use Reddit datasets directly.

klvbdmh commented 6 months ago

> You can add XPath expressions to remove elements, but you cannot explicitly add elements; that could be a useful improvement.

Yes, that could be a great improvement. I see other issues where unusual elements contain desirable content (#573). An option like this could be better than hard-coding edge cases.
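
To illustrate the idea discussed in this thread, here is a minimal sketch of selecting content by XPath before extracting its text. It is not trafilatura's API and not a proposed implementation: the function name `extract_by_xpath`, the sample markup, and the class names `usertext` and `md` are all hypothetical, and the snippet is simplified, well-formed markup rather than real Reddit HTML. It uses only the standard library's ElementTree, which supports a limited XPath subset.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: comment bodies wrapped in <form> elements,
# loosely modelled on the situation described in this issue (not real Reddit HTML).
SAMPLE = """
<html><body>
  <div class="comment">
    <p class="tagline">user_a 10 points</p>
    <form class="usertext"><div class="md"><p>First comment body.</p></div></form>
  </div>
  <div class="comment">
    <p class="tagline">user_b 5 points</p>
    <form class="usertext"><div class="md"><p>Second comment,</p><p>two paragraphs.</p></div></form>
  </div>
</body></html>
"""


def extract_by_xpath(markup, xpath):
    """Return the normalized text of every element matched by `xpath`.

    Hypothetical helper: it only shows the "explicitly include elements" idea.
    """
    root = ET.fromstring(markup.strip())
    results = []
    for element in root.findall(xpath):
        # Join all text fragments below the element and collapse whitespace.
        text = " ".join(" ".join(element.itertext()).split())
        if text:
            results.append(text)
    return results


# Select only the comment bodies, even though they sit inside <form> elements.
comments = extract_by_xpath(SAMPLE, ".//form[@class='usertext']/div[@class='md']")
for comment in comments:
    print(comment)
```

ElementTree only parses well-formed XML, so real-world HTML would need an HTML-aware parser with fuller XPath support, such as lxml.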