Open klvbdmh opened 6 months ago
You can add XPath expressions to remove elements, but you cannot explicitly add elements; that could be a useful improvement. As for Reddit, the extractor is not made for social networks; you could use Reddit datasets directly.
> You can add XPath expressions to remove elements but you cannot explicitly add elements, that could be a useful improvement.
Yes, that could be a great improvement. I see other issues with unusual elements that have desirable content (#573). That could be better than hard-coding edge cases.
I can't parse comments on Reddit pages (old.reddit.com). The issue is likely due to the comments being encapsulated within `form` elements, which the parser may not be handling correctly.

Steps to reproduce

Run
trafilatura -u "https://old.reddit.com/r/programming/comments/1cnvy7y/how_stripe_prevents_double_payment_using/"

Expected Behavior

trafilatura should successfully parse and extract all visible Reddit comments.

Actual Behavior

Only user names, points, and number of children are extracted:

Adding a `--recall` flag doesn't change anything.

Is it possible to manually specify which elements should be parsed?
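
As long as elements cannot be explicitly included, one possible workaround is to collect the text inside `form` elements yourself and combine it with whatever trafilatura returns. The sketch below uses only Python's standard `html.parser` and is not part of trafilatura; the names `FormTextCollector` and `extract_form_text`, as well as the sample markup, are made up for illustration, and the real old.reddit.com markup may well differ.

```python
from html.parser import HTMLParser


class FormTextCollector(HTMLParser):
    """Collect the visible text found inside <form> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.form_depth = 0   # > 0 while the parser is inside at least one <form>
        self.current = []     # text fragments of the form currently being read
        self.texts = []       # one string per outermost <form> that contains text

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.form_depth += 1

    def handle_endtag(self, tag):
        if tag == "form" and self.form_depth > 0:
            self.form_depth -= 1
            if self.form_depth == 0:
                text = " ".join(self.current)
                if text:
                    self.texts.append(text)
                self.current = []

    def handle_data(self, data):
        # Keep only non-empty text inside a form, with whitespace normalised.
        if self.form_depth > 0 and data.strip():
            self.current.append(" ".join(data.split()))


def extract_form_text(html):
    """Return the text of every outermost <form> element in `html`."""
    parser = FormTextCollector()
    parser.feed(html)
    parser.close()
    return parser.texts


# Assumed, simplified markup: comment bodies wrapped in <form> elements.
SAMPLE = """
<div class="comment">
  <p class="tagline">some_user 12 points</p>
  <form class="usertext"><input type="hidden" name="thing_id" value="t1_x">
    <div class="md"><p>First comment body.</p></div>
  </form>
</div>
<div class="comment">
  <p class="tagline">other_user 3 points</p>
  <form class="usertext"><div class="md"><p>Second   comment
  body.</p></div></form>
</div>
"""

if __name__ == "__main__":
    for body in extract_form_text(SAMPLE):
        print(body)
```

On a real page you would fetch the HTML first and pass it to `extract_form_text`; a narrower filter (for example on the form's `class` attribute) would probably be needed to skip login or search forms.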