adbar / trafilatura

Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
https://trafilatura.readthedocs.io
Apache License 2.0

Extract more text #488

Open vulinh48936 opened 5 months ago

vulinh48936 commented 5 months ago

For url = "https://www.aia.com/en/health-wellness/healthy-living/healthy-mind/Managing-financial-stress", I use:

downloaded = trafilatura.fetch_url(url)
trafilatura.bare_extraction(downloaded, url=url)

I get the text, and the result is good. However, it only contains the text under item 1., while the web page has text under items 1., 2., 3., 4. and 5.

Even though I used favor_recall=True, nothing changed.

Thank you, however, for this library, it really is better than bs4!

vulinh48936 commented 5 months ago

I just tried changing

if len(result_body) > 1:
    LOGGER.debug(expr)
    break

in the file https://github.com/adbar/trafilatura/blob/master/trafilatura/core.py, and then I could get all the text under items 1., 2., ....

Can anyone explain why the loop breaks when len(result_body) > 1?

Thank you.

adbar commented 5 months ago

Thank you for your feedback. The output is odd because the text is contained in a <div class="cmp-section__content"> element, which isn't matched by the rule-based XPath expressions because that class is rare or not really meaningful. So the extractor falls back to looking for text elements and gets confused because the original article uses multiple <div class="text"> elements where only one is expected.
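To illustrate the point about multiple <div class="text"> elements, here is a minimal sketch using only the standard library. The HTML fragment is made up for the example (it is not the actual page), and this is not trafilatura's code. It only shows how keeping the first match of an expression returns section 1 alone, while collecting every match returns all five.

```python
import xml.etree.ElementTree as ET

# Made-up fragment (assumption): five numbered sections, each wrapped in
# its own <div class="text">, inside a container with an unusual class.
PAGE = """
<html><body>
  <div class="cmp-section__content">
    <div class="text"><p>1. First tip</p></div>
    <div class="text"><p>2. Second tip</p></div>
    <div class="text"><p>3. Third tip</p></div>
    <div class="text"><p>4. Fourth tip</p></div>
    <div class="text"><p>5. Fifth tip</p></div>
  </div>
</body></html>
"""

TEXT_DIVS = ".//div[@class='text']"


def paragraphs(elements):
    """Collect the text of every <p> found inside the given elements."""
    return [p.text for el in elements for p in el.iter("p")]


root = ET.fromstring(PAGE)

# Keeping only the first match: the remaining four sections are lost.
first_texts = paragraphs([root.find(TEXT_DIVS)])

# Keeping every match: all five sections are recovered.
all_texts = paragraphs(root.findall(TEXT_DIVS))

print(first_texts)
print(all_texts)
```

On a real page, taking every match can also pull in unrelated blocks that happen to share the same class.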

My guess is that it's a similar problem to pages with multiple <article> elements. The len(result_body) > 1 check is used because adding more elements usually introduces noise (teasers at the bottom, unrelated text, etc.).
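The trade-off behind that early exit can be sketched roughly as follows. This is a simplified stand-in, not the code in core.py: the candidate expressions, the page fragment and the extract() helper are all hypothetical, and the sketch counts collected paragraphs rather than child elements of a result tree.

```python
import xml.etree.ElementTree as ET

# Made-up fragment (assumption): one real article plus an unrelated teaser.
PAGE = """<html><body>
<article><p>Main text, part one.</p><p>Main text, part two.</p></article>
<div class="teaser"><p>Unrelated teaser.</p></div>
</body></html>"""

# Hypothetical candidate expressions, ordered from most to least specific.
CANDIDATES = [".//main", ".//article", ".//div"]


def extract(root, stop_early=True):
    """Try each candidate in turn and collect paragraph text."""
    collected = []
    for expr in CANDIDATES:
        match = root.find(expr)  # first match of this expression only
        if match is None:
            continue
        collected.extend(p.text for p in match.iter("p"))
        # Stop as soon as one expression has produced enough content.
        if stop_early and len(collected) > 1:
            break
    return collected


root = ET.fromstring(PAGE)
with_break = extract(root, stop_early=True)
without_break = extract(root, stop_early=False)

print(with_break)
print(without_break)
```

With the early exit, only the two article paragraphs are kept. Without it, the loop goes on to the broader expression and the teaser ends up in the output, which is the kind of noise the check is meant to avoid.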

How to tackle these segments is an open question, see #432 and #487. Feel free to try something out and draft a pull request if you're interested.

felipehertzer commented 5 months ago

Hey @adbar, I have a similar problem with the site Stuff: only half of the content is extracted, because they use the class 'stuff-article', which is very odd. I tried adding 'or contains(@class, '-article')' and it worked, but I'm not sure how broadly that expression will match. Do you have any other suggestions?

Thank you.

adbar commented 5 months ago

@felipehertzer Can you try adding it to your PR in #509? ends-with(@class, '-article') could work, but I don't remember if it's supported by LXML.

felipehertzer commented 5 months ago

@adbar I tested ends-with and LXML doesn't seem to support it. Do you want me to include contains(@class, "-article")?
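For context, ends-with() is an XPath 2.0 function, while LXML evaluates XPath 1.0. To get a feel for how much broader contains() is, here is a small pure-Python comparison. The class values other than 'stuff-article' are invented for illustration, and the XPath 1.0 string at the end is the usual substring() idiom for emulating ends-with(), shown as a plain string and not executed here.

```python
SUFFIX = "-article"

# Sample class attribute values; only 'stuff-article' comes from the site
# discussed above, the others are hypothetical.
CLASS_VALUES = [
    "stuff-article",
    "stuff-article-footer",
    "related-articles",
    "article",
]


def like_contains(value, needle):
    """Mimics XPath contains(): true if needle occurs anywhere in value."""
    return needle in value


def like_ends_with(value, suffix):
    """Mimics XPath 2.0 ends-with(): true only for a trailing match."""
    return value.endswith(suffix)


contains_hits = [c for c in CLASS_VALUES if like_contains(c, SUFFIX)]
ends_with_hits = [c for c in CLASS_VALUES if like_ends_with(c, SUFFIX)]

print(contains_hits)
print(ends_with_hits)

# XPath 1.0 idiom that emulates ends-with() using substring() and
# string-length(); kept as a string for reference, not evaluated here.
ENDS_WITH_XPATH_1_0 = (
    "substring(@class, string-length(@class) - string-length('-article') + 1)"
    " = '-article'"
)
```

The substring match also catches values such as 'stuff-article-footer' and 'related-articles', so it may be worth checking the results on a few other sites.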

adbar commented 5 months ago

@felipehertzer Yes, let's try that.