Open vulinh48936 opened 5 months ago
I just tried changing
    if len(result_body) > 1:
        LOGGER.debug(expr)
        break
in the file https://github.com/adbar/trafilatura/blob/master/trafilatura/core.py, and with that change I could get all of the text, with indices 1., 2., and so on.
Can anyone explain why the loop breaks when len(result_body) > 1?
Thank you.
Thank you for your feedback. The output is weird because the text is contained in a <div class="cmp-section__content"> element, which the rule-based XPath expressions don't find because it is rare or not really meaningful. So the extractor looks for text elements and gets confused because the original article uses multiple <div class="text"> elements where only one is expected.
My guess is that it's a similar problem to multiple <article> elements. The len(result_body) > 1 check is used because adding more elements usually introduces noise (teasers at the bottom, unrelated text, etc.).
How to tackle these segments is an open question, see #432 and #487. Feel free to try something out and draft a pull request if you're interested.
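As a rough illustration of the trade-off described above, here is a toy sketch of an early exit over candidate expressions. It is not trafilatura's actual code: the page, the CANDIDATES list, and the extract() helper with its early_exit flag are all hypothetical, and the standard library's ElementTree stands in for LXML.

```python
import xml.etree.ElementTree as ET

# Toy page (hypothetical): the article is split across several containers,
# followed by an unrelated teaser.
PAGE = """
<html><body>
  <div class="text"><p>1. First tip</p><p>Details for tip one.</p></div>
  <div class="extra"><p>2. Second tip</p><p>Details for tip two.</p></div>
  <div class="teaser"><p>Read next: unrelated story</p></div>
</body></html>
""".strip()

# Hypothetical candidate expressions, tried in order of preference.
CANDIDATES = [
    ".//article",
    ".//div[@class='text']",
    ".//div[@class='extra']",
    ".//div[@class='teaser']",
]


def extract(tree, early_exit=True):
    """Collect paragraph texts from the first match of each candidate."""
    result_body = []
    for expr in CANDIDATES:
        subtree = tree.find(expr)  # first matching element, or None
        if subtree is None:
            continue
        result_body.extend(p.text for p in subtree.iter("p") if p.text)
        # Early exit: once some content was found, skip the remaining
        # expressions to avoid pulling in noise.
        if early_exit and len(result_body) > 1:
            break
    return result_body


tree = ET.fromstring(PAGE)
print(extract(tree))                    # content of the first container only
print(extract(tree, early_exit=False))  # everything, teaser included
```

With the early exit, only the first container contributes, so section 2 is lost. Without it, section 2 is recovered, but the teaser comes along too.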
Hey @adbar, I have a similar problem with the site Stuff: only half of the content is extracted because they use the class 'stuff-article', which is very odd. I tried adding 'or contains(@class, '-article')' and it worked, but I'm not sure how broadly it will match. Do you have any other suggestions?
Thank you.
@felipehertzer Can you try adding it to your PR in #509? ends-with(@class, '-article') could work, but I don't remember if it's supported by LXML.
@adbar I tested ends-with and LXML doesn't seem to support it. Do you want me to include contains(@class, "-article") instead?
@felipehertzer Yes, let's try that.
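For anyone reading along: ends-with() is an XPath 2.0 function, and LXML implements XPath 1.0, which explains the failure. Below is a small sketch comparing contains() with an XPath 1.0 emulation of ends-with() built from substring() and string-length(). The HTML snippet and its class names (other than 'stuff-article') are made up for illustration, and this is not what the PR does.

```python
from lxml import etree, html

# Made-up markup: one real article container and one look-alike.
doc = html.fromstring(
    '<div>'
    '<div class="stuff-article">main story</div>'
    '<div class="related-articles">teaser</div>'
    '<div class="sidebar">menu</div>'
    '</div>'
)

# ends-with() is not available in XPath 1.0, so the query fails.
try:
    doc.xpath('.//div[ends-with(@class, "-article")]')
    ends_with_supported = True
except etree.XPathEvalError:
    ends_with_supported = False

# Substring match: also catches "related-articles".
broad = doc.xpath('.//div[contains(@class, "-article")]')

# XPath 1.0 emulation of ends-with(@class, $s): compare the tail of the
# attribute value, of the same length as the suffix, with the suffix.
narrow = doc.xpath(
    './/div[substring(@class, string-length(@class)'
    ' - string-length($s) + 1) = $s]',
    s="-article",
)

print(ends_with_supported)
print([d.get("class") for d in broad])
print([d.get("class") for d in narrow])
```

Note that the emulation only looks at the end of the whole attribute string, so it would miss class="stuff-article wide". In that respect contains() is more forgiving, at the cost of matching more broadly.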
For this URL I use:
    import trafilatura
    url = "https://www.aia.com/en/health-wellness/healthy-living/healthy-mind/Managing-financial-stress"
    downloaded = trafilatura.fetch_url(url)
    trafilatura.bare_extraction(downloaded, url=url)
I get the text, and the result is good. However, it only contains the text under index 1., while the website has text under indices 1., 2., 3., 4. and 5.
Using favor_recall=True did not change anything.
Thank you for this library, though; it really is better than bs4!