adbar / trafilatura

Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
https://trafilatura.readthedocs.io
Apache License 2.0

List items are being missed #431

Open alroythalus opened 12 months ago

alroythalus commented 12 months ago

This is how I am using trafilatura (1.6.2)

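    from trafilatura import extract

    # extract() returns a str, or None if nothing could be extracted,
    # so fall back to an empty string instead of joining the result.
    web_content = extract(
        original_html,
        include_formatting=True,
        include_tables=True,
        include_comments=False,
        include_links=False,
        output_format="xml",
    ) or ""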

It skips all the list items on this page: https://www.wipro.com/privacy-statement/

For example:

[image: Screenshot (726)]

alroythalus commented 12 months ago

I worked around this by adding favor_recall=True:

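from trafilatura import extract

# Same call as before, with favor_recall=True added.
# extract() may return None, hence the fallback to an empty string.
web_content = extract(
    original_html,
    include_formatting=True,
    include_tables=True,
    include_comments=False,
    include_links=False,
    output_format="xml",
    favor_recall=True,
) or ""

adbar commented 12 months ago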

I confirm that all lists are missing for this page: standard extraction fails on a structure of the type ul > li > span. Here is an example:

<ul>
<li><span class="content-h7-freight-text-pro" tabindex="0">You have the right to know what personal information we maintain about you</span></li>
</ul>

alroythalus commented 11 months ago

Should this fix be added?

adbar commented 11 months ago

Yes, feel free to draft a PR; otherwise I'll tackle this when I have time.

vbarbaresi commented 9 months ago

I couldn't reproduce the issue with a simple ul > li > span combination.

I was able to reproduce it with a more complex snippet: a longer text is needed to trigger the algorithm's heuristics:

import logging
import sys
from trafilatura import extract

logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)

text = (
    'formation we collect may be used to:</span></p>\n<ul>\n<li><span class="content-h7-freight-text-pro">provide information and services as requested by you.</span></li>\n<li><span class="content-h7-freight-text-pro">assess queries, requirements, and process requests for products and services.</span></li>\n<li><span class="content-h7-freight-text-pro">to provide subscription related services and information</span></li>\n<li><span class="content-h7-freight-text-pro">to enable you to download collaterals and Marketing materials.</span></li>\n<li><span class="content-h7-freight-text-pro">perform client communication, service, billing and administration.</span></li>\n<li><span class="content-h7-freight-text-pro">conduct data analysis.</span></li>\n<li><span class="content-h7-freight-text-pro">assess web site performance and usage analysis</span></li>\n<li><span class="content-h7-freight-text-pro">maintain leads</span></li>\n<li><span class="content-h7-freight-text-pro">run marketing or promotional campaigns</span></li>\n<li><span class="content-h7-freight-text-pro">create brand awareness</span></li>\n<li><span class="content-h7-freight-text-pro">provide better services and generate demand</span></li>\n<li><span class="content-h7-freight-text-pro">market products and services based on legitimate business interest under the applicable law; or</span></li>\n<li><span class="content-h7-freight-text-pro">conduct processing necessary to fulfil other contractual obligations for the individual.</span></li>\n</ul>\n<p><span class="content-h7-freight-text-pro">The legal basis may differ depending on applicable local laws, but generally we consider that our legitimate interests justify the processing; we find such interests to be justified considering that the data is limited to browsing activities related to what is considered business or professional related (our website does not offer any content directed to individual consumers as well as any content which might be used '
    'for any inferences about your private life habits or interests), we provide easy opt-out and limit the retention of data. The data that we may receive directly from you is completely voluntary and at your option. Where consent is required sp'
)

print(extract(text))  # <-- this does not work: list is removed
print("---")
print(extract(text[:-10]))  # <--- Remove 10 characters, extracting the list works

Here is the debug output in the failing case:

DEBUG:trafilatura.core:Recovering wild text elements
DEBUG:trafilatura.readability_lxml:Top 5: div 14.36
DEBUG:trafilatura.readability_lxml:Not removing ul of length 694
DEBUG:trafilatura.readability_lxml:Not removing div of length 1394
DEBUG:trafilatura.core:extracted length: 1394 (algorithm) 700 (extraction)
DEBUG:trafilatura.core:extraction values: 700 1394 for None
DEBUG:trafilatura.core:using custom extraction: None
DEBUG:trafilatura.core:not enough comments None

And here is the debug output in the OK case:

DEBUG:trafilatura.core:Recovering wild text elements
DEBUG:trafilatura.readability_lxml:Top 5: div 14.36
DEBUG:trafilatura.readability_lxml:Not removing ul of length 694
DEBUG:trafilatura.readability_lxml:Not removing div of length 1384
DEBUG:trafilatura.core:extracted length: 1384 (algorithm) 690 (extraction)
DEBUG:trafilatura.core:using generic algorithm: None
DEBUG:trafilatura.core:not enough comments None

The problem seems to be in this core heuristic: depending on the lengths, we fall into either the generic extraction case or the custom extraction case (the trigger condition being len_algo > 2 * len_text). I'm not familiar enough with this extraction logic to know if there is a way to tune this heuristic for this specific case.
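To make the tipping point concrete, here is a minimal sketch of that comparison alone, fed with the lengths from the two logs above. The function name is made up for illustration, and the real decision in trafilatura.core has more branches than this single condition, so treat it as a simplification rather than the library's code.

```python
def prefers_generic_algorithm(len_algo: int, len_text: int) -> bool:
    """Return True when the generic algorithm's output is more than
    twice as long as the custom extraction's output.

    Simplified, hypothetical stand-in for the condition quoted above.
    """
    return len_algo > 2 * len_text


# Lengths copied from the two debug logs above.
failing_case = prefers_generic_algorithm(len_algo=1394, len_text=700)
ok_case = prefers_generic_algorithm(len_algo=1384, len_text=690)

print(failing_case)  # 1394 > 1400 does not hold: custom extraction is kept
print(ok_case)       # 1384 > 1380 holds: generic algorithm is used
```

With these numbers the margin is only a few characters on either side of the factor of two, which would explain why trimming 10 characters from the input is enough to flip the outcome.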

I hope this helped nonetheless, if you have any pointers I can try to work on it.

adbar commented 9 months ago

A rule like len_algo > 2 * len_text is brittle but according to the benchmark it's mostly reliable. Edge cases like this one are an open question: How do we tackle them without impacting the rest?

tushar-srivastav commented 8 months ago

Could we extend the logic with an injectable custom strategy, something that controls the heuristic? That way we could implement our own strategies and maybe even use an AI-driven algorithm to improve extraction accuracy. Please let me know. Thank you.

adbar commented 8 months ago

So far the strategies are standard, "favor_recall" and "favor_precision", all offering a relatively good balance according to the benchmark. I don't plan to tweak it further but feel free to draft a PR if you're interested.
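For anyone who wants to check which of the three strategies keeps the lists on a given page, a small helper along these lines may be enough. The helper name and the returned dictionary keys are illustrative assumptions; only extract() and its favor_recall and favor_precision arguments come from trafilatura.

```python
def compare_strategies(html: str) -> dict:
    """Run the three extraction strategies on the same HTML.

    Hypothetical helper, not part of trafilatura. The import is done
    lazily so the function can be defined without the library installed.
    """
    from trafilatura import extract

    return {
        "standard": extract(html),
        "favor_recall": extract(html, favor_recall=True),
        "favor_precision": extract(html, favor_precision=True),
    }


def count_characters(results: dict) -> dict:
    """Map each strategy name to the length of its output (0 for None)."""
    return {name: len(text or "") for name, text in results.items()}
```

Calling count_characters(compare_strategies(original_html)) gives a quick view of how much text each strategy keeps, which is a rough first signal before reading the outputs themselves.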

tushar-srivastav commented 8 months ago

I tried both, but in one of my cases I am missing all the important cards and links (possibly because the cards contain little text), even though that content matters. Also, readability.js gives me the proper data, whereas trafilatura misses all the important info.