adbar / trafilatura

Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
https://trafilatura.readthedocs.io
Apache License 2.0

Links/URLs are not appearing when using extract #636

Closed alroythalus closed 3 weeks ago

alroythalus commented 3 months ago
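    from trafilatura import extract

    # web_content (the page HTML) and config are defined elsewhere
    extract(
        web_content,
        include_formatting=False,
        include_tables=True,
        include_comments=False,
        include_links=True,  # links are expected in the output
        output_format="xml",
        favor_recall=True,
        config=config,
    )  # type: ignore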

With this config, URLs are not showing up in the output. What is the issue, and how can it be fixed?

Sites tested: https://openai.com/policies/privacy-policy/ and https://docs.github.com/en/site-policy/privacy-policies/github-general-privacy-statement

@adbar

adbar commented 3 months ago

@alroythalus I just tested the GitHub example and the links are present in the XML output. Here is a small excerpt:

To remove content or information you have publicly posted, please submit a <ref target="https://support.github.com/contact/private-information">Private Information Removal request</ref>.

I cannot reproduce the bug. Can you check whether it works for you, or provide more information?
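
One way to check whether links survived extraction is to look for `<ref target="...">` elements in the XML string, since that is how the link appears in the excerpt above. The following is only a minimal sketch using the standard library: the helper name `list_links` is hypothetical, the `<p>` wrapper around the quoted fragment is an assumption made so that it parses on its own, and the code assumes the plain XML output carries no XML namespace.

```python
import xml.etree.ElementTree as ET


def list_links(xml_output: str) -> list[tuple[str, str]]:
    """Return (target, text) pairs for every <ref> element in an XML string.

    Hypothetical helper: assumes links are serialized as <ref target="...">
    without an XML namespace, as in the excerpt quoted in this thread.
    """
    root = ET.fromstring(xml_output)
    return [
        (ref.get("target", ""), "".join(ref.itertext()).strip())
        for ref in root.iter("ref")
    ]


# Fragment quoted from the reply above, wrapped in <p> (an assumption) so that
# it forms a well-formed standalone document.
sample = (
    "<p>To remove content or information you have publicly posted, "
    "please submit a "
    '<ref target="https://support.github.com/contact/private-information">'
    "Private Information Removal request</ref>.</p>"
)

links = list_links(sample)
for target, text in links:
    print(f"{text} -> {target}")
```

In practice, the string passed to `list_links` would be the return value of the `extract(...)` call from the original report, with `include_links=True` and `output_format="xml"`. An empty list would suggest the links were dropped before serialization rather than merely being hard to spot in the output.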