adbar / trafilatura

Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
https://trafilatura.readthedocs.io
Apache License 2.0

Link section missed at bottom of page #518

Open adbar opened 6 months ago

adbar commented 6 months ago

Discussed in https://github.com/adbar/trafilatura/discussions/516

Originally posted by **mertdeveci5** February 29, 2024

I read that this might be a feature request hence sharing here if someone figured it out.

On using `extract`, I use `include_links=True`. However the links in the website are not scraped for some reason. Not sure if I am using this in the wrong way so would appreciate anyone pointing me into the right direction.

Example:

```
# import the necessary functions
from trafilatura import fetch_url, extract, sitemaps
from rich import print as rprint

# grab a HTML file to extract data from
URL = "https://jam.dev/careers"
downloaded = fetch_url(URL)
sitemap = sitemaps.sitemap_search(URL)

# output main content and comments as plain text
result = extract(downloaded)

# change the output format to XML (allowing for preservation of document structure)
result = extract(downloaded, include_links=True, output_format="xml")

# discard potential comment and change the output to JSON
extract(downloaded, output_format="json", include_comments=False)

rprint(sitemap)
rprint(result)
```

Here most of the text is scraped EXCEPT the part where job listings are listed. It is critical to get this content though.

adbar commented 6 months ago

Usually the bottom section contains unwanted links; here, however, there is actual content to be found. Especially with `include_links` on, relevant parts are missing.

mertdeveci5 commented 6 months ago

Thanks for opening a bug issue for this. I am also wondering whether there are settings I can play around with, since you mention "unwanted links". This might be related to a bug there, as it could be that these links are detected as unwanted.

adbar commented 6 months ago

You could try `favor_recall=True` as a parameter to the extraction function.
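
One way to check whether this suggestion helps is to run the extraction twice, with and without `favor_recall`, and compare how much comes back. The sketch below is only an illustration: `compare_recall` and its `extract_fn` argument are hypothetical names, and `extract_fn` is assumed to behave like trafilatura's `extract` (HTML in, string or `None` out).

```python
from typing import Callable, Optional


def compare_recall(html: str, extract_fn: Callable[..., Optional[str]]) -> dict:
    """Run the extractor with and without favor_recall and report output sizes.

    extract_fn is assumed to accept the same keyword arguments as
    trafilatura.extract and to return None when nothing is extracted.
    """
    default = extract_fn(html, include_links=True, output_format="xml")
    recall = extract_fn(
        html, include_links=True, output_format="xml", favor_recall=True
    )
    return {
        "default_chars": len(default or ""),  # None counts as empty output
        "recall_chars": len(recall or ""),
        "recall_output": recall,
    }
```

With trafilatura installed, a call could look like `compare_recall(fetch_url(URL), extract)`, reusing `fetch_url`, `extract` and `URL` from the example in the original post. A larger `recall_chars` only shows that more text was kept, so the output still has to be inspected for the job listings.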

The culprit would be here. The approach is obviously limited, as fixed thresholds cannot work all the time: https://github.com/adbar/trafilatura/blob/3d0c934cbceb06ddd7b5d82a4eaaa1a2a1655318/trafilatura/htmlprocessing.py#L184
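
To illustrate why a fixed threshold on link density can drop real content, here is a simplified standard-library sketch. It is not the code at the link above: the function names and the `0.8` cut-off are assumptions chosen for the illustration. It shows that a footer made of navigation links and a block of job listings look identical to such a test.

```python
from html.parser import HTMLParser


class LinkTextCounter(HTMLParser):
    """Count text characters inside and outside <a> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # nesting level of currently open <a> tags
        self.link_chars = 0   # characters of text found inside links
        self.total_chars = 0  # characters of text overall

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        n = len(data.strip())
        self.total_chars += n
        if self.depth > 0:
            self.link_chars += n


def looks_like_boilerplate(fragment: str, max_ratio: float = 0.8) -> bool:
    """Hypothetical fixed-threshold test: True if most of the text is link text."""
    counter = LinkTextCounter()
    counter.feed(fragment)
    counter.close()
    if counter.total_chars == 0:
        return False
    return counter.link_chars / counter.total_chars > max_ratio


footer = '<div><a href="/about">About</a> <a href="/terms">Terms</a></div>'
jobs = '<div><a href="/jobs/1">Backend Engineer</a> <a href="/jobs/2">Product Designer</a></div>'
article = '<div>We are hiring for several roles, see <a href="/jobs">all jobs</a>.</div>'

print(looks_like_boilerplate(footer))   # link-only footer
print(looks_like_boilerplate(jobs))     # link-only job listings, same verdict
print(looks_like_boilerplate(article))  # mostly plain text
```

Both link-only blocks are flagged while the prose paragraph is kept, so no single ratio can separate the footer from the listings. Any fix has to rely on more context than the ratio itself.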