Link section missed at bottom of page

adbar commented 6 months ago

Discussed in https://github.com/adbar/trafilatura/discussions/516

*Originally posted by **mertdeveci5** February 29, 2024*

I read that this might be a feature request, so I am sharing it here in case someone has figured it out.

When calling `extract`, I pass `include_links=True`. However, the links on the page are not scraped for some reason. I am not sure whether I am using this the wrong way, so I would appreciate anyone pointing me in the right direction.

Example:

```
# import the necessary functions
from trafilatura import fetch_url, extract, sitemaps
from rich import print as rprint

# grab a HTML file to extract data from
URL = "https://jam.dev/careers"
downloaded = fetch_url(URL)
sitemap = sitemaps.sitemap_search(URL)

# output main content and comments as plain text
result = extract(downloaded)

# change the output format to XML (allowing for preservation of document structure)
result = extract(downloaded, include_links=True, output_format="xml")

# discard potential comment and change the output to JSON
extract(downloaded, output_format="json", include_comments=False)

rprint(sitemap)
rprint(result)
```

Here most of the text is scraped, except the part where the job listings appear. It is critical to get this content, though.

adbar commented 6 months ago

Usually the bottom section contains unwanted links, but here it holds actual content. Relevant parts are missing, especially with `include_links` turned on.

mertdeveci5 commented 6 months ago

Thanks for opening a bug issue for this. I am also wondering whether there are settings I can play around with, since you mention "unwanted links". This might be related to the bug, as these links could be detected as unwanted.

adbar commented 6 months ago

You could try `favor_recall=True` as a parameter to the extraction function.

The culprit would be here. The approach is obviously limited, as fixed thresholds cannot work all the time: https://github.com/adbar/trafilatura/blob/3d0c934cbceb06ddd7b5d82a4eaaa1a2a1655318/trafilatura/htmlprocessing.py#L184
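To illustrate why fixed thresholds can misfire, here is a simplified sketch of a link-density heuristic. It is not trafilatura's actual code: the function names and both threshold values are hypothetical, chosen only to show how a short block made up almost entirely of links (such as a job listing) can be classified as boilerplate.

```python
from xml.etree import ElementTree as ET

# Hypothetical thresholds, not the values used by trafilatura.
MIN_TEXT_LENGTH = 100   # blocks shorter than this are considered suspicious
MAX_LINK_SHARE = 0.8    # share of the text that sits inside <a> elements


def text_length(element):
    """Length of all text contained in the element, ignoring outer whitespace."""
    return len("".join(element.itertext()).strip())


def link_share(element):
    """Fraction of the element's text that belongs to links."""
    total = text_length(element)
    if total == 0:
        return 0.0
    linked = sum(text_length(a) for a in element.iter("a"))
    return linked / total


def looks_like_boilerplate(element):
    """Flag short blocks that consist mostly of link text."""
    return (
        text_length(element) < MIN_TEXT_LENGTH
        and link_share(element) >= MAX_LINK_SHARE
    )


jobs = ET.fromstring(
    '<ul><li><a href="/jobs/1">Backend Engineer</a></li>'
    '<li><a href="/jobs/2">Product Designer</a></li></ul>'
)
paragraph = ET.fromstring(
    "<p>We are hiring across engineering and design. Read more on "
    '<a href="/about">our team page</a> before applying to any of the '
    "open roles below.</p>"
)

print(looks_like_boilerplate(jobs))       # the link-only job list is flagged
print(looks_like_boilerplate(paragraph))  # the prose paragraph is kept
```

With cutoffs like these, a footer navigation menu and a list of job postings look identical, which is the kind of case where a fixed threshold discards real content.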

Link section missed at bottom of page #518