Open adbar opened 6 months ago
Usually the bottom section of a page contains unwanted links, but here it holds actual content. Relevant parts are missing, especially with include_links on.
Thanks for opening a bug report for this. Since you mention "unwanted links", I am also wondering whether there are settings I can play around with. The missing content might be related to a bug there, as these links could be wrongly detected as unwanted.
You could try passing favor_recall=True as a parameter to the extraction function.
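To see whether the setting helps on a given page, one option is to run the extraction twice and compare the results. The following is a minimal sketch, assuming "the extraction function" means trafilatura's `extract`; the helper name `compare_extractions` is made up for this example.

```python
from typing import Dict, Optional


def compare_extractions(html: str) -> Dict[str, Optional[str]]:
    """Extract the same HTML with default settings and with favor_recall=True.

    Hypothetical helper for illustration; extract() returns None when
    nothing usable is found, so both values may be None.
    """
    # Imported here so the module still loads where trafilatura is not installed.
    from trafilatura import extract

    return {
        "default": extract(html, include_links=True),
        "favor_recall": extract(html, include_links=True, favor_recall=True),
    }
```

If the bottom section only shows up under the "favor_recall" key, the default filtering is what removes it.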
The approach is inherently limited, since fixed thresholds cannot work for every page. The culprit would be here: https://github.com/adbar/trafilatura/blob/3d0c934cbceb06ddd7b5d82a4eaaa1a2a1655318/trafilatura/htmlprocessing.py#L184
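To illustrate why a fixed threshold can misfire, here is a simplified sketch of a link-density check. It is not trafilatura's code: the function name, the 0.5 threshold, and the character counts are all assumptions chosen for the example, and the linked source may work differently.

```python
def looks_like_boilerplate(total_chars: int, link_chars: int,
                           threshold: float = 0.5) -> bool:
    """Flag a section when the share of its text inside links exceeds a fixed threshold.

    Hypothetical heuristic for illustration only.
    """
    if total_chars == 0:
        return True  # an empty section carries no content
    return link_chars / total_chars > threshold


# A navigation footer: nearly all text sits inside links, so it is flagged.
nav_footer = looks_like_boilerplate(total_chars=120, link_chars=110)

# A link-heavy section with real content is flagged as well,
# which is the kind of false positive described in this issue.
reading_list = looks_like_boilerplate(total_chars=400, link_chars=260)

# An ordinary paragraph with a few links passes the check.
paragraph = looks_like_boilerplate(total_chars=400, link_chars=100)
```

Whatever value is chosen for the cutoff, some pages will have genuine content above it and some boilerplate below it.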
Discussed in https://github.com/adbar/trafilatura/discussions/516