adbar / trafilatura

Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
https://trafilatura.readthedocs.io
Apache License 2.0

include_links breaks the extraction for https://news.ycombinator.com #411

Open shivanker opened 1 year ago

shivanker commented 1 year ago

Just as the title says. Attaching screenshot as an example.

Screenshot 2023-08-28 at 16 22 11

adbar commented 1 year ago

Hi @shivanker, extraction of main content from what is actually a summary page is tricky, but there is a bug here indeed.

HammadRafique29 commented 1 year ago

I went through the source code and found (on Windows) that the include_links feature itself works fine. The only problem is that the base_url being passed is somehow None.

image

I printed the target link, which looks like this:

image

As you can see above, there is no base_url (which is needed to turn a relative URL into a full one), so the link stays relative.
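
To illustrate why a missing base URL leaves links relative, here is a minimal sketch using only the standard library. It is not trafilatura's internal code; `resolve_link` and the sample `href` are hypothetical names chosen for this example.

```python
from urllib.parse import urljoin


def resolve_link(href, base_url=None):
    """Return an absolute URL when a base is known, else the href unchanged."""
    if base_url is None:
        # Nothing to resolve against: the relative link is kept as-is.
        return href
    return urljoin(base_url, href)


relative = "item?id=1"  # hypothetical relative link, as found on a listing page

print(resolve_link(relative))
print(resolve_link(relative, "https://news.ycombinator.com/"))
```

With no base the link is returned untouched. Once a base is supplied it resolves against the site root.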

You can pass the url parameter to get the full URL:

image
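
The screenshot is not reproduced here, so the following is only a rough sketch of what such a call might look like, assuming trafilatura's `extract` function with its `url` and `include_links` arguments. The wrapper name `extract_with_links` is hypothetical.

```python
def extract_with_links(html, page_url):
    """Extract the main text, keeping links, with the page URL given explicitly."""
    # Imported here so the sketch can be loaded without trafilatura installed.
    from trafilatura import extract

    # Passing url gives the extractor a base for resolving relative links.
    return extract(html, url=page_url, include_links=True)


# Possible usage (requires network access and trafilatura):
# from trafilatura import fetch_url
# html = fetch_url("https://news.ycombinator.com")
# print(extract_with_links(html, "https://news.ycombinator.com"))
```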

Here is the output of the above code:

image