adbar / trafilatura

Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
https://trafilatura.readthedocs.io
Apache License 2.0

Empty Results When Using Spider Function with Category URL #696

Open: felipehertzer opened this issue 2 months ago

felipehertzer commented 2 months ago

Hey @adbar,

I am currently testing the spider function, and I have encountered an issue when attempting to use a category URL to fetch posts specifically from that category.

Here is the code snippet that I am working with:

from trafilatura.spider import focused_crawler

spider_results, _ = focused_crawler(
  homepage="https://www.australiandefence.com.au/news/news",
  max_seen_urls=1,
  max_known_urls=50,
  prune_xpath="//header | //footer",
)
print(spider_results)

The function returns empty results. After investigating, I believe the problem may lie in this line of code. I modified the line to:

if response.url not in homepage and response.url != "/":

This change resolved the issue, but it breaks the redirect function.
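
For illustration, the difference comes down to how "not in" behaves on a string: it is a substring test, so a bare path that appears inside the homepage URL no longer counts as a different URL. The values below are taken from my case; how the surrounding code uses the check is my reading of it, not something I have verified:

homepage = "https://www.australiandefence.com.au/news/news"
response_url = "/news/news"  # what the response reports back in my case

# exact comparison: the bare path counts as a different URL
print(response_url != homepage and response_url != "/")      # True
# substring test ("not in" on a string): the bare path is found inside the homepage
print(response_url not in homepage and response_url != "/")  # False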

Thank you.

adbar commented 2 months ago

Hi @felipehertzer, I cannot reproduce the issue: I get results for your example with the latest version of the code (from the GitHub repository). Did you make any other changes?

felipehertzer commented 2 months ago

Hey @adbar,

I have reinstalled it, but the issue persists.

When I run the following code, the variable new_base_url appears to be missing a value. Is this the same result you are getting?

from trafilatura.spider import probe_alternative_homepage

url = "https://www.australiandefence.com.au/news/news"
htmlstring, homepage, new_base_url = probe_alternative_homepage(url)
print(homepage, new_base_url)  # result = /news/news ''
if htmlstring and homepage and new_base_url:  # with new_base_url == '' this check never passes

adbar commented 2 months ago

I still cannot reproduce it: probe_alternative_homepage() works as expected and returns the HTML code, https://www.australiandefence.com.au/news/news and https://www.australiandefence.com.au.

Besides, the line you're suggesting, if response.url not in homepage and response.url != "/":, is equivalent to the one in the code.

I guess the probe_alternative_homepage() check could be skipped if the input is not a homepage but a subsection of a website, but that is a different issue. A guard along these lines could work (just a rough sketch of the heuristic, not tested against the spider code; see below).
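
from urllib.parse import urlsplit

def is_probable_homepage(url):
    # treat the URL as a homepage only if its path is empty or "/"
    return urlsplit(url).path in ("", "/")

print(is_probable_homepage("https://www.australiandefence.com.au"))            # True
print(is_probable_homepage("https://www.australiandefence.com.au/news/news"))  # False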

felipehertzer commented 2 months ago

Hello @adbar,

I apologise for the delayed response. I had some additional time to conduct further testing and identified the issue in the line below. I was able to fix it on my side by installing pycurl, since I had been using urllib3 2.2.3 as the fallback. While PyCurl functions correctly, urllib3 does not.

Specifically, it seems that the geturl() method is not returning the complete URL; it only returns the path, such as /news/news. In contrast, PyCurl correctly returns the full URL: https://www.australiandefence.com.au/news/news.

Here is the line of code in question: https://github.com/adbar/trafilatura/blob/f57ef0b64b4cf96904e377eb012ebb38f097c518/trafilatura/downloads.py#L205
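
As a possible workaround (just a sketch based on the values above, not a tested patch for downloads.py), the bare path could be resolved against the URL that was requested:

from urllib.parse import urljoin

requested_url = "https://www.australiandefence.com.au/news/news"
returned_url = "/news/news"  # what urllib3's geturl() gave back in my case

# if only a path comes back, resolve it against the requested URL
if not returned_url.startswith("http"):
    returned_url = urljoin(requested_url, returned_url)

print(returned_url)  # https://www.australiandefence.com.au/news/news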

adbar commented 1 month ago

Thanks for the details. This is tricky, it may be a bug in urllib3. How do you think we can solve this?