Open felipehertzer opened 2 months ago
Hi @felipehertzer, I cannot reproduce the issue; I get results for your example with the latest version of the code (from the GitHub repository). Did you make any other changes?
Hey @adbar,
I have reinstalled it, but the issue persists.
When I run the following code, the variable new_base_url appears to be missing a value. Is this the same result you are getting?
htmlstring, homepage, new_base_url = probe_alternative_homepage(url)
print(homepage, new_base_url) # result = /news/news ''
if htmlstring and homepage and new_base_url:
I still cannot reproduce it: probe_alternative_homepage() works as expected and returns the HTML code, https://www.australiandefence.com.au/news/news and https://www.australiandefence.com.au.
Besides, the line you're suggesting, if response.url not in homepage and response.url != "/":, is equivalent to the one in the code.
I guess the probe_alternative_homepage() check could be skipped when the input is not a homepage but a subsection of a website, but that is a different issue.
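For illustration only, here is a minimal sketch of how such a skip could be decided. The helper name is_homepage is hypothetical and not part of trafilatura; it simply assumes that a homepage is a URL whose path is empty or "/".

```python
from urllib.parse import urlsplit


def is_homepage(url: str) -> bool:
    """Hypothetical helper: treat a URL as a homepage if its path is empty or '/'."""
    path = urlsplit(url).path
    return path in ("", "/")


# A site root counts as a homepage, a category page does not.
print(is_homepage("https://www.australiandefence.com.au"))
print(is_homepage("https://www.australiandefence.com.au/news/news"))
```

Under this assumption, the homepage probe would only run when is_homepage() returns True for the input URL.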
Hello @adbar,
I apologise for the delayed response. I had some additional time to conduct further testing and identified the issue in the line below. I was able to fix it on my side by installing pycurl, because I had been using urllib3 2.2.3 instead. PyCurl works correctly, whereas urllib3 does not.
Specifically, it seems that the geturl() method is not returning the complete URL; it only returns the path, such as /news/news. In contrast, PyCurl correctly returns the full URL: https://www.australiandefence.com.au/news/news.
Here is the line of code in question: https://github.com/adbar/trafilatura/blob/f57ef0b64b4cf96904e377eb012ebb38f097c518/trafilatura/downloads.py#L205
Thanks for the details. This is tricky; it may be a bug in urllib3. How do you think we can solve this?
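One possible workaround, sketched here with the standard library only, is to resolve whatever URL the response reports against the URL that was originally requested. The function name resolve_final_url and its parameters are hypothetical and not part of trafilatura or urllib3; the sketch assumes the reported value is either a full URL or a bare path such as /news/news.

```python
from urllib.parse import urljoin, urlsplit


def resolve_final_url(requested_url: str, reported_url: str) -> str:
    """Hypothetical helper: return an absolute URL even if only a path was reported."""
    if not reported_url:
        # Nothing was reported, so fall back to the URL we asked for.
        return requested_url
    parts = urlsplit(reported_url)
    if parts.scheme and parts.netloc:
        # Already absolute (the behaviour observed with PyCurl).
        return reported_url
    # Path only (the behaviour observed here with urllib3): join it with the requested URL.
    return urljoin(requested_url, reported_url)


print(resolve_final_url("https://www.australiandefence.com.au/news/news", "/news/news"))
```

With this approach, a path-only value and a full URL should both end up as the same absolute URL, so the later homepage comparison would not depend on which download backend is installed.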
Hey @adbar,
I am currently testing the spider function, and I have encountered an issue when attempting to use a category URL to fetch posts specifically from that category.
Here is the code snippet that I am working with:
The function returns empty results. After investigating, I believe the problem may lie in this line of code. I modified the line to:
This change resolved the issue, but it breaks the redirect function.
Thank you.