GateNLP / ultimate-sitemap-parser

Ultimate Website Sitemap Parser
https://mediacloud.org/

SSL Certificate error fix? #33

Open ma26yank opened 2 years ago

ma26yank commented 2 years ago

I was testing this package for a web crawler I am building, but at times it gives the error below. Is there an argument I need to pass, or is this a bug?

_IndexWebsiteSitemap(url=https://www.crummy.com/, sub_sitemaps=[InvalidSitemap(url=https://www.crummy.com/robots.txt, reason=Unable to fetch sitemap from https://www.crummy.com/robots.txt: HTTPSConnectionPool(host='www.crummy.com', port=443): Max retries exceeded with url: /robots.txt (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate in certificate chain (ssl.c:1131)'))))])

What I am trying is:

from usp.tree import sitemap_tree_for_homepage
tree = sitemap_tree_for_homepage("https://www.crummy.com")
print(tree)

japherwocky commented 1 year ago

You can write your own subclass of RequestsWebClient and override its get method so that it calls requests.get( ... , verify=False), which turns off certificate verification for those requests.

Then do something like sitemap_tree_for_homepage('https://www.crummy.com', web_client=MyClient())