cocrawler / cdx_toolkit

A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine
Apache License 2.0

"ValueError('invalid hostname in url '+url) from None" when accessing internet archive CaptureObject.content #13

Closed codekoriko closed 4 years ago

codekoriko commented 4 years ago

It seems to happen only with ia as a source, not with cc. It is also quite rare: I'd say once every 5000-8000 accesses of a CaptureObject's content attribute.

My code that triggers it:

for obj in cdx.iter(url=url_pattern, 
            from_ts=self.date_range[0], 
            to=self.date_range[1], 
            filter=self.filter):
    with open(f"{html_dir}/{obj.data['digest']}.html", mode="wb") as f_w:
        f_w.write(obj.content)

Here is the full traceback:

Traceback (most recent call last):
  File "/home/user/anaconda3/envs/SEER-env/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/home/user/anaconda3/envs/SEER-env/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/home/user/SEER/utils/cdx_retriever.py", line 103, in _retrieve_content
    f_w.write(obj.content)
  File "/home/user/anaconda3/envs/SEER-env/lib/python3.7/site-packages/cdx_toolkit/__init__.py", line 122, in content
    self._content = self.fetch_warc_record().content_stream().read()
  File "/home/user/anaconda3/envs/SEER-env/lib/python3.7/site-packages/cdx_toolkit/__init__.py", line 107, in fetch_warc_record
    self.warc_record = fetch_wb_warc(self.data, wb=self.wb)
  File "/home/user/anaconda3/envs/SEER-env/lib/python3.7/site-packages/cdx_toolkit/warc.py", line 118, in fetch_wb_warc
    resp = myrequests_get(wb_url, **kwargs)
  File "/home/user/anaconda3/envs/SEER-env/lib/python3.7/site-packages/cdx_toolkit/myrequests.py", line 63, in myrequests_get
    raise ValueError('invalid hostname in url '+url) from None
ValueError: invalid hostname in url https://web.archive.org/web/20191120071924id_/https%3A//www.placeholder.com/news/articles/2018-04-26/article-news-content

wumpus commented 4 years ago

You're getting a DNS failure looking up "web.archive.org", not on the first lookup but on the Nth one, and that case is currently not retried. I'll have to add some code to distinguish "misconfigured CDX hostname that will never work" from "I've successfully fetched from this host before, so a DNS error should be retried".
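
The distinction described above could be sketched roughly as follows. This is only an illustration, not the code in cdx_toolkit: `DNSFailure`, `fetch_with_dns_retry`, the `seen_ok` set, and the `retries`/`delay` parameters are all hypothetical names made up for the example.

```python
import time


class DNSFailure(Exception):
    """Hypothetical stand-in for a name-resolution error from the HTTP layer."""


def fetch_with_dns_retry(fetch, hostname, seen_ok, retries=3, delay=0.0):
    """Call fetch(); retry DNS failures only for hosts that have worked before.

    fetch    -- zero-argument callable that performs the request
    hostname -- host the request goes to
    seen_ok  -- set of hostnames that have answered successfully at least once
    """
    attempt = 0
    while True:
        try:
            result = fetch()
        except DNSFailure:
            # A host that has never worked is probably misconfigured: fail fast.
            # A host that has worked before gets a bounded number of retries.
            if hostname not in seen_ok or attempt >= retries:
                raise ValueError('invalid hostname in url ' + hostname) from None
            attempt += 1
            time.sleep(delay * attempt)  # simple linear backoff
            continue
        seen_ok.add(hostname)  # remember that this host resolves
        return result
```

Seeding `seen_ok` with hostnames that are known to be valid would make even the very first lookup retryable for those hosts.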

wumpus commented 4 years ago

Version 0.9.28, just released, retries DNS failures for IA's and CC's known hostnames with enthusiasm.

Please open another issue if this doesn't work for you.
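
For anyone stuck on an older version, a caller-side workaround might look like the sketch below. It assumes, based on the traceback above, that the failure surfaces as a ValueError when `obj.content` is read and that reading the attribute again triggers a fresh fetch. `read_content_with_retry` and its `attempts`/`delay` parameters are hypothetical names, not part of cdx_toolkit.

```python
import time


def read_content_with_retry(obj, attempts=3, delay=1.0):
    """Read obj.content, retrying when the access raises ValueError.

    Assumption: a failed access leaves nothing cached on the object,
    so reading the attribute again performs a new fetch.
    """
    if attempts < 1:
        raise ValueError('attempts must be at least 1')
    last_error = None
    for attempt in range(attempts):
        try:
            return obj.content
        except ValueError as e:
            last_error = e
            if attempt < attempts - 1:
                time.sleep(delay * (attempt + 1))  # wait a little longer each time
    raise last_error
```

In the loop from the original report, this would replace the direct access, e.g. `f_w.write(read_content_with_retry(obj))`. Fetching the content before opening the output file would also avoid leaving an empty file behind when every attempt fails.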