Thanks for the implementation, @arcolife. Code works for me.
This is my CPU core usage: [screenshot]
I don't know how many cores it can or should use in parallel, since HTML parsing is a quick process, but extraction does appear faster.
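For reference, parallel HTML parsing here generally means fanning records out over a process pool, so with only a handful of links most cores sit idle. A minimal, self-contained sketch of that pattern (`parse_html` and `records` are hypothetical stand-ins for illustration, not news-please's actual API):

```python
# Illustrative sketch only: fan HTML parsing out across CPU cores.
# `parse_html` and `records` are hypothetical stand-ins, not news-please's API.
from multiprocessing import Pool, cpu_count

def parse_html(record):
    # placeholder for per-record extraction work
    return len(record)

if __name__ == '__main__':
    records = ['<html>a</html>', '<html>bb</html>']
    # one worker per core; with few records, most cores stay idle,
    # which may explain low observed CPU usage
    with Pool(processes=cpu_count()) as pool:
        results = pool.map(parse_html, records)
    print(results)
```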
@fhamborg There are times when we get a 503 error because of a server issue. We could add a download retry feature, like below:
```python
# assumes the module already imports os, time, urllib.request, urllib.error, urllib.parse

def __download_with_retries(self, url, local_filepath, max_retries=10, retry_delay=5):
    """
    Download a file, retrying on HTTP 503 (Service Unavailable).

    :param url: Where to download from
    :param local_filepath: Where to save the downloaded file
    :return: File path of the downloaded file, or None if all retries failed
    """
    retry_count = 0
    while retry_count < max_retries:
        try:
            urllib.request.urlretrieve(url, local_filepath, reporthook=self.__on_download_progress_update)
            return local_filepath
        except urllib.error.HTTPError as e:
            if e.code == 503:  # Service Unavailable
                print(f'HTTP Error 503: Service Unavailable - Retrying in {retry_delay} seconds')
                time.sleep(retry_delay)
                retry_count += 1
            else:
                raise  # Re-raise the exception if it's not a 503 error
    print('Max retries reached, unable to download file')
    return None

def __download(self, path):
    """
    Download and save a file locally.

    :param path: Where to download from
    :return: File path name of the downloaded file
    """
    local_filename = urllib.parse.quote_plus(path)
    local_filepath = os.path.join(self.__local_download_dir_warc, local_filename)

    if os.path.isfile(local_filepath) and self.__reuse_previously_downloaded_files:
        self.__logger.info("found local file %s, not downloading again due to configuration", local_filepath)
        return local_filepath
    else:
        # cleanup
        try:
            os.remove(local_filepath)
        except OSError:
            pass

        # download
        if self.__s3_client:
            with open(local_filepath, 'wb') as file_obj:
                self.__s3_client.download_fileobj(self.__cc_bucket, path, file_obj)
            return local_filepath
        else:
            url = self.__cc_base_url + path
            self.__logger.info('downloading %s (local: %s)', url, local_filepath)
            # pass the full local path, not just the quoted filename
            local_filepath = self.__download_with_retries(url, local_filepath)
            self.__logger.info('download completed, local file: %s', local_filepath)
            return local_filepath
```
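As an aside, a fixed `retry_delay` is the simplest option; exponential backoff is often gentler on a server that is already returning 503s. A standalone sketch of that variant (hypothetical, not part of this PR):

```python
# Hypothetical standalone variant using exponential backoff; not part of the PR.
import time
import urllib.error
import urllib.request

def download_with_backoff(url, local_filepath, max_retries=5, base_delay=1.0):
    for attempt in range(max_retries):
        try:
            urllib.request.urlretrieve(url, local_filepath)
            return local_filepath
        except urllib.error.HTTPError as e:
            if e.code != 503:
                raise
            delay = base_delay * (2 ** attempt)  # 1s, 2s, 4s, 8s, ...
            print(f'HTTP 503, retrying in {delay:.0f}s '
                  f'(attempt {attempt + 1}/{max_retries})')
            time.sleep(delay)
    print('Max retries reached, unable to download file')
    return None
```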
It's a small feature. Should it be added to this same PR that @arcolife initiated, or in a separate one (if it should be added at all)?
thanks for the PR! :)
@copperwiring thanks for the confirmation!
re: CPU cores in your screenshot: maybe there weren't as many links to process? It's been a while since I ran this on a full-blown testbed, but it did use all the cores on my end last time (as you've seen in those other screenshots back then).
re: retry: you could add the retry in a separate PR, if it turns out to be an actual issue :)
also, do not attempt to extract info from empty HTML responses
ref: https://github.com/fhamborg/news-please/pull/173#issuecomment-1697734223
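For illustration, such a guard could be as small as the following sketch; `extract_article` is a hypothetical placeholder for the real extraction call, not news-please's actual API:

```python
# Minimal illustrative guard; `extract_article` is a hypothetical stand-in
# for news-please's actual extraction call.
def extract_article(html: str) -> dict:
    return {'length': len(html)}  # placeholder extraction

def maybe_extract(html: str):
    # skip empty or whitespace-only HTML responses entirely
    if not html or not html.strip():
        return None
    return extract_article(html)

print(maybe_extract(''))         # None: nothing to extract
print(maybe_extract('<html/>'))  # runs the (placeholder) extractor
```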