fhamborg / news-please

news-please - an integrated web crawler and information extractor for news that just works
Apache License 2.0

fixes #172 and #169: NewsPlease.from_urls() - use multiprocessing #252

Closed. arcolife closed this 1 year ago.

arcolife commented 1 year ago

also, do not attempt to extract info from empty html responses

ref: https://github.com/fhamborg/news-please/pull/173#issuecomment-1697734223
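
For context, here is a rough sketch of the idea in the PR title, not the actual diff: fetch the pages, skip empty HTML responses before they reach the extractor, and fan the extraction work out over a `multiprocessing` pool. `NewsPlease.from_html` is the library's existing single-document entry point; the `fetch` helper and the use of `requests` here are illustrative only.

    from multiprocessing import Pool

    import requests
    from newsplease import NewsPlease

    def fetch(url):
        # hypothetical helper for this sketch; the actual PR handles downloading internally
        try:
            return url, requests.get(url, timeout=10).text
        except requests.RequestException:
            return url, None

    def extract(item):
        url, html = item
        if not html:  # do not attempt extraction on empty responses
            return url, None
        return url, NewsPlease.from_html(html, url=url)

    if __name__ == '__main__':
        urls = ['https://www.example.com/article-1', 'https://www.example.com/article-2']
        pages = [fetch(u) for u in urls]   # network-bound, kept serial here
        with Pool() as pool:               # parsing is CPU-bound, so parallelize it
            results = dict(pool.map(extract, pages))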

copperwiring commented 1 year ago

Thanks for the implementation, @arcolife. Code works for me.

This is my cpu core usage:

[screenshot: CPU core usage]

I don't know how many cores it can/should use in parallel, since HTML parsing is a quick process, but it does appear quicker.


@fhamborg There are times when we get a 503 error because of server issues. We could add a retry feature here, like below:

    # needs: import time, urllib.error, urllib.request (plus os and urllib.parse for __download below)
    def __download_with_retries(self, url, local_filepath, max_retries=10, retry_delay=5):
        retry_count = 0
        while retry_count < max_retries:
            try:
                urllib.request.urlretrieve(url, local_filepath, reporthook=self.__on_download_progress_update)
                return local_filepath
            except urllib.error.HTTPError as e:
                if e.code == 503:  # Service Unavailable
                    print(f'HTTP Error 503: Service Unavailable - Retrying in {retry_delay} seconds')
                    time.sleep(retry_delay)
                    retry_count += 1
                else:
                    raise  # Re-raise the exception if it's not a 503 error
        print('Max retries reached, unable to download file')
        return None

    def __download(self, path):
        """
        Download and save a file locally.
        :param path: Where to download from
        :return: File path name of the downloaded file
        """
        local_filename = urllib.parse.quote_plus(path)
        local_filepath = os.path.join(self.__local_download_dir_warc, local_filename)

        if os.path.isfile(local_filepath) and self.__reuse_previously_downloaded_files:
            self.__logger.info("found local file %s, not downloading again due to configuration", local_filepath)
            return local_filepath
        else:
            # cleanup
            try:
                os.remove(local_filepath)
            except OSError:
                pass

            # download
            if self.__s3_client:
                with open(local_filepath, 'wb') as file_obj:
                    self.__s3_client.download_fileobj(self.__cc_bucket, path, file_obj)
                return local_filepath
            else:
                url = self.__cc_base_url + path
                self.__logger.info('downloading %s (local: %s)', url, local_filepath)
                local_filepath = self.__download_with_retries(url, local_filepath)
                self.__logger.info('download completed, local file: %s', local_filepath)
                return local_filepath
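
One possible refinement (not implemented above, just an idea): back off exponentially between attempts, so a struggling server gets progressively more breathing room. A hypothetical variant of the retry loop, under the same import assumptions as above:

    # delays grow 5s, 10s, 20s, ... instead of a fixed retry_delay
    for attempt in range(max_retries):
        try:
            urllib.request.urlretrieve(url, local_filepath)
            return local_filepath
        except urllib.error.HTTPError as e:
            if e.code != 503:
                raise
            time.sleep(retry_delay * (2 ** attempt))
    print('Max retries reached, unable to download file')
    return None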

It's a small feature. Should it be added to this same PR, which @arcolife initiated, or should it be added separately (if it can be added at all)?

fhamborg commented 1 year ago

thanks for the PR! :)

arcolife commented 1 year ago

@copperwiring thanks for the confirmation!

re: CPU cores in your screenshot: maybe there weren't that many links? It's been a while since I ran this on a full-blown testbed, but it did use all the cores on my end last time (as you've seen in those other screenshots back then).

re: retry: you could add the retry in a separate PR if it turns out to be an actual issue :)