fhamborg / news-please

news-please - an integrated web crawler and information extractor for news that just works
Apache License 2.0

Exception on newsplease.examples.commoncrawl #79

Closed mehmetilker closed 5 years ago

mehmetilker commented 5 years ago

Describe the bug I cloned the repository, installed all the necessary libraries listed in requirements.txt plus a few others (such as hurry), and then tried to run newsplease.examples.commoncrawl. The last error I got is as follows:

(_env) C:\_Dev\_dev\temp\commoncrawl\news-please>py -m newsplease.examples.commoncrawl
INFO:newsplease.crawler.commoncrawl_crawler:executing: aws s3 ls --recursive s3://commoncrawl/crawl-data/CC-NEWS/ --no-sign-request > .tmpaws.txt && awk '{ print $4 }' .tmpaws.txt && rm .tmpaws.txt
INFO:newsplease.crawler.commoncrawl_crawler:found 2 files at commoncrawl.org
INFO:newsplease.crawler.commoncrawl_crawler:creating extraction process pool with 1 processes
INFO:newsplease.crawler.commoncrawl_extractor:downloading https://commoncrawl.s3.amazonaws.com/'awk' is not recognized as an internal or external command, (local: ./cc_download_warc/https%3A%2F%2Fcommoncrawl.s3.amazonaws.com%2F%27awk%27+is+not+recognized+as+an+internal+or+external+command%2C)
Traceback (most recent call last):
  File "C:\_Dev\_Programs\Python\Python36-32\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "C:\_Dev\_Programs\Python\Python36-32\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\_Dev\_dev\temp\commoncrawl\news-please\newsplease\examples\commoncrawl.py", line 123, in <module>
    continue_process=True)
  File "C:\_Dev\_dev\temp\commoncrawl\news-please\newsplease\crawler\commoncrawl_crawler.py", line 226, in crawl_from_commoncrawl
    log_pathname_fully_extracted_warcs=__log_pathname_fully_extracted_warcs)
  File "C:\_Dev\_dev\temp\commoncrawl\news-please\newsplease\crawler\commoncrawl_crawler.py", line 146, in __start_commoncrawl_extractor
    log_pathname_fully_extracted_warcs=__log_pathname_fully_extracted_warcs)
  File "C:\_Dev\_dev\temp\commoncrawl\news-please\newsplease\crawler\commoncrawl_extractor.py", line 328, in extract_from_commoncrawl
    self.__run()
  File "C:\_Dev\_dev\temp\commoncrawl\news-please\newsplease\crawler\commoncrawl_extractor.py", line 285, in __run
    local_path_name = self.__download(self.__warc_download_url)
  File "C:\_Dev\_dev\temp\commoncrawl\news-please\newsplease\crawler\commoncrawl_extractor.py", line 210, in __download
    urllib.request.urlretrieve(url, local_filepath, reporthook=self.__on_download_progress_update)
  File "C:\_Dev\_Programs\Python\Python36-32\lib\urllib\request.py", line 248, in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
  File "C:\_Dev\_Programs\Python\Python36-32\lib\urllib\request.py", line 223, in urlopen
    return opener.open(url, data, timeout)
  File "C:\_Dev\_Programs\Python\Python36-32\lib\urllib\request.py", line 532, in open
    response = meth(req, response)
  File "C:\_Dev\_Programs\Python\Python36-32\lib\urllib\request.py", line 642, in http_response
    'http', request, response, code, msg, hdrs)
  File "C:\_Dev\_Programs\Python\Python36-32\lib\urllib\request.py", line 570, in error
    return self._call_chain(*args)
  File "C:\_Dev\_Programs\Python\Python36-32\lib\urllib\request.py", line 504, in _call_chain
    result = func(*args)
  File "C:\_Dev\_Programs\Python\Python36-32\lib\urllib\request.py", line 650, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 505: HTTP Version not supported

I assume it is the same problem described in https://github.com/fhamborg/news-please/issues/36, so I tried to install awscli, but the installation only reported "Requirement already satisfied" for everything. When I ran the example again, I got the same exception.

To Reproduce

git clone https://github.com/fhamborg/news-please.git
cd news-please
--I changed some config values, such as the domain and date filters (see the sketch below)
python3 -m newsplease.examples.commoncrawl
--later installed the missing libraries
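
For context, the config values I changed are the filter variables near the top of newsplease/examples/commoncrawl.py; roughly like the sketch below (illustrative only, the exact variable names may differ between versions of the example script):

# sketch of the filter settings near the top of newsplease/examples/commoncrawl.py
# (illustrative; exact variable names may differ between versions)
import datetime

my_filter_valid_hosts = ['example.com']                # only keep articles from these hosts
my_filter_start_date = datetime.datetime(2019, 1, 1)   # skip articles published before this date
my_filter_end_date = datetime.datetime(2019, 2, 1)     # ...and after this date
my_filter_strict_date = True                           # drop articles without a parseable publish date
my_local_download_dir_warc = './cc_download_warc/'     # where downloaded WARC files are stored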

Expected behavior News content is downloaded from the specified domain within the specified date range.

Versions (please complete the following information):

fhamborg commented 5 years ago

As you mention, this issue is not related to news-please but to an unsuccessful installation of awscli, so I have to refer you to its installation docs or support. One note, though: this is not about the Python package, but about the aws CLI itself, which needs to be installed separately (e.g., on Ubuntu that would be apt install awscli). Please look up the corresponding installation routine for Windows, probably here: https://docs.aws.amazon.com/cli/latest/userguide/install-windows.html#install-msi-on-windows
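
If in doubt, a quick way to check from Python whether the required executables are visible on PATH (just a sketch, not part of news-please; the listing command in your log shells out to both aws and awk):

# quick check (not part of news-please): the listing step shells out to both
# `aws` and `awk`, so both must be resolvable on PATH for the example to work
import shutil

for tool in ("aws", "awk"):
    location = shutil.which(tool)
    print(tool, "->", location if location else "NOT FOUND on PATH")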

fhamborg commented 5 years ago

PS: Thanks for the issue, though! :-) I've added a brief explanation to the readme.md (in addition to the one contained in the example script, which some users may not have seen) that awscli needs to be installed.

mehmetilker commented 5 years ago

Thanks for the info. The problem wasn't about awscli; I have installed and verified it. The error was "'awk' is not recognized as an internal or external command", and awk is a default tool on Linux, I guess. So I installed gawk from http://gnuwin32.sourceforge.net/packages/gawk.htm and added it to PATH, but I still get "urllib.error.HTTPError: HTTP Error 505: HTTP Version not supported".
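
A possible Windows-friendly workaround (just a sketch of my own, not the crawler's actual code) would be to drop awk entirely and pick the fourth column of the aws s3 ls output in Python:

# sketch of an awk-free way to list the CC-NEWS WARC keys (illustrative
# workaround, not the crawler's actual implementation)
import subprocess

def list_cc_news_warc_keys():
    output = subprocess.check_output(
        ["aws", "s3", "ls", "--recursive",
         "s3://commoncrawl/crawl-data/CC-NEWS/", "--no-sign-request"],
        universal_newlines=True,
    )
    # each line looks like "<date> <time> <size> <key>"; keep only the key column
    return [line.split()[3] for line in output.splitlines() if line.strip()]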

mehmetilker commented 5 years ago

I think the problem is about URL construction, as some people stated here: https://stackoverflow.com/questions/23715943/python-http-error-505-http-version-not-supported

So urllib constructs the URL correctly on Ubuntu but not on Windows.

Another thing about the URL: the log says "INFO:newsplease.crawler.commoncrawl_extractor:downloading https://commoncrawl.s3.amazonaws.com/awk: '{ (local: ./cc_download_warc/https%3A%2F%2Fcommoncrawl.s3.amazonaws.com%2Fawk%3A+%27%7B)"

I think commoncrawl.s3.amazonaws.com/awk is not the right path; if awk is an app, it shouldn't be part of the URL.
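
That would match the log above: when the awk step fails, its error text apparently ends up in the list of WARC keys, and the crawler then joins it with the CommonCrawl base URL. A small illustration of what gets requested in that case (my assumption about the flow, based only on the log lines above):

# illustration only (assumption based on the log lines above): joining the
# base URL with the captured error text yields an invalid URL containing
# spaces and quotes, which is consistent with the HTTP 505 response
base_url = "https://commoncrawl.s3.amazonaws.com/"
bogus_key = "'awk' is not recognized as an internal or external command,"
print(base_url + bogus_key)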