fhamborg / news-please

news-please - an integrated web crawler and information extractor for news that just works
Apache License 2.0

NewsPlease.from_urls behaves inconsistently in situations where a url results in 404 #243

Closed loganamcnichols closed 1 year ago

loganamcnichols commented 1 year ago

Mandatory

Describe the bug

NewsPlease.from_urls behaves inconsistently when a URL results in a 404, and it does not behave as its docstring suggests.

  1. If passed a single url which results in 404, it returns an empty dictionary.
  2. If passed multiple urls, one of which results in 404, it throws an error.

To Reproduce

from newsplease import NewsPlease

url_1 = "https://channelnomics.com/2018/03/two-realities-truth-and-fact-and-theyre-not-the-same/"
url_2 = "https://www.washingtonpost.com/politics/2020/12/04/trump-spending-properties/"
print(NewsPlease.from_urls([url_1]))
print(NewsPlease.from_urls([url_2]))
print(NewsPlease.from_urls([url_1, url_2]))

Expected behavior

not a 200 response: 404
{"https://channelnomics.com/2018/03/two-realities-truth-and-fact-and-theyre-not-the-same/": None}
{'https://www.washingtonpost.com/politics/2020/12/04/trump-spending-properties/': <NewsArticle.NewsArticle object>}
not a 200 response: 404
{"https://channelnomics.com/2018/03/two-realities-truth-and-fact-and-theyre-not-the-same/": None,
'https://www.washingtonpost.com/politics/2020/12/04/trump-spending-properties/': <NewsArticle.NewsArticle object>}
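Until the library behaves this way, the expected output above can be approximated with a small wrapper. This is only a sketch of a workaround, not the library's own fix: `fetch_each` and its `fetch_batch` parameter are hypothetical names, and the sketch assumes that calling from_urls with a single failing URL returns an empty dictionary, as reported in this issue.

```python
from typing import Any, Callable, Dict, Iterable, List, Optional


def fetch_each(
    urls: Iterable[str],
    fetch_batch: Callable[[List[str]], Dict[str, Any]],
) -> Dict[str, Optional[Any]]:
    """Fetch URLs one at a time so a single failure cannot break the batch.

    `fetch_batch` is any callable that takes a list of URLs and returns a
    dict keyed by URL (hypothetical parameter; NewsPlease.from_urls is the
    intended candidate).
    """
    results: Dict[str, Optional[Any]] = {}
    for url in urls:
        try:
            # One URL per call isolates failures to that URL.
            batch = fetch_batch([url])
        except Exception:
            # Deliberately broad: any failure is treated as "no article".
            batch = {}
        # A URL missing from the returned dict maps to None.
        results[url] = batch.get(url)
    return results


# Intended usage (assumes news-please is installed):
# from newsplease import NewsPlease
# print(fetch_each([url_1, url_2], NewsPlease.from_urls))
```

The trade-off is that URLs are fetched sequentially instead of in one batch, so this is slower for long lists, and the broad `except` hides the underlying error.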

Log

not a 200 response: 404
{}
{'https://www.washingtonpost.com/politics/2020/12/04/trump-spending-properties/': <NewsArticle.NewsArticle object at 0x7f21e9364c50>}
not a 200 response: 404
Traceback (most recent call last):
  File "/home/ubuntu/miniconda3/envs/autocast/lib/python3.11/runpy.py", line 198, in _run_module_as_main
    return _run_code(code, main_globals, None,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/miniconda3/envs/autocast/lib/python3.11/runpy.py", line 88, in _run_code
    exec(code, run_globals)
  File "/home/ubuntu/.vscode-server/extensions/ms-python.python-2023.8.0/pythonFiles/lib/python/debugpy/adapter/../../debugpy/launcher/../../debugpy/__main__.py", line 39, in <module>
    cli.main()
  File "/home/ubuntu/.vscode-server/extensions/ms-python.python-2023.8.0/pythonFiles/lib/python/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py", line 430, in main
    run()
  File "/home/ubuntu/.vscode-server/extensions/ms-python.python-2023.8.0/pythonFiles/lib/python/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py", line 284, in run_file
    runpy.run_path(target, run_name="__main__")
  File "/home/ubuntu/.vscode-server/extensions/ms-python.python-2023.8.0/pythonFiles/lib/python/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 321, in run_path
    return _run_module_code(code, init_globals, run_name,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/.vscode-server/extensions/ms-python.python-2023.8.0/pythonFiles/lib/python/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 135, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "/home/ubuntu/.vscode-server/extensions/ms-python.python-2023.8.0/pythonFiles/lib/python/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 124, in _run_code
    exec(code, run_globals)
  File "/home/ubuntu/code/autocast/autocast_experiments/data/test.py", line 7, in <module>
    print(NewsPlease.from_urls([url_1, url_2]))
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/miniconda3/envs/autocast/lib/python3.11/site-packages/newsplease/__init__.py", line 145, in from_urls
    results[url] = NewsPlease.from_html(results[url], url, download_date)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/miniconda3/envs/autocast/lib/python3.11/site-packages/newsplease/__init__.py", line 103, in from_html
    item = extractor.extract(item)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/miniconda3/envs/autocast/lib/python3.11/site-packages/newsplease/pipeline/extractor/article_extractor.py", line 63, in extract
    article_candidate = extractor.extract(item)
                        ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/miniconda3/envs/autocast/lib/python3.11/site-packages/newsplease/pipeline/extractor/extractors/newspaper_extractor.py", line 36, in extract
    article.parse()
  File "/home/ubuntu/miniconda3/envs/autocast/lib/python3.11/site-packages/newspaper/article.py", line 191, in parse
    self.throw_if_not_downloaded_verbose()
  File "/home/ubuntu/miniconda3/envs/autocast/lib/python3.11/site-packages/newspaper/article.py", line 531, in throw_if_not_downloaded_verbose
    raise ArticleException('Article `download()` failed with %s on URL %s' %
newspaper.article.ArticleException: Article `download()` failed with No connection adapters were found for '://' on URL ://
