[x] I searched other issues (including closed issues) and could not find any to be related. If you find related issues post them below or directly add your issue to the most related one.
[x] I confirm that this bug report does not report on a specific news site where news-please does not work. Please keep in mind that news-please is a generic crawler so it is expected that it will not work for all sites well or even at all.
Describe the bug
NewsPlease.from_urls behaves inconsistently when a URL returns a 404, and it does not behave as its docstring suggests.
To Reproduce
If passed a single URL that returns a 404, it returns an empty dictionary.
If passed multiple URLs, one of which returns a 404, it raises an error.
Expected behavior
not a 200 response: 404
{"https://channelnomics.com/2018/03/two-realities-truth-and-fact-and-theyre-not-the-same/": None}
{'https://www.washingtonpost.com/politics/2020/12/04/trump-spending-properties/': <NewsArticle.NewsArticle object>}
not a 200 response: 404
{"https://channelnomics.com/2018/03/two-realities-truth-and-fact-and-theyre-not-the-same/": None,
'https://www.washingtonpost.com/politics/2020/12/04/trump-spending-properties/': <NewsArticle.NewsArticle object>}
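Until the two cases behave the same way, a possible workaround is to fetch each URL on its own and map any failure to None, which gives the dictionary shape shown above. This is only a minimal sketch: `fetch_each_safely` and its `fetch_one` parameter are hypothetical names introduced here, not part of news-please, and the sketch assumes a per-URL fetcher such as `NewsPlease.from_url` is passed in.

```python
from typing import Callable, Dict, Iterable, Optional, TypeVar

T = TypeVar("T")


def fetch_each_safely(
    urls: Iterable[str],
    fetch_one: Callable[[str], Optional[T]],
) -> Dict[str, Optional[T]]:
    """Call fetch_one for every URL and never raise.

    Hypothetical helper (not part of news-please): a URL whose fetch
    raises is mapped to None, so every input URL appears as a key.
    """
    results: Dict[str, Optional[T]] = {}
    for url in urls:
        try:
            results[url] = fetch_one(url)
        except Exception:
            # Assumption: any failure (e.g. the ArticleException in the
            # log below) should be reported as None, not propagated.
            results[url] = None
    return results


# Possible usage, assuming news-please is installed:
#   from newsplease import NewsPlease
#   articles = fetch_each_safely([url_1, url_2], NewsPlease.from_url)
```

The trade-off is that URLs are fetched one at a time, so any batching that `from_urls` does internally is lost.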
Log
not a 200 response: 404
{}
{'https://www.washingtonpost.com/politics/2020/12/04/trump-spending-properties/': <NewsArticle.NewsArticle object at 0x7f21e9364c50>}
not a 200 response: 404
Traceback (most recent call last):
File "/home/ubuntu/miniconda3/envs/autocast/lib/python3.11/runpy.py", line 198, in _run_module_as_main
return _run_code(code, main_globals, None,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/miniconda3/envs/autocast/lib/python3.11/runpy.py", line 88, in _run_code
exec(code, run_globals)
File "/home/ubuntu/.vscode-server/extensions/ms-python.python-2023.8.0/pythonFiles/lib/python/debugpy/adapter/../../debugpy/launcher/../../debugpy/__main__.py", line 39, in <module>
cli.main()
File "/home/ubuntu/.vscode-server/extensions/ms-python.python-2023.8.0/pythonFiles/lib/python/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py", line 430, in main
run()
File "/home/ubuntu/.vscode-server/extensions/ms-python.python-2023.8.0/pythonFiles/lib/python/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py", line 284, in run_file
runpy.run_path(target, run_name="__main__")
File "/home/ubuntu/.vscode-server/extensions/ms-python.python-2023.8.0/pythonFiles/lib/python/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 321, in run_path
return _run_module_code(code, init_globals, run_name,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/.vscode-server/extensions/ms-python.python-2023.8.0/pythonFiles/lib/python/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 135, in _run_module_code
_run_code(code, mod_globals, init_globals,
File "/home/ubuntu/.vscode-server/extensions/ms-python.python-2023.8.0/pythonFiles/lib/python/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 124, in _run_code
exec(code, run_globals)
File "/home/ubuntu/code/autocast/autocast_experiments/data/test.py", line 7, in <module>
print(NewsPlease.from_urls([url_1, url_2]))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/miniconda3/envs/autocast/lib/python3.11/site-packages/newsplease/__init__.py", line 145, in from_urls
results[url] = NewsPlease.from_html(results[url], url, download_date)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/miniconda3/envs/autocast/lib/python3.11/site-packages/newsplease/__init__.py", line 103, in from_html
item = extractor.extract(item)
^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/miniconda3/envs/autocast/lib/python3.11/site-packages/newsplease/pipeline/extractor/article_extractor.py", line 63, in extract
article_candidate = extractor.extract(item)
^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/miniconda3/envs/autocast/lib/python3.11/site-packages/newsplease/pipeline/extractor/extractors/newspaper_extractor.py", line 36, in extract
article.parse()
File "/home/ubuntu/miniconda3/envs/autocast/lib/python3.11/site-packages/newspaper/article.py", line 191, in parse
self.throw_if_not_downloaded_verbose()
File "/home/ubuntu/miniconda3/envs/autocast/lib/python3.11/site-packages/newspaper/article.py", line 531, in throw_if_not_downloaded_verbose
raise ArticleException('Article `download()` failed with %s on URL %s' %
newspaper.article.ArticleException: Article `download()` failed with No connection adapters were found for '://' on URL ://
Versions (please complete the following information):
OS: Ubuntu 22.03
Python Version: 3.11
news-please: 1.5.33
Intent (optional; we'll use this info to prioritize upcoming tasks to work on)
[ ] personal
[x] academic
[ ] business
[ ] other
Some information on your project:
I am working on training an LLM to make accurate probabilistic forecasts on forecasting-tournament-style questions.