fhamborg / news-please

news-please - an integrated web crawler and information extractor for news that just works
Apache License 2.0

Cannot Install NewsPlease on fresh Ubuntu LTS #138

Closed tokotchd closed 4 years ago

tokotchd commented 4 years ago

Mandatory

Describe the bug: A cffi update appears to have broken the library. No error is printed; NewsPlease fails silently.

NewsPlease.from_url('any_url_here', timeout=30) always returns None.

To Reproduce: On a brand-new Ubuntu 18.04.4 LTS install, install python3 and pip3, then run pip3 install news-please. The installation completes without issues, but the call above always returns None and raises no exception.

Further investigation shows that reinstalling news-please causes a segmentation fault in pip3.
The fault occurs while cffi is being installed, so cffi is likely the broken dependency behind the failure.

Expected behavior: If cffi fails, I would expect an exception to be thrown rather than None being returned for the article contents.
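
Until the library raises on its own, the expected behavior described above can be approximated on the caller's side. The following is only a minimal sketch: fetch_or_raise and ExtractionError are hypothetical names that are not part of news-please, and the only library call assumed is NewsPlease.from_url(url, timeout=...) as used in this report. It cannot help with the segmentation fault during installation; it only turns a silent None into an exception.

```python
from typing import Any, Callable, Optional


class ExtractionError(RuntimeError):
    """Hypothetical error type: raised when no article comes back."""


def fetch_or_raise(url: str,
                   fetcher: Optional[Callable[..., Any]] = None,
                   timeout: int = 30) -> Any:
    """Call the extractor and raise instead of silently returning None."""
    if fetcher is None:
        # Imported lazily so the sketch loads even where news-please
        # (or one of its dependencies) is not importable.
        from newsplease import NewsPlease
        fetcher = NewsPlease.from_url
    article = fetcher(url, timeout=timeout)
    if article is None:
        raise ExtractionError("no article extracted from {!r}".format(url))
    return article
```

Calling fetch_or_raise('any_url_here') with the default fetcher goes through NewsPlease.from_url; passing a different callable as fetcher is only there to make the wrapper easy to exercise without network access.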

Log

Collecting news-please
Collecting lxml>=3.3.5 (from news-please)
  Using cached https://files.pythonhosted.org/packages/dd/ba/a0e6866057fc0bbd17192925c1d63a3b85cf522965de9bc02364d08e5b84/lxml-4.5.0-cp36-cp36m-manylinux1_x86_64.whl
Collecting bs4 (from news-please)
Collecting PyDispatcher>=2.0.5 (from news-please)
Collecting Scrapy>=1.1.0 (from news-please)
  Using cached https://files.pythonhosted.org/packages/3b/e4/69b87d7827abf03dea2ea984230d50f347b00a7a3897bc93f6ec3dafa494/Scrapy-1.8.0-py2.py3-none-any.whl
Collecting awscli>=1.11.117 (from news-please)
  Using cached https://files.pythonhosted.org/packages/b8/40/0d8bf5bbc9910ba5312b55f5962f76681387ecf296463fb9d96f6f68a516/awscli-1.18.3-py2.py3-none-any.whl
Collecting beautifulsoup4>=4.3.2 (from news-please)
  Using cached https://files.pythonhosted.org/packages/cb/a1/c698cf319e9cfed6b17376281bd0efc6bfc8465698f54170ef60a485ab5d/beautifulsoup4-4.8.2-py3-none-any.whl
Collecting PyMySQL>=0.7.9 (from news-please)
  Using cached https://files.pythonhosted.org/packages/ed/39/15045ae46f2a123019aa968dfcba0396c161c20f855f11dea6796bcaae95/PyMySQL-0.9.3-py2.py3-none-any.whl
Collecting hjson>=1.5.8 (from news-please)
Collecting plac>=0.9.6 (from news-please)
  Using cached https://files.pythonhosted.org/packages/86/85/40b8f66c2dd8f4fd9f09d59b22720cffecf1331e788b8a0cab5bafb353d1/plac-1.1.3-py2.py3-none-any.whl
Collecting newspaper3k>=0.2.8 (from news-please)
  Using cached https://files.pythonhosted.org/packages/d7/b9/51afecb35bb61b188a4b44868001de348a0e8134b4dfa00ffc191567c4b9/newspaper3k-0.2.8-py3-none-any.whl
Collecting elasticsearch>=2.4 (from news-please)
  Using cached https://files.pythonhosted.org/packages/10/60/0c79dde3e81beffeed422599d9ac65419289095186d37a3201739d52a57d/elasticsearch-7.5.1-py2.py3-none-any.whl
Collecting warcio>=1.3.3 (from news-please)
  Using cached https://files.pythonhosted.org/packages/90/c4/86bc02bc3bc33c34ab24e52af8a1c34eb6e03e7cd5b3904057ebcea311da/warcio-1.7.1-py2.py3-none-any.whl
Collecting readability-lxml>=0.6.2 (from news-please)
Collecting python-dateutil>=2.4.0 (from news-please)
  Using cached https://files.pythonhosted.org/packages/d4/70/d60450c3dd48ef87586924207ae8907090de0b306af2bce5d134d78615cb/python_dateutil-2.8.1-py2.py3-none-any.whl
Collecting six>=1.10.0 (from news-please)
  Using cached https://files.pythonhosted.org/packages/65/eb/1f97cb97bfc2390a276969c6fae16075da282f5058082d4cb10c6c5c1dba/six-1.14.0-py2.py3-none-any.whl
Collecting hurry.filesize>=0.9 (from news-please)
Collecting langdetect>=1.0.7 (from news-please)
Collecting dotmap>=1.2.17 (from news-please)
  Using cached https://files.pythonhosted.org/packages/41/64/a63c863b674b3ce90af32632a1bec59e8b0d64c5afa9782ab7e5a5b6b33e/dotmap-1.3.13-py3-none-any.whl
Collecting ago>=0.0.9 (from news-please)
Collecting pyOpenSSL>=16.2.0 (from Scrapy>=1.1.0->news-please)
  Using cached https://files.pythonhosted.org/packages/9e/de/f8342b68fa9e981d348039954657bdf681b2ab93de27443be51865ffa310/pyOpenSSL-19.1.0-py2.py3-none-any.whl
Collecting cssselect>=0.9.1 (from Scrapy>=1.1.0->news-please)
  Using cached https://files.pythonhosted.org/packages/3b/d4/3b5c17f00cce85b9a1e6f91096e1cc8e8ede2e1be8e96b87ce1ed09e92c5/cssselect-1.1.0-py2.py3-none-any.whl
Collecting parsel>=1.5.0 (from Scrapy>=1.1.0->news-please)
  Using cached https://files.pythonhosted.org/packages/86/c8/fc5a2f9376066905dfcca334da2a25842aedfda142c0424722e7c497798b/parsel-1.5.2-py2.py3-none-any.whl
Collecting zope.interface>=4.1.3 (from Scrapy>=1.1.0->news-please)
  Using cached https://files.pythonhosted.org/packages/16/1c/d9e4d1e4eb9777ae675c5ac01290e70012498944d5e743bd2777d1096ad7/zope.interface-4.7.1-cp36-cp36m-manylinux1_x86_64.whl
Collecting service-identity>=16.0.0 (from Scrapy>=1.1.0->news-please)
  Using cached https://files.pythonhosted.org/packages/e9/7c/2195b890023e098f9618d43ebc337d83c8b38d414326685339eb024db2f6/service_identity-18.1.0-py2.py3-none-any.whl
Collecting queuelib>=1.4.2 (from Scrapy>=1.1.0->news-please)
  Using cached https://files.pythonhosted.org/packages/4c/85/ae64e9145f39dd6d14f8af3fa809a270ef3729f3b90b3c0cf5aa242ab0d4/queuelib-1.5.0-py2.py3-none-any.whl
Collecting w3lib>=1.17.0 (from Scrapy>=1.1.0->news-please)
  Using cached https://files.pythonhosted.org/packages/6a/45/1ba17c50a0bb16bd950c9c2b92ec60d40c8ebda9f3371ae4230c437120b6/w3lib-1.21.0-py2.py3-none-any.whl
Collecting protego>=0.1.15 (from Scrapy>=1.1.0->news-please)
Collecting cryptography>=2.0 (from Scrapy>=1.1.0->news-please)
  Using cached https://files.pythonhosted.org/packages/45/73/d18a8884de8bffdcda475728008b5b13be7fbef40a2acc81a0d5d524175d/cryptography-2.8-cp34-abi3-manylinux1_x86_64.whl
Collecting Twisted>=17.9.0; python_version >= "3.5" (from Scrapy>=1.1.0->news-please)
  Using cached https://files.pythonhosted.org/packages/88/e2/0c21fadf0dff02d145db02f24a6ed2c24993e7242d138babbca41de2f5a2/Twisted-19.10.0-cp36-cp36m-manylinux1_x86_64.whl
Collecting docutils<0.16,>=0.10 (from awscli>=1.11.117->news-please)
  Using cached https://files.pythonhosted.org/packages/22/cd/a6aa959dca619918ccb55023b4cb151949c64d4d5d55b3f4ffd7eee0c6e8/docutils-0.15.2-py3-none-any.whl
Collecting colorama<0.4.4,>=0.2.5; python_version != "3.4" (from awscli>=1.11.117->news-please)
  Using cached https://files.pythonhosted.org/packages/c9/dc/45cdef1b4d119eb96316b3117e6d5708a08029992b2fee2c143c7a0a5cc5/colorama-0.4.3-py2.py3-none-any.whl
Collecting rsa<=3.5.0,>=3.1.2 (from awscli>=1.11.117->news-please)
  Using cached https://files.pythonhosted.org/packages/e1/ae/baedc9cb175552e95f3395c43055a6a5e125ae4d48a1d7a924baca83e92e/rsa-3.4.2-py2.py3-none-any.whl
Collecting s3transfer<0.4.0,>=0.3.0 (from awscli>=1.11.117->news-please)
  Using cached https://files.pythonhosted.org/packages/69/79/e6afb3d8b0b4e96cefbdc690f741d7dd24547ff1f94240c997a26fa908d3/s3transfer-0.3.3-py2.py3-none-any.whl
Collecting botocore==1.15.3 (from awscli>=1.11.117->news-please)
  Using cached https://files.pythonhosted.org/packages/6f/c7/4a7174b08e3cf644782641325db9bbe5fb5087c21d92486e5a7fd7ad4c2e/botocore-1.15.3-py2.py3-none-any.whl
Collecting PyYAML<5.3,>=3.10 (from awscli>=1.11.117->news-please)
Collecting soupsieve>=1.2 (from beautifulsoup4>=4.3.2->news-please)
  Using cached https://files.pythonhosted.org/packages/81/94/03c0f04471fc245d08d0a99f7946ac228ca98da4fa75796c507f61e688c2/soupsieve-1.9.5-py2.py3-none-any.whl
Collecting jieba3k>=0.35.1 (from newspaper3k>=0.2.8->news-please)
Collecting nltk>=3.2.1 (from newspaper3k>=0.2.8->news-please)
Collecting feedparser>=5.2.1 (from newspaper3k>=0.2.8->news-please)
Collecting Pillow>=3.3.0 (from newspaper3k>=0.2.8->news-please)
  Using cached https://files.pythonhosted.org/packages/19/5e/23dcc0ce3cc2abe92efd3cd61d764bee6ccdf1b667a1fb566f45dc249953/Pillow-7.0.0-cp36-cp36m-manylinux1_x86_64.whl
Collecting tldextract>=2.0.1 (from newspaper3k>=0.2.8->news-please)
  Using cached https://files.pythonhosted.org/packages/fd/0e/9ab599d6e78f0340bb1d1e28ddeacb38c8bb7f91a1b0eae9a24e9603782f/tldextract-2.2.2-py2.py3-none-any.whl
Collecting tinysegmenter==0.3 (from newspaper3k>=0.2.8->news-please)
Collecting requests>=2.10.0 (from newspaper3k>=0.2.8->news-please)
  Using cached https://files.pythonhosted.org/packages/1a/70/1935c770cb3be6e3a8b78ced23d7e0f3b187f5cbfab4749523ed65d7c9b1/requests-2.23.0-py2.py3-none-any.whl
Collecting feedfinder2>=0.0.4 (from newspaper3k>=0.2.8->news-please)
Collecting urllib3>=1.21.1 (from elasticsearch>=2.4->news-please)
  Using cached https://files.pythonhosted.org/packages/e8/74/6e4f91745020f967d09332bb2b8b9b10090957334692eb88ea4afe91b77f/urllib3-1.25.8-py2.py3-none-any.whl
Collecting chardet (from readability-lxml>=0.6.2->news-please)
  Using cached https://files.pythonhosted.org/packages/bc/a9/01ffebfb562e4274b6487b4bb1ddec7ca55ec7510b22e4c51f14098443b8/chardet-3.0.4-py2.py3-none-any.whl
Collecting setuptools (from hurry.filesize>=0.9->news-please)
  Using cached https://files.pythonhosted.org/packages/3d/72/1c1498c1e908e0562b1e1cd30012580baa7d33b5b0ffdbeb5fde2462cc71/setuptools-45.2.0-py3-none-any.whl
Collecting pyasn1 (from service-identity>=16.0.0->Scrapy>=1.1.0->news-please)
  Using cached https://files.pythonhosted.org/packages/62/1e/a94a8d635fa3ce4cfc7f506003548d0a2447ae76fd5ca53932970fe3053f/pyasn1-0.4.8-py2.py3-none-any.whl
Collecting attrs>=16.0.0 (from service-identity>=16.0.0->Scrapy>=1.1.0->news-please)
  Using cached https://files.pythonhosted.org/packages/a2/db/4313ab3be961f7a763066401fb77f7748373b6094076ae2bda2806988af6/attrs-19.3.0-py2.py3-none-any.whl
Collecting pyasn1-modules (from service-identity>=16.0.0->Scrapy>=1.1.0->news-please)
  Using cached https://files.pythonhosted.org/packages/95/de/214830a981892a3e286c3794f41ae67a4495df1108c3da8a9f62159b9a9d/pyasn1_modules-0.2.8-py2.py3-none-any.whl
Collecting cffi!=1.11.3,>=1.8 (from cryptography>=2.0->Scrapy>=1.1.0->news-please)
  Using cached https://files.pythonhosted.org/packages/f1/c7/72abda280893609e1ddfff90f8064568bd8bcb2c1770a9d5bb5edb2d1fea/cffi-1.14.0-cp36-cp36m-manylinux1_x86_64.whl
Collecting constantly>=15.1 (from Twisted>=17.9.0; python_version >= "3.5"->Scrapy>=1.1.0->news-please)
  Using cached https://files.pythonhosted.org/packages/b9/65/48c1909d0c0aeae6c10213340ce682db01b48ea900a7d9fce7a7910ff318/constantly-15.1.0-py2.py3-none-any.whl
Collecting PyHamcrest>=1.9.0 (from Twisted>=17.9.0; python_version >= "3.5"->Scrapy>=1.1.0->news-please)
  Using cached https://files.pythonhosted.org/packages/ac/6c/a641af18e416e6501c10b03742387176626a1d48196100160df796f36632/PyHamcrest-2.0.0-py3-none-any.whl
Collecting Automat>=0.3.0 (from Twisted>=17.9.0; python_version >= "3.5"->Scrapy>=1.1.0->news-please)
  Using cached https://files.pythonhosted.org/packages/dd/83/5f6f3c1a562674d65efc320257bdc0873ec53147835aeef7762fe7585273/Automat-20.2.0-py2.py3-none-any.whl
Collecting incremental>=16.10.1 (from Twisted>=17.9.0; python_version >= "3.5"->Scrapy>=1.1.0->news-please)
  Using cached https://files.pythonhosted.org/packages/f5/1d/c98a587dc06e107115cf4a58b49de20b19222c83d75335a192052af4c4b7/incremental-17.5.0-py2.py3-none-any.whl
Collecting hyperlink>=17.1.1 (from Twisted>=17.9.0; python_version >= "3.5"->Scrapy>=1.1.0->news-please)
  Using cached https://files.pythonhosted.org/packages/7f/91/e916ca10a2de1cb7101a9b24da546fb90ee14629e23160086cf3361c4fb8/hyperlink-19.0.0-py2.py3-none-any.whl
Collecting jmespath<1.0.0,>=0.7.1 (from botocore==1.15.3->awscli>=1.11.117->news-please)
  Using cached https://files.pythonhosted.org/packages/83/94/7179c3832a6d45b266ddb2aac329e101367fbdb11f425f13771d27f225bb/jmespath-0.9.4-py2.py3-none-any.whl
Collecting idna (from tldextract>=2.0.1->newspaper3k>=0.2.8->news-please)
  Using cached https://files.pythonhosted.org/packages/89/e3/afebe61c546d18fb1709a61bee788254b40e736cff7271c7de5de2dc4128/idna-2.9-py2.py3-none-any.whl
Collecting requests-file>=1.4 (from tldextract>=2.0.1->newspaper3k>=0.2.8->news-please)
  Using cached https://files.pythonhosted.org/packages/23/9c/6e63c23c39e53d3df41c77a3d05a49a42c4e1383a6d2a5e3233161b89dbf/requests_file-1.4.3-py2.py3-none-any.whl
Collecting certifi>=2017.4.17 (from requests>=2.10.0->newspaper3k>=0.2.8->news-please)
  Using cached https://files.pythonhosted.org/packages/b9/63/df50cac98ea0d5b006c55a399c3bf1db9da7b5a24de7890bc9cfd5dd9e99/certifi-2019.11.28-py2.py3-none-any.whl
Collecting pycparser (from cffi!=1.11.3,>=1.8->cryptography>=2.0->Scrapy>=1.1.0->news-please)
Installing collected packages: lxml, soupsieve, beautifulsoup4, bs4, PyDispatcher, six, pycparser, cffi, cryptography, pyOpenSSL, cssselect, w3lib, parsel, setuptools, zope.interface, pyasn1, attrs, pyasn1-modules, service-identity, queuelib, protego, constantly, PyHamcrest, Automat, incremental, idna, hyperlink, Twisted, Scrapy, docutils, colorama, rsa, jmespath, urllib3, python-dateutil, botocore, s3transfer, PyYAML, awscli, PyMySQL, hjson, plac, jieba3k, nltk, feedparser, Pillow, certifi, chardet, requests, requests-file, tldextract, tinysegmenter, feedfinder2, newspaper3k, elasticsearch, warcio, readability-lxml, hurry.filesize, langdetect, dotmap, ago, news-please
Segmentation fault (core dumped)


tokotchd commented 4 years ago

Confirmed that forcing news-please==1.4.24 and cffi==1.13.2 makes the library work as intended.
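
The two pins above would typically be applied with pip3 install news-please==1.4.24 cffi==1.13.2. To check whether an environment actually matches them, something like the sketch below may help. The helper names installed_version and pin_mismatches are made up for illustration, and the fallback to pkg_resources is an assumption to cover the Python 3.6 interpreter visible in the log, where importlib.metadata is not available.

```python
from typing import Callable, Dict, Optional, Tuple

# Versions reported as working in the comment above.
PINS = {"news-please": "1.4.24", "cffi": "1.13.2"}


def installed_version(name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        from importlib.metadata import PackageNotFoundError, version  # 3.8+
    except ImportError:
        # Older interpreters (such as 3.6): fall back to setuptools.
        import pkg_resources
        try:
            return pkg_resources.get_distribution(name).version
        except pkg_resources.DistributionNotFound:
            return None
    try:
        return version(name)
    except PackageNotFoundError:
        return None


def pin_mismatches(
    pins: Dict[str, str],
    lookup: Callable[[str], Optional[str]] = installed_version,
) -> Dict[str, Tuple[str, Optional[str]]]:
    """Map each package whose version differs to (wanted, found)."""
    found = {name: lookup(name) for name in pins}
    return {name: (pins[name], got)
            for name, got in found.items() if got != pins[name]}
```

An empty result from pin_mismatches(PINS) means both packages are installed at the pinned versions; any entry shows the wanted version next to what was found (None if the package is missing).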

fhamborg commented 4 years ago

Thanks for the issue report. Do you mean that cffi could not be installed correctly, and that news-please therefore fails when running NewsPlease.from_url('any_url_here', timeout=30)?