AndyTheFactory / newspaper4k

📰 Newspaper4k a fork of the beloved Newspaper3k. Extraction of articles, titles, and metadata from news websites.
MIT License

URL parsing compatibility #626

Open Jacey0 opened 7 months ago

Jacey0 commented 7 months ago

Hi,

I'm having trouble setting up the environment for this. I'm using a conda environment on Windows and get the same problem with python 3.9, 3.10 and 3.11. I also made sure to pip install with the requirements.txt here before running pip install newspaper4k.

I first run into this error:

  File "c:\Users...\scrape_from_urls.py", line 1, in <module>
    import newspaper
  File "C:\Users...\site-packages\newspaper\__init__.py", line 17, in <module>
    from .api import (
  File "C:\Users...\site-packages\newspaper\api.py", line 11, in <module>
    from newspaper.article import Article
  File "C:\Users...\site-packages\newspaper\article.py", line 28, in <module>
    from .extractors import ContentExtractor
  File "C:\Users...\site-packages\newspaper\extractors\__init__.py", line 8, in <module>
    from newspaper.extractors.content_extractor import ContentExtractor
  File "C:\Users...\site-packages\newspaper\extractors\content_extractor.py", line 8, in <module>
    from newspaper.extractors.articlebody_extractor import ArticleBodyExtractor
  File "C:\Users...\site-packages\newspaper\extractors\articlebody_extractor.py", line 8, in <module>
    import newspaper.extractors.defines as defines
  File "C:\Users...\site-packages\newspaper\extractors\defines.py", line 2, in <module>
    from typing_extensions import TypedDict, NotRequired
ModuleNotFoundError: No module named 'typing_extensions'

No biggie, I just need to pip install typing-extensions and the import works. But then I hit another error when I call newspaper.article with any URL:

  File "c:\Users...\scrape_from_urls.py", line 7, in <module>
    article = newspaper.article(url)
  File "C:\Users...\site-packages\newspaper\__init__.py", line 61, in article
    a = Article(url, language=language, **kwargs)
  File "C:\Users...\site-packages\newspaper\article.py", line 195, in __init__
    scheme = urls.get_scheme(url)
  File "C:\Users...\site-packages\newspaper\urls.py", line 370, in get_scheme
    return urlparse(abs_url, **kwargs).scheme
  File "c:\Users...\lib\urllib\parse.py", line 399, in urlparse
    url, scheme, _coerce_result = _coerce_args(url, scheme)
  File "c:\Users...\lib\urllib\parse.py", line 136, in _coerce_args
    return _decode_args(args) + (_encode_result,)
  File "c:\Users...\lib\urllib\parse.py", line 120, in _decode_args
    return tuple(x.decode(encoding, errors) if x else '' for x in args)
  File "c:\Users...\lib\urllib\parse.py", line 120, in <genexpr>
    return tuple(x.decode(encoding, errors) if x else '' for x in args)
AttributeError: 'builtin_function_or_method' object has no attribute 'decode'

I also tried newspaper3k and got a similar AttributeError, so I'm wondering if I should be using a different urllib version (I have urllib3==1.26.18).

Would be great if these could be added to the requirements.txt. Thank you.
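
For reference, the last frames of the second traceback are in urllib\parse.py, which is part of the Python standard library and is separate from the urllib3 package. The same message can be reproduced without newspaper4k by handing urlparse a value that is neither str nor bytes. The sketch below assumes, purely as a guess since scrape_from_urls.py is not shown, that url was bound to a built-in function (for example input without parentheses). try_get_scheme is a hypothetical helper for illustration, not part of newspaper4k.

```python
from urllib.parse import urlparse


def try_get_scheme(url):
    """Return (scheme, None) on success or (None, error message) on failure.

    Hypothetical helper for illustration; it mirrors the urlparse(...).scheme
    call visible in the traceback but is not newspaper4k code.
    """
    try:
        return urlparse(url).scheme, None
    except AttributeError as exc:
        # Non-str input is treated as bytes-like, so urlparse calls .decode() on it.
        return None, str(exc)


# A real string parses normally.
print(try_get_scheme("https://example.com/news"))

# ASSUMPTION: the script passed a built-in function instead of a string,
# e.g. `url = input` rather than `url = input()`. Nothing is called here;
# the function object itself is passed.
print(try_get_scheme(input))
```

If the second call prints the same AttributeError, checking type(url) just before the newspaper.article(url) call would show whether the script is actually passing a string.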

changchiyou commented 5 months ago

I encountered the error ModuleNotFoundError: No module named 'typing_extensions' while using M1 / Miniconda 3.10. However, I was able to resolve it by executing pip install typing_extensions. Following this, I did not encounter the error AttributeError: 'builtin_function_or_method' object has no attribute 'decode'.
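
A quick way to confirm whether the modules named in the tracebacks are visible to the interpreter that runs the script is sketched below. It only uses the standard library. missing_modules is a hypothetical helper name, and the list of module names is taken from this thread rather than from the project's requirements.txt.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Module names taken from the tracebacks in this thread.
print(missing_modules(["typing_extensions", "newspaper"]))
```

Running this with the same conda environment's python shows which packages still need to be installed there; an empty list means both were found.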