codelucas / newspaper

newspaper3k is a news, full-text, and article metadata extraction in Python 3. Advanced docs:
https://goo.gl/VX41yK
MIT License
14.09k stars 2.11k forks

Unable to pull articles from list of article URL's #936

Open Unique201 opened 2 years ago

Unique201 commented 2 years ago

I have stored the URLs of previously downloaded articles from various sites in a list. I am trying to iterate through the list to download each one, but it throws an ArticleException. (I am pretty new to this, so bear with me if it is a stupid mistake.)

The URL is correct and works manually if I enter it with:

article = Article("https://www.infowars.com/posts/is-nato-a-dead-man-walking/")
article.download()
article.parse()
article.nlp()

I have tried the following so far:

url_list = url_df.values.tolist()

article = Article(str(url_list[1]))
article.download()
article.parse()
article.nlp()
print(article.url)

article = Article(url_list[1])
article.download()
article.parse()
article.nlp()
print(article.url)

article = Article(json.dumps(url_list[1]))
article.download()
article.parse()
article.nlp()
print(article.url)

Each attempt throws the same error, and I am not sure how to fix it:

ArticleException                          Traceback (most recent call last)
in
      4 article = Article(url)
      5 article.download()
----> 6 article.parse()
      7 article.nlp()

~\Anaconda3\lib\site-packages\newspaper\article.py in parse(self)
    189
    190     def parse(self):
--> 191         self.throw_if_not_downloaded_verbose()
    192
    193         self.doc = self.config.get_parser().fromstring(self.html)

~\Anaconda3\lib\site-packages\newspaper\article.py in throw_if_not_downloaded_verbose(self)
    529             raise ArticleException('You must `download()` an article first!')
    530         elif self.download_state == ArticleDownloadState.FAILED_RESPONSE:
--> 531             raise ArticleException('Article `download()` failed with %s on URL %s' %
    532                       (self.download_exception_msg, self.url))
    533

ArticleException: Article `download()` failed with No connection adapters were found for ':/"https:/www.infowars.com/posts/is-nato-a-dead-man-walking/"' on URL :/"https:/www.infowars.com/posts/is-nato-a-dead-man-walking/"

johnbumgarner commented 2 years ago

Do you see that the format of your URLs is wrong?

bad URL: https:/www.infowars.com/posts/is-nato-a-dead-man-walking/

good URL: https://www.infowars.com/posts/is-nato-a-dead-man-walking/

I haven't tried to parse this source, so I don't know which data elements extract and which ones don't.
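
One possible cause of the malformed URL, assuming url_df has a single column of URLs: url_df.values.tolist() returns one list per row, so url_list[1] is itself a list, and str() or json.dumps() turns it into text with brackets and quote characters around the URL. A minimal sketch of cleaning the list before handing each entry to Article is below. The helper name flatten_urls and the sample rows are made up for illustration, and the newspaper calls are left commented out because they need network access.

```python
from urllib.parse import urlparse


def flatten_urls(rows):
    """Return plain URL strings from rows such as [['https://a'], ['https://b']]."""
    urls = []
    for row in rows:
        # DataFrame.values.tolist() yields one list per row, even for one column.
        cells = row if isinstance(row, (list, tuple)) else [row]
        for cell in cells:
            # Drop surrounding whitespace and stray quote characters.
            url = str(cell).strip().strip('"\'')
            parts = urlparse(url)
            # Keep only entries with an http(s) scheme and a host name.
            if parts.scheme in ("http", "https") and parts.netloc:
                urls.append(url)
    return urls


# Stand-in for url_df.values.tolist() on a one-column DataFrame (assumed shape).
rows = [["https://www.infowars.com/posts/is-nato-a-dead-man-walking/"]]

# str() on a row keeps the brackets and quotes, which is not a usable URL.
print(str(rows[0]))

for url in flatten_urls(rows):
    print(url)
    # from newspaper import Article
    # article = Article(url)
    # article.download()
    # article.parse()
```

If the column has a name, url_df["url"].tolist() (with "url" standing in for the real column name) should give a flat list of strings directly, without the nested rows.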