AndyTheFactory / newspaper4k

📰 Newspaper4k, a fork of the beloved Newspaper3k. Extraction of articles, titles, and metadata from news websites.
MIT License

Problems creating a news_engine_pool from a list of URLs read from a file offline #133

Closed AndyTheFactory closed 6 months ago

AndyTheFactory commented 8 months ago

Issue by teejayvee Thu Oct 5 18:28:56 2017 Originally opened as https://github.com/codelucas/newspaper/issues/454


Hi,

Thanks so much for Newspaper and Newspaper3k! ;^D But it's still not easy enough for my Python3 skills. ;^(

I'm a very new Python programmer taking baby steps in IPython via Jupyter in an Anaconda Python v3.6 environment. I am also a 30+ year data scientist with an MS in statistics from a top-10 grad school. I was a scientific programmer for the Geophysical Fluid Dynamics Institute-NCAR on the FSU campus in Tallahassee, FL. Fortran. Lisp. Then I picked up Visual Basic, Turbo Pascal, and Objective-C, then went the SAS, SPSS, Minitab, Mathematica, Matlab route for years. Now I need the scaling and linearly non-separable aspects of some specific classification paradigms.

Please be kind. I suspect my issue is more Python than Newspaper3k!

As a naif, I've been experimenting with pooled, limited-thread news_engines. I don't seem to know how to take my list of configured news_engines to a pooled news_engine, like this:

(wurls holds the URLs of the 12 current news sites)

```
news_pool_engines = []
for npurl in wurls:
    np = newspaper.build(npurl, memoize_articles=False, language='en')
    npe = news_pool_engines.append(copy.copy(np))
```

I thought I could append the 12 news_engine objects using copy.copy via news_pool_engines.append(copy.copy(np)).

Apparently this is not happening since npe is empty!
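
For what it's worth, that part is plain Python behavior rather than anything Newspaper-specific: list.append() mutates the list in place and returns None, so whatever name is bound to its return value will always be None. A minimal sketch, using a hypothetical two-URL subset of wurls:

```
import copy
import newspaper

wurls = ['http://www.motherjones.com/', 'https://www.vox.com/']  # placeholder subset

news_pool_engines = []
for npurl in wurls:
    np = newspaper.build(npurl, memoize_articles=False, language='en')
    npe = news_pool_engines.append(copy.copy(np))  # append() returns None, so npe is None
    print(npe)                                     # None on every iteration

print(len(news_pool_engines))  # 2 -- the list itself is what holds the Source objects
```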

I had also tried pooling 3 at a time and stepping through a shuffled list, since for my 'production' list of 90 URLs I figured I could step through 3 at a time, thinking there was a limit to how many news_engine objects one could cram into a news_engine_pool. That approach had the same issue.

Can you help me? The documentation for news_engine pooling is a bit thin for my level of Python chops. I really do not know what to try next.

Thanks so much!!!

====== use case ======

Our text-analysis-and-classification software, Readware, has a current build with all the REST interfaces out (for reasons that will remain unsaid), leaving me with the arduous (for me) task of spidering news sites for the first time ever to create local archives of <title> and <body> in HTML format (download, parse, then build a stripped-down ?.html as input for the semiotic-semantics engine). We are not interested in all the HTML, given the wretched state of all the DHTML form and content-recommendation garbage inside these pages. I am a naif in this.

====== content of usa_news_brief_1.txt ======

http://www.motherjones.com/
https://www.huffingtonpost.com/
https://www.theatlantic.com/
https://www.wsj.com/news/us
https://www.csmonitor.com/USA
https://www.vox.com/
http://www.nationalreview.com/archives
https://www.usatoday.com/news/nation/
https://news.google.com/news/headlines?ned=us&hl=en
https://www.yahoo.com/news/us/
http://www.theblaze.com/
https://www.citylab.com/posts/

====== errors ======

```
TypeError                                 Traceback (most recent call last)
<ipython-input-1-3dc729a8b878> in <module>()
     25     npe = news_pool_engines.append(copy.copy(np))
     26
---> 27 newspool = news_pool.set(npe, threads_per_source=2)
     28 pooled_news = newspool.join()
     29

/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/newspaper/mthreading.py in set(self, paper_list, threads_per_source)
    101     def set(self, paper_list, threads_per_source=1):
    102         self.papers = paper_list
--> 103         num_threads = threads_per_source * len(self.papers)
    104         timeout = self.config.thread_timeout_seconds
    105         self.pool = ThreadPool(num_threads, timeout)

TypeError: object of type 'NoneType' has no len()
```

====== code ======

```
import time
import datetime
import newspaper
from newspaper import news_pool
import csv
import os
import random
import copy

pubdate = datetime.datetime.now().strftime("%d%b%Y")
loc = '/Users/cosmo/Rawdata/Blofeld/test/NewsHtmlTest/'
sports_tokens = ['basketball','baseball','football','soccer','','coaches','sports','preps']

with open('/Users/cosmo/Rawdata/Blofeld/usa_news_brief_1.txt') as f:
    wurls = f.read().splitlines()

shuffles = 1
while shuffles < 4:
    random.shuffle(wurls)
    shuffles += 1

news_pool_engines = []
for npurl in wurls:
    np = newspaper.build(npurl, memoize_articles=False, language='en')
    npe = news_pool_engines.append(copy.copy(np))

newspool = news_pool.set(npe, threads_per_source=2)
pooled_news = newspool.join()

for article in enumerate(pooled_news.articles):
    try:
        article.download()
    except:
        continue
    try:
        article.parse()
    except:
        continue
    else:
        narticle = article.text
        ntitle = article.title
        nurl = article.url
        turl = article.source_url
        if sports_tokens in narticle:
            continue
        if isinstance(article.publish_date, datetime.time) == True:
            pubdate = article.publish_date.strftime("%d%b%Y")
        if 'https://' in turl:
            turl = turl.replace('www.','')
            turl = turl.replace('https://','')
        else:
            turl = turl.replace('www.','')
            turl = turl.replace('http://','')
        lfa = turl.translate(str.maketrans('./-','___'))
        lfw = loc + lfa + pubdate + '_' + str(count).zfill(5) + '.html'
        article_file = open(lfw, 'w')
        html_string = ('<!DOCTYPE html><html><head><title>' + ntitle + '<br><br>' + '' + pubdate
                       + '\n\n' + nurl + '\n\n' + narticle + '\n')
        try:
            article_file.write(html_string)
        except:
            continue
        else:
            article_file.flush()
            os.fsync(article_file.fileno())
            article_file.close()

print('finished crawling and archiving all locales and papers!')
```
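
For context on the traceback: set() receives npe, which is None for the reason above, and len(None) fails. Judging from the set() signature shown in the traceback and the newspaper3k docs of that era, the pooling calls would presumably look like the sketch below; it assumes the same news_pool_engines list built by the loop above, and that join() downloads everything in place and returns nothing, so parsing then happens per source:

```
from newspaper import news_pool

# news_pool_engines is the list of Source objects built by the loop above
news_pool.set(news_pool_engines, threads_per_source=2)  # pass the list itself, not append()'s result
news_pool.join()  # blocks until the pooled downloads finish; returns None

for source in news_pool_engines:
    for article in source.articles:
        article.parse()  # download() already happened in the pool; parse() is still needed
        print(article.title, article.url)
```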

AndyTheFactory commented 6 months ago

The multithreading model changed in v0.9.2.
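
For anyone landing here on newspaper4k itself: if I read the current README correctly, the news_pool approach above was replaced by a fetch_news() helper in newspaper.mthreading. The import path and signature below are assumptions based on that README and may differ in your installed version:

```
# Assumed newspaper4k (>= 0.9.2) API -- name, module path, and signature taken from the
# project README; verify against the version you actually have installed.
from newspaper.mthreading import fetch_news

urls = [
    "https://www.vox.com/",
    "https://www.theatlantic.com/",
]

results = fetch_news(urls, threads=4)  # fetches the given sources concurrently
```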