Issue by teejayvee, Thu Oct 5 18:28:56 2017. Originally opened as https://github.com/codelucas/newspaper/issues/454

Hi,

Thanks so much for Newspaper and Newspaper3k! ;^D But it's still not easy enough for my Python3 skills. ;^(
I'm a very new Python programmer taking baby steps in IPython via Jupyter in an Anaconda Python 3.6 environment. I am also a 30+ year data scientist with an MS in statistics from a top-10 grad school. I was a scientific programmer for the Geophysical Fluid Dynamics Institute-NCAR on the FSU campus in Tallahassee, FL: Fortran, then Lisp. Then I picked up Visual Basic, Turbo Pascal, and Objective-C, then went the SAS, SPSS, Minitab, Mathematica, Matlab route for years. Now I need the scaling and linearly non-separable aspects of some specific classification paradigms.
Please be kind. I suspect my issue is more Python than Newspaper3k!
As a naif, I've been experimenting with pooled, limited-thread news_engines. I don't seem to know how to take my list of configured news_engines to a pooled news_engine, like this (wurls is the list of URLs for the 12 current news sites):
```python
import copy
import newspaper

news_pool_engines = []
for npurl in wurls:
    # build a configured source ("news_engine") for each site URL
    np = newspaper.build(npurl, memoize_articles=False, language='en')
    npe = news_pool_engines.append(copy.copy(np))
```
I thought I could append the 12 news_engine objects using copy.copy, via news_pool_engines.append(copy.copy(np)).
Apparently this is not happening since npe is empty!
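In case it helps, here is roughly what I believe the pooled flow is supposed to look like, pieced together from the docs; the threads_per_source value is just my guess, and I may well be misreading the API. I also suspect part of my confusion is that list.append returns None, so capturing its return value in npe would always give None:

```python
import newspaper
from newspaper import news_pool

# one configured source per site URL; memoize_articles=False so repeat runs re-crawl
papers = [newspaper.build(u, memoize_articles=False, language='en') for u in wurls]

# hand the whole list to the pool; threads_per_source=2 is my guess, not a documented requirement
news_pool.set(papers, threads_per_source=2)
news_pool.join()  # blocks until every source's articles have been downloaded
```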
I had tried pooling 3 at a time and stepping through a shuffled list, since my 'production' list of 90 URLs could be stepped through 3 at a time, thinking there was a limit to how many news_engine objects one could cram into a news_engine_pool. That approach had the same issue.
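For reference, the batching loop looked roughly like this (the batch size of 3 and the variable names are mine):

```python
import random

import newspaper
from newspaper import news_pool

random.shuffle(wurls)

# walk the shuffled URL list three sources at a time
for i in range(0, len(wurls), 3):
    batch_papers = [newspaper.build(u, memoize_articles=False, language='en')
                    for u in wurls[i:i + 3]]
    news_pool.set(batch_papers, threads_per_source=2)
    news_pool.join()
```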
Can you help me? The documentation for news_engine pooling is a bit thin for my level of Python chops. I really do not know what to try next.
Thanks so much!!!
====== use case ============
Our text-analysis-and-classification software, Readware, has a current build with all the REST interfaces out (for reasons that will remain unsaid), leaving me with the arduous (for me) task of spidering news sites for the first time ever to create local archives of articles in HTML format (download, parse, then build a stripped-down ?.html as input for the semiotic-semantics engine). We are not interested in all the HTML, given the wretched state of all the DHTML-form content-recommendation garbage inside these pages. I am a naif in this.
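To make the goal concrete, here is a rough sketch of the per-article flow I am after; the file naming and the minimal HTML wrapper are placeholders of my own, not anything Newspaper3k prescribes:

```python
import newspaper

paper = newspaper.build('https://www.theatlantic.com/', memoize_articles=False, language='en')

for idx, article in enumerate(paper.articles):
    try:
        article.download()
        article.parse()
    except Exception:
        continue  # skip articles that fail to download or parse

    # deliberately stripped-down HTML: just the title, source URL, and body text
    html_string = ('<html><head><title>' + (article.title or '') + '</title></head>'
                   '<body><p>' + article.url + '</p><pre>' + article.text + '</pre></body></html>')

    with open('article_' + str(idx) + '.html', 'w', encoding='utf-8') as article_file:
        article_file.write(html_string)
```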
====== content of usa_news_brief_1.txt ======
http://www.motherjones.com/
https://www.huffingtonpost.com/
https://www.theatlantic.com/
https://www.wsj.com/news/us
https://www.csmonitor.com/USA
https://www.vox.com/
http://www.nationalreview.com/archives
https://www.usatoday.com/news/nation/
https://news.google.com/news/headlines?ned=us&hl=en
https://www.yahoo.com/news/us/
http://www.theblaze.com/
https://www.citylab.com/posts/
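For completeness, this is how I load that file into wurls, assuming one URL per line:

```python
# read usa_news_brief_1.txt into the wurls list, one URL per line
with open('usa_news_brief_1.txt', 'r', encoding='utf-8') as f:
    wurls = [line.strip() for line in f if line.strip()]
```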
====== errors ======
error:

TypeError                                 Traceback (most recent call last)

Only fragments of the failing line are left: '+', '+nurl+', '+narticle+' (the string literals between the + operators are missing). The tail of the archiving loop was:

```python
# this block sits inside the article-archiving loop; 'continue' skips failed writes
try:
    article_file.write(html_string)
except:
    continue
else:
    article_file.flush()
    os.fsync(article_file.fileno())
    article_file.close()

print('finished crawling and archiving all locales and papers!')
```