minuscorp opened this issue 9 years ago
In retrospect it was never a good idea to force multithreading in this library; that should have been left to the user. I think having a config option to make everything single-threaded would be fair. Will add that to the priorities list.
I think the library is flexible enough that you can manually repeat the steps of `newspaper.build`. Here is a similar example showing that you can use the `article.set_html()` method instead of `article.download()` and replace newspaper's downloading framework with your own, in this case `asyncio` with `aiohttp`.
```python
urls = [
    'http://www.baltimorenews.net/index.php/sid/234363921',
    'http://www.baltimorenews.net/index.php/sid/234323971',
    'http://www.atlantanews.net/index.php/sid/234323891',
    'http://www.wpbf.com/news/funeral-held-for-gabby-desouza/33874572',
    'http://www.tennessean.com/story/news/politics/2015/06/30/obama-focus-future-health-care-burwell-says/29540753/',
    'http://www.atlantanews.net/index.php/sid/234323901',
    'http://www.baltimorenews.net/index.php/sid/234323975',
    'http://www.utsandiego.com/news/2015/jun/30/backcountry-lilac-development-opposition-general/',
    'http://www.newsnet5.com/newsy/apples-ebook-pricing-scandal-a-long-road-to-a-small-fine',
    'http://www.baltimorenews.net/index.php/sid/234323977',
    'http://www.wsmv.com/story/29447077/trying-to-make-hitting-skid-disappear-maddon-hires-magician',
    'http://www.atlantanews.net/index.php/sid/234323913',
    'http://www.baltimorenews.net/index.php/sid/234323979',
    'http://www.newsleader.com/story/sports/2015/06/30/virginia-baseball-fan-happy-proven-wrong/29540965/',
    'http://www.baltimorenews.net/index.php/sid/234323981',
    'http://www.baltimorenews.net/index.php/sid/234323987',
    'http://www.mcall.com/entertainment/dining/mc-fratzolas-pizzeria-bethlehem-review-20150630-story.html',
    'http://www.atlantanews.net/index.php/sid/234323911',
    'http://www.baltimorenews.net/index.php/sid/234323985',
    'http://www.atlantanews.net/index.php/sid/234323887',
    'http://wtvr.com/2015/06/30/man-who-vandalized-confederate-statue-deeply-regrets-actions/',
    'http://www.baltimorenews.net/index.php/sid/234323923',
    'http://www.witn.com/home/headlines/Goldsboro-teens-charged-with-shooting-into-home-311067541.html',
    'http://www.atlantanews.net/index.php/sid/234323995'
]
```
```python
# ---------------------------------------------------------------- #
print("Synchronous")
import time, newspaper, hashlib

start_time = time.time()
for url in urls:
    print(url)
    article = newspaper.Article(url)
    try:
        article.download()
        article.parse()
    except newspaper.article.ArticleException:
        continue
    with open(hashlib.md5(url.encode('utf-8')).hexdigest() + ".txt", "w") as f:
        f.write(article.text)
print("sync", time.time() - start_time, "\n")

# ---------------------------------------------------------------- #
print("Asynchronous")
import time, asyncio, aiohttp, newspaper, hashlib

start_time = time.time()

async def get_article(session, url):
    print(url)
    # Fetch the raw HTML ourselves instead of calling article.download()
    async with session.get(url) as response:
        content = await response.read()
    article = newspaper.Article(url)
    article.set_html(content)
    try:
        article.parse()
    except newspaper.article.ArticleException:
        return
    with open(hashlib.md5(url.encode('utf-8')).hexdigest() + ".txt", "w") as f:
        f.write(article.text)

async def main(urls):
    # The module-level aiohttp.get() was removed in aiohttp 2.0;
    # share one ClientSession across all requests instead.
    async with aiohttp.ClientSession() as session:
        tasks = [asyncio.ensure_future(get_article(session, url)) for url in urls]
        await asyncio.wait(tasks)

loop = asyncio.get_event_loop()
loop.run_until_complete(main(urls))
loop.close()
print("async", time.time() - start_time, "\n")
```
I checked the issue. It seems to occur only for newspaper on Python 2; Python 3 (newspaper3k) doesn't produce any zombie threads.
@AlJohri I tried to implement the asynchronous portion of the code for a list of more than 1K URLs. I am using Article from the newspaper library to extract important information such as title, text, keywords, and summary from news articles. The code worked fine for around 50 URLs, but when more URLs are passed, I get no response for most of them. I tried returning the raw `html` to check whether `session.get` was generating a response in the first place, which it was. I believe it has something to do with Article from the newspaper library and asyncio. However, I can't find the solution, as asyncio is very new to me.
```python
import asyncio
import aiohttp
import pandas as pd
from newspaper import Article

class ContentExtractor():
    def __init__(self, urls):
        self.urls = urls

    def content_to_dataframe(self):
        async def check_url_get_content(session, url):
            result = {}
            try:
                async with session.get(url) as resp:
                    if resp.status == 200:
                        html = await resp.read()
                    else:
                        html = 'none'
            except Exception:
                html = 'none'
            article = Article(url)
            article.set_html(html)
            try:
                article.parse()
                #article.nlp()
                title = article.title
                text = article.text
                #keywords = article.keywords
                #summary = article.summary
            except Exception:
                title = 'none'
                text = 'none'
                #keywords = 'none'
                #summary = 'none'
            result['Title'] = title
            result['Text'] = text
            #result['Keywords'] = keywords
            #result['Summary'] = summary
            return result

        async def main():
            tasks = []
            async with aiohttp.ClientSession() as session:
                for url in self.urls:
                    tasks.append(asyncio.ensure_future(check_url_get_content(session, url)))
                return await asyncio.gather(*tasks)

        loop = asyncio.new_event_loop()
        results = loop.run_until_complete(main())
        loop.close()
        return pd.DataFrame(results)
```

(Note: the original snippet defined `check_url_get_content` but scheduled `get_content`, which would raise a `NameError`; the names are unified above.)
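One likely cause of stalls with 1K+ URLs is that `asyncio.gather` opens every connection at once. A common remedy is to cap concurrency with an `asyncio.Semaphore`. This is a minimal, self-contained sketch, not the library's API: `fetch` here is a stand-in for the real `session.get` call, and `CONCURRENCY` is an arbitrary value you would tune.

```python
import asyncio

# Hypothetical limit: at most this many fetches run at the same time.
CONCURRENCY = 20

async def fetch(url):
    # Stand-in for "async with session.get(url) as resp: ...".
    await asyncio.sleep(0)  # simulate I/O
    return f"<html>{url}</html>"

async def bounded_fetch(semaphore, url):
    async with semaphore:  # at most CONCURRENCY coroutines enter here
        return await fetch(url)

async def crawl(urls):
    semaphore = asyncio.Semaphore(CONCURRENCY)
    tasks = [bounded_fetch(semaphore, u) for u in urls]
    # gather preserves the input order of results
    return await asyncio.gather(*tasks)

results = asyncio.run(crawl([f"http://example.com/{i}" for i in range(100)]))
```

With a real `aiohttp.ClientSession`, the `session.get` call would simply replace the body of `fetch`.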
Is there any way of preventing the library from using a pool of threads to get the news? It's a cool feature when you rely on speed and that kind of thing, but you can't combine it with an already asynchronous library like Celery: after several hours of downloading articles, the thread pool is never cleaned up (because Celery runs your code in its own worker thread and will not wait for any thread you create inside a task), and you end up getting a threading error telling you that no new thread can be created. When I inspect the process tree on my computer, I see nearly 500 live threads linked to Celery tasks that never terminate.
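While waiting for a single-threaded config option, a stdlib-only watchdog can at least make the leak visible before the "can't start new thread" error hits. This is a hypothetical sketch (the function name and the `limit` value are arbitrary choices, not part of newspaper or Celery) that you could call periodically inside a long-running worker:

```python
import threading

def check_thread_leak(limit=100):
    """Raise if the process has accumulated suspiciously many live threads.

    Useful inside a long-running worker (e.g. a Celery task) to catch
    thread leaks like the ~500 stuck threads described above before the
    interpreter refuses to create new threads. The limit is arbitrary.
    """
    count = threading.active_count()
    if count > limit:
        sample = [t.name for t in threading.enumerate()][:10]
        raise RuntimeError(f"{count} live threads (sample: {sample})")
    return count
```

Logging `threading.enumerate()` names also helps confirm whether the leaked threads belong to newspaper's download pool or to something else.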