codelucas / newspaper

newspaper3k is a news, full-text, and article metadata extraction library for Python 3. Advanced docs:
https://goo.gl/VX41yK
MIT License

Synchronous mode. #120

Open minuscorp opened 9 years ago

minuscorp commented 9 years ago

Is there any way to prevent the library from using a pool of threads to fetch the news? It's a nice feature when you care about speed, but it can't be combined with an already-asynchronous library like Celery: after several hours of downloading articles, the thread pool is never cleaned up (Celery tasks run in their own threads and will not join any threads you create inside a task), and you eventually get a threading error saying that no new thread can be started. When I inspect the process tree on my machine, I see close to 500 live threads attached to Celery tasks that never terminate.

Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/celery/app/trace.py", line 240, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/celery/app/trace.py", line 437, in __protected_call__
    return self.run(*args, **kwargs)
  File "/home/revuelta/web/piar/web/gui/tasks.py", line 654, in newspaper_extractor_task
    extractor.newspaper_extraction()
  File "/home/revuelta/web/piar/web/libraries/NewsExtractor.py", line 172, in newspaper_extraction
    news_source = newspaper.build(self.base_url, config = self.config)
  File "/usr/local/lib/python2.7/dist-packages/newspaper/api.py", line 31, in build
    s.build()
  File "/usr/local/lib/python2.7/dist-packages/newspaper/source.py", line 96, in build
    self.download_categories()  # mthread
  File "/usr/local/lib/python2.7/dist-packages/newspaper/source.py", line 155, in download_categories
    requests = network.multithread_request(category_urls, self.config)
  File "/usr/local/lib/python2.7/dist-packages/newspaper/network.py", line 97, in multithread_request
    pool = ThreadPool(num_threads)
  File "/usr/local/lib/python2.7/dist-packages/newspaper/mthreading.py", line 49, in __init__
    Worker(self.tasks)
  File "/usr/local/lib/python2.7/dist-packages/newspaper/mthreading.py", line 25, in __init__
    self.start()
  File "/usr/lib/python2.7/threading.py", line 495, in start
    _start_new_thread(self.__bootstrap, ())
error: can't start new thread
codelucas commented 9 years ago

In retrospect it was never a good idea to force multithreading in this library; that should have been left to the user. I think having a config option to make everything single-threaded would be fair. Will add that to the priorities list.
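
In the meantime, a possible partial workaround (not a true synchronous mode) is to shrink the internal pool through the existing configuration object. This is only a sketch, assuming Config exposes a number_threads setting that the build step honors; newspaper will still spawn that one worker thread, so it does not fully avoid the problem described above:

import newspaper

# Sketch: reduce newspaper's internal thread pool to a single worker.
# Assumes Config exposes number_threads; this limits, but does not remove,
# thread creation during newspaper.build().
config = newspaper.Config()
config.number_threads = 1
config.memoize_articles = False

source = newspaper.build('http://cnn.com', config=config)
print(source.size())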

AlJohri commented 8 years ago

I think the library is flexible enough that you can manually repeat the steps of newspaper.build. Here is a similar example showing that you can use the article.set_html() method instead of article.download() and replace newspaper's downloading framework with your own - in this case asyncio with aiohttp.

urls = [
    'http://www.baltimorenews.net/index.php/sid/234363921',
    'http://www.baltimorenews.net/index.php/sid/234323971',
    'http://www.atlantanews.net/index.php/sid/234323891',
    'http://www.wpbf.com/news/funeral-held-for-gabby-desouza/33874572',
    'http://www.tennessean.com/story/news/politics/2015/06/30/obama-focus-future-health-care-burwell-says/29540753/',
    'http://www.atlantanews.net/index.php/sid/234323901',
    'http://www.baltimorenews.net/index.php/sid/234323975',
    'http://www.utsandiego.com/news/2015/jun/30/backcountry-lilac-development-opposition-general/',
    'http://www.newsnet5.com/newsy/apples-ebook-pricing-scandal-a-long-road-to-a-small-fine',
    'http://www.baltimorenews.net/index.php/sid/234323977',
    'http://www.wsmv.com/story/29447077/trying-to-make-hitting-skid-disappear-maddon-hires-magician',
    'http://www.atlantanews.net/index.php/sid/234323913',
    'http://www.baltimorenews.net/index.php/sid/234323979',
    'http://www.newsleader.com/story/sports/2015/06/30/virginia-baseball-fan-happy-proven-wrong/29540965/',
    'http://www.baltimorenews.net/index.php/sid/234323981',
    'http://www.baltimorenews.net/index.php/sid/234323987',
    'http://www.mcall.com/entertainment/dining/mc-fratzolas-pizzeria-bethlehem-review-20150630-story.html',
    'http://www.atlantanews.net/index.php/sid/234323911',
    'http://www.baltimorenews.net/index.php/sid/234323985',
    'http://www.atlantanews.net/index.php/sid/234323887',
    'http://wtvr.com/2015/06/30/man-who-vandalized-confederate-statue-deeply-regrets-actions/',
    'http://www.baltimorenews.net/index.php/sid/234323923',
    'http://www.witn.com/home/headlines/Goldsboro-teens-charged-with-shooting-into-home-311067541.html',
    'http://www.atlantanews.net/index.php/sid/234323995'
]

# ---------------------------------------------------------------- #

print("Synchronous")

import time, newspaper, hashlib

start_time = time.time()

for url in urls:
    print(url)
    article = newspaper.Article(url)
    try:
        article.download()
        article.parse()
    except newspaper.article.ArticleException:
        continue

    with open(hashlib.md5(url.encode('utf-8')).hexdigest() + ".txt", "w") as f:
        f.write(article.text)

print("sync", time.time() - start_time, "\n")

# ---------------------------------------------------------------- #

print("Aynchronous")

import time, asyncio, aiohttp, newspaper, hashlib

start_time = time.time()

async def get_article(url):
    print(url)

    async with aiohttp.get(url) as response:
        content = await response.read()
        article = newspaper.Article(url)
        article.set_html(content)
        try:
            article.parse()
        except newspaper.article.ArticleException:
            return

        with open(hashlib.md5(url.encode('utf-8')).hexdigest() + ".txt", "w") as f:
            f.write(article.text)

async def main(urls):
    tasks = []
    for url in urls:
        task = asyncio.ensure_future(get_article(url))
        tasks.append(task)
    await asyncio.wait(tasks)

loop = asyncio.get_event_loop()
loop.run_until_complete(main(urls))
loop.close()

print("async", time.time() - start_time, "\n")
korycins commented 6 years ago

I checked the issue. It seems to occur only with newspaper on Python 2; Python 3 (i.e. newspaper3k) doesn't produce any zombie threads.

aashishkhadka1992 commented 5 years ago

@AlJohri I tried to implement the asynchronous portion of the code for a list of more than 1K URLs. I am using Article from the newspaper library to extract information such as title, text, keywords, and summary from news articles. The code worked fine for around 50 URLs, but when more URLs were passed I stopped getting responses for most of them. I tried returning the raw 'html' to check whether aiohttp's session.get was producing a response in the first place, and it was. I believe it has something to do with Article from the newspaper library and asyncio, but I can't find the solution, as asyncio is very new to me.

import asyncio
import aiohttp
import pandas as pd
from newspaper import Article


class ContentExtractor():

    def __init__(self, urls):
        self.urls = urls

    def content_to_dataframe(self):

        async def get_content(session, url):
            result = {}
            try:
                async with session.get(url) as resp:
                    if resp.status == 200:
                        html = await resp.read()
                    else:
                        html = 'none'
            except Exception:
                html = 'none'

            article = Article(url)
            article.set_html(html)
            try:
                article.parse()
                # article.nlp()

                title = article.title
                text = article.text
                # keywords = article.keywords
                # summary = article.summary

            except Exception:
                title = 'none'
                text = 'none'
                # keywords = 'none'
                # summary = 'none'

            result['Title'] = title
            result['Text'] = text
            # result['Keywords'] = keywords
            # result['Summary'] = summary

            return result

        async def main():
            urls = self.urls
            tasks = []
            async with aiohttp.ClientSession() as session:
                for url in urls:
                    task = asyncio.ensure_future(get_content(session, url))
                    tasks.append(task)

                responses = await asyncio.gather(*tasks)

            return responses

        loop = asyncio.new_event_loop()
        results = loop.run_until_complete(main())
        loop.close()
        df = pd.DataFrame(results)

        return df
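
One likely culprit at this scale is unbounded concurrency: all 1K+ requests are opened at once with no per-request timeout, so slow or overloaded hosts cause most of them to fail or hang. Below is a minimal sketch of one way to bound the fetching, assuming a recent aiohttp; the fetch_with_limit helper and the max_concurrency value are illustrative names, not part of newspaper or aiohttp:

import asyncio
import aiohttp

async def fetch_with_limit(semaphore, session, url):
    # Limit how many requests are in flight at any one time and give each
    # request an explicit timeout, so a few slow hosts cannot stall the rest.
    async with semaphore:
        try:
            async with session.get(url, timeout=aiohttp.ClientTimeout(total=30)) as resp:
                if resp.status == 200:
                    return await resp.read()
        except Exception:
            pass
        return 'none'

async def fetch_all(urls, max_concurrency=20):
    semaphore = asyncio.Semaphore(max_concurrency)
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(
            *(fetch_with_limit(semaphore, session, url) for url in urls))

Note also that article.parse() is CPU-bound and runs inside the event loop, so for very large batches it may be worth offloading the parsing step to a thread or process pool.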