codelucas / newspaper

newspaper3k is a news, full-text, and article metadata extraction library in Python 3. Advanced docs:
https://goo.gl/VX41yK
MIT License
14.06k stars 2.11k forks

Multithread extraction seems to fail at the news_pool.join section #838

Open EllieLockhart opened 4 years ago

EllieLockhart commented 4 years ago

I'm working on a project to extract articles from gaming media sites, and I'm doing a basic test run. According to VSCode's debugger, it consistently hangs at the point where I've set up a multi-threaded extraction on two sites (changing the number of threads does not help). I'm honestly not sure what I'm doing wrong here; I followed the examples that have been laid out. One of the sites, Gamespot, is even used in someone's tutorial, and removing the other (Polygon) doesn't seem to help. I've created a virtual environment and tried this with both Python 3.8 and 3.7. All dependencies appear to be satisfied; I also tested it on repl.it and got the same hang.

I would love to hear I'm just doing something wrong so I can fix it; I really want to do some data science on these specific websites and their articles! But it seems as if, at least for an OS X user, there's some sort of bug with multithreading. Here's my code:

#import system functions
import sys
import requests
sys.path.append('/usr/local/lib/python3.8/site-packages/')
#import basic HTTP handling processes
#import urllib
#from urllib.request import urlopen
#import scraping libraries

#import newspaper and BS dependencies

from bs4 import BeautifulSoup
import newspaper
from newspaper import Article 
from newspaper import Source 
from newspaper import news_pool

#import broad data libraries
import pandas as pd

#import gaming related news sources as newspapers
gamespot = newspaper.build('https://www.gamespot.com/news', memoize_articles=False)
polygon = newspaper.build('https://www.polygon.com/gaming', memoize_articles=False)

#organize the gaming related news sources using a list
gamingPress = [gamespot, polygon]
print("About to set the pool.")
#parallel process these articles using multithreading (store in mem)
news_pool.set(gamingPress, threads_per_source=4)
print("Setting the pool")
news_pool.join()
print("Pool set")
#create the interim pandas dataframe based on these sources
final_df = pd.DataFrame()

#no limit is placed on the sources themselves; cap the number of articles parsed per source
limit = 10

for source in gamingPress:
    #these are temporary placeholder lists for elements to be extracted
    list_title = []
    list_text = []
    list_source = []

    count = 0

    for article_extract in source.articles:
        #stop once the per-source article limit is reached (check before parsing to avoid wasted work)
        if count >= limit:
            break

        article_extract.parse()

        list_title.append(article_extract.title)
        list_text.append(article_extract.text)
        list_source.append(article_extract.source_url)

        print(count)
        count += 1 #progress the loop via count

    temp_df = pd.DataFrame({'Title': list_title, 'Text': list_text, 'Source': list_source})
    #append this source's rows to the final DataFrame (DataFrame.append was removed in pandas 2.0)
    final_df = pd.concat([final_df, temp_df], ignore_index=True)

#export to CSV, placeholder for deeper analysis/more limited scope, may remain
final_df.to_csv('gaming_press.csv')

and here's what I get back when I finally give up and hit the interrupt at the console:


About to set the pool.
Setting the pool
^X^X^CTraceback (most recent call last):
  File "scraper1.py", line 31, in <module>
    news_pool.join()
  File "/usr/local/lib/python3.8/site-packages/newspaper3k-0.3.0-py3.8.egg/newspaper/mthreading.py", line 103, in join
    self.pool.wait_completion()
  File "/usr/local/lib/python3.8/site-packages/newspaper3k-0.3.0-py3.8.egg/newspaper/mthreading.py", line 63, in wait_completion
    self.tasks.join()
  File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/queue.py", line 89, in join
    self.all_tasks_done.wait()
  File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/threading.py", line 302, in wait
    waiter.acquire()
KeyboardInterrupt
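
In case it helps isolate this, here's a diagnostic sketch of my own (not from the newspaper docs) that skips news_pool entirely and downloads with the standard library's concurrent.futures; it assumes the same gamingPress list built above. If this finishes while news_pool.join() hangs, the problem is presumably in newspaper's thread pool rather than in the sites themselves:

#diagnostic sketch: download with concurrent.futures instead of news_pool
from concurrent.futures import ThreadPoolExecutor, as_completed

with ThreadPoolExecutor(max_workers=4) as executor:
    #submit one download per article across both sources
    futures = [executor.submit(article.download)
               for source in gamingPress
               for article in source.articles]
    for done, future in enumerate(as_completed(futures), start=1):
        future.result()  #re-raise any download error instead of hanging silently
        print(f"downloaded {done}/{len(futures)}")

After this loop each article's HTML is cached on the Article object, so the parse() loop above should work unchanged.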
EllieLockhart commented 4 years ago

Update on this: I tried two example scripts by other people that did the same thing with different news sources, and they also hang at news_pool.join(); the traceback was essentially the same. I'm rebuilding my entire local Python environment in case newspaper3k needs an older version of Python, but I'm not hopeful, since it failed in cloud environments too.
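
One thing I still want to rule out is a single stalled request, since news_pool.join() blocks until every worker finishes. Here's a hedged sketch of what I plan to try next, using newspaper's Config object and its request_timeout attribute (I'm assuming, but haven't verified, that the timeout applies on every code path the pool uses):

import newspaper
from newspaper import Config, news_pool

config = Config()
config.request_timeout = 10      #give up on a slow request after 10 seconds
config.memoize_articles = False  #same behavior as the memoize_articles=False kwarg above

gamespot = newspaper.build('https://www.gamespot.com/news', config=config)
polygon = newspaper.build('https://www.polygon.com/gaming', config=config)

news_pool.set([gamespot, polygon], threads_per_source=2)
news_pool.join()

If the lower timeout makes join() return (even with some failed downloads), that would point to a hung network request rather than a deadlock in the pool itself.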