flairNLP / fundus

A very simple news crawler with a funny name
MIT License

[Bug]: installing via pip runs into Runtime error (event loop already running) #436

Closed jannichorst closed 1 month ago

jannichorst commented 1 month ago

Describe the bug

After installing version 0.2.2 via pip install fundus, crawling anything fails with RuntimeError: There is already an event loop running. Installing directly from git resolves it: pip install -e git+https://github.com/flairNLP/fundus.git@ff54845f204d74c3572311ca030ddd0a93df09b6#egg=fundus

How to reproduce

from fundus import PublisherCollection, Crawler # initialize the crawler for Washington Times
crawler = Crawler(PublisherCollection.us.WashingtonTimes)
# crawl 2 articles and print
for article in crawler.crawl(max_articles=1): # print article overview
   print(article)
   # print only the title
   print(article.title)

Expected behavior.

Fundus-Article:

Logs and Stack traces

AssertionError                            Traceback (most recent call last)
File ~/Documents/Master/NLP/exercise-1-data-crawling-and-bow-classifier-jannichorst/.venv/lib/python3.8/site-packages/fundus/utils/more_async.py:49, in ManagedEventLoop.__enter__(self)
     48     asyncio.get_running_loop()
---> 49     raise AssertionError()
     50 except RuntimeError:

AssertionError: 

During handling of the above exception, another exception occurred:

RuntimeError                              Traceback (most recent call last)
Cell In[1], line 4
      2 crawler = Crawler(PublisherCollection.us.WashingtonTimes)
      3 # crawl 2 articles and print
----> 4 for article in crawler.crawl(max_articles=1): # print article overview
      5    print(article)
      6    # print only the title

File ~/Documents/Master/NLP/exercise-1-data-crawling-and-bow-classifier-jannichorst/.venv/lib/python3.8/site-packages/fundus/scraping/pipeline.py:204, in BaseCrawler.crawl(self, max_articles, error_handling, only_complete, delay, url_filter, only_unique)
    166 """Yields articles from initialized scrapers
    167 
    168 Args:
   (...)
    192     Iterator[Article]: An iterator yielding objects of type Article.
    193 """
    195 async_article_iter = self.crawl_async(
    196     max_articles=max_articles,
    197     error_handling=error_handling,
   (...)
    201     only_unique=only_unique,
    202 )
--> 204 with ManagedEventLoop() as runner:
    205     while True:
    206         try:

File ~/Documents/Master/NLP/exercise-1-data-crawling-and-bow-classifier-jannichorst/.venv/lib/python3.8/site-packages/fundus/utils/more_async.py:53, in ManagedEventLoop.__enter__(self)
     51     self.event_loop = asyncio.new_event_loop()
     52 except AssertionError:
---> 53     raise RuntimeError(
     54         "There is already an event loop running. If you want to crawl articles inside an "
     55         "async environment use crawl_async() instead."
     56     )
     57 return self.event_loop

RuntimeError: There is already an event loop running. If you want to crawl articles inside an async environment use crawl_async() instead.
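
The check visible in the traceback can be reproduced without Fundus. The following is a minimal sketch using only the standard library; the helper names `loop_is_running` and `check_inside_loop` are hypothetical and not part of Fundus. It relies on `asyncio.get_running_loop()` raising `RuntimeError` when no loop is running in the current thread, which is the behavior `ManagedEventLoop.__enter__` appears to build on.

```python
import asyncio


def loop_is_running() -> bool:
    """Return True if an event loop is already running in this thread."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No running loop: a synchronous wrapper could start its own.
        return False
    return True


async def check_inside_loop() -> bool:
    # Called from a coroutine, so a loop is necessarily running here.
    return loop_is_running()


if __name__ == "__main__":
    # Plain script: no loop has been started yet.
    print(loop_is_running())
    # Inside asyncio.run: the same check sees a running loop.
    print(asyncio.run(check_inside_loop()))
```

Running the same check in a notebook cell is a quick way to see whether the environment already owns an event loop.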

Environment

macOS Sonoma 14.3 (M1)
Python: 3.8.16

aiohttp==3.9.5
aioitertools==0.11.0
aiosignal==1.3.1
appnope==0.1.4
asttokens==2.4.1
async-timeout==4.0.3
attrs==23.2.0
backcall==0.2.0
Brotli==1.1.0
certifi==2024.2.2
chardet==5.2.0
charset-normalizer==3.3.2
click==8.1.7
colorama==0.4.6
comm==0.2.2
cssselect==1.2.0
debugpy==1.8.1
decorator==5.1.1
dill==0.3.8
executing==2.0.1
FastWARC==0.14.6
feedparser==6.0.11
frozenlist==1.4.1
fundus==0.2.2
idna==3.7
importlib_metadata==7.1.0
ipykernel==6.29.4
ipython==8.12.3
jedi==0.19.1
jupyter_client==8.6.1
jupyter_core==5.7.2
langdetect==1.0.9
lxml==4.9.4
matplotlib-inline==0.1.7
more-itertools==9.1.0
multidict==6.0.5
nest-asyncio==1.6.0
packaging==24.0
parso==0.8.4
pexpect==4.9.0
pickleshare==0.7.5
platformdirs==4.2.0
prompt-toolkit==3.0.43
psutil==5.9.8
ptyprocess==0.7.0
pure-eval==0.2.2
Pygments==2.17.2
python-dateutil==2.9.0.post0
pyzmq==26.0.2
requests==2.31.0
sgmllib3k==1.0.0
six==1.16.0
stack-data==0.6.3
tornado==6.4
tqdm==4.66.2
traitlets==5.14.3
typing_extensions==4.11.0
urllib3==2.2.1
validators==0.28.1
wcwidth==0.2.13
yarl==1.9.4
zipp==3.18.1

MaxDall commented 1 month ago

Hey @jannichorst,

It seems that you're using Fundus in an async context, most likely Google Colab? If not, please let me know and I will investigate further. Fundus 0.2.2 uses asyncio, and because of asyncio's limitations, crawl won't work when an event loop is already running. We recently got rid of Fundus' async logic (#357), but a new release is yet to come. You can either check out the latest master branch (as you already mentioned :) ) or use Fundus' async interface (see also #344):

from fundus import Crawler, PublisherCollection

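# initialize the crawler for Washington Times
crawler = Crawler(PublisherCollection.us.WashingtonTimes)
# top-level "async for" needs an already running event loop, e.g. a notebook cell
async for article in crawler.crawl_async(max_articles=10):
    print(article)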

Thanks for reporting this anyway :)
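
The asyncio limitation described above can be illustrated without Fundus. This is a sketch under stated assumptions: `fetch_titles`, `crawl_sync` and `notebook_like_context` are hypothetical stand-ins, not Fundus functions, and the dummy generator only mimics an async article source. It shows a synchronous wrapper failing once a loop is already running, while plain `async for` iteration keeps working because it reuses that loop.

```python
import asyncio


async def fetch_titles(limit):
    # Hypothetical stand-in for an async article source; yields dummy titles.
    for i in range(limit):
        await asyncio.sleep(0)
        yield f"article-{i}"


def crawl_sync(limit):
    """Synchronous wrapper that drives the async generator on its own loop."""

    async def collect():
        return [title async for title in fetch_titles(limit)]

    coro = collect()
    try:
        return asyncio.run(coro)
    except RuntimeError:
        # Close the coroutine so Python does not warn that it was never awaited.
        coro.close()
        raise


async def notebook_like_context():
    # A loop is already running here, similar to a notebook cell.
    try:
        crawl_sync(2)
    except RuntimeError as err:
        outcome = f"sync wrapper failed: {type(err).__name__}"
    else:
        outcome = "sync wrapper worked"
    # Iterating the async generator directly reuses the running loop.
    titles = [title async for title in fetch_titles(2)]
    return outcome, titles


if __name__ == "__main__":
    print(crawl_sync(3))
    print(asyncio.run(notebook_like_context()))
```

Called from a plain script, `crawl_sync` works; called from inside a coroutine, it raises `RuntimeError`, which mirrors the difference between running the reproduction code in a terminal and in a notebook.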

MaxDall commented 1 month ago

I released version 0.3.0 to PyPI. You should now be able to install Fundus from PyPI and run it within an asynchronous context again.

jannichorst commented 1 month ago

Thanks @MaxDall! I was working in a notebook in VS Code. I reported it because it took me far too long to work out why the exact same code ran in one project but not in the other: the difference was the version installed from PyPI. I assume others might run into the same problem. Thanks for reacting so quickly. I will check out the new version shortly.

PS: I tried crawl_async under 0.2.2 and it ran into issues as well.
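
Since the root cause here was a difference in installed versions between two projects, a quick check of what is actually installed in the active environment may save time. The snippet below is a rough sketch using the standard library's `importlib.metadata` (available from Python 3.8); the helpers `parse_version`, `installed_version` and `is_at_least` are hypothetical names, and the version parsing is deliberately crude (numeric components only, suffixes such as `post0` are ignored).

```python
from importlib import metadata


def parse_version(text, width=3):
    """Convert a version string such as '0.2.2' into a tuple of ints."""
    numbers = []
    for piece in text.split("."):
        if not piece.isdigit():
            break  # stop at suffixes such as 'post0' or 'dev1'
        numbers.append(int(piece))
    # Pad short versions so that '0.3' compares equal to '0.3.0'.
    numbers += [0] * (width - len(numbers))
    return tuple(numbers)


def installed_version(package):
    """Return the installed version of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def is_at_least(package, minimum):
    """Check whether the installed package meets a minimum version."""
    found = installed_version(package)
    return found is not None and parse_version(found) >= parse_version(minimum)


if __name__ == "__main__":
    print("fundus installed:", installed_version("fundus"))
    print("fundus >= 0.3.0:", is_at_least("fundus", "0.3.0"))
```

Running this in both projects would show directly whether one environment still has 0.2.2 while the other has a newer build.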