flairNLP / fundus

A very simple news crawler with a funny name
MIT License

[Bug]: Fundus not installing on Google Colab #344

Closed alanakbik closed 3 months ago

alanakbik commented 5 months ago

Describe the bug

I tried running the tutorial code in a fresh colab environment, but when running

pip install fundus

the installation fails with the output

google-colab 1.0.0 requires requests==2.31.0, but you have requests 2.28.2 which is incompatible.
yfinance 0.2.36 requires requests>=2.31, but you have requests 2.28.2 which is incompatible. 
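A possible interim workaround, assuming the conflict comes only from Fundus pinning requests too strictly (the fix further down this thread relaxed exactly these restrictions), is to upgrade requests back after installing Fundus:

pip install fundus
pip install --upgrade "requests>=2.31"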

How to reproduce

pip install fundus

from fundus import PublisherCollection, Crawler

# initialize the crawler for news publishers based in the US
crawler = Crawler(PublisherCollection.us)

# crawl 2 articles and print
for article in crawler.crawl(max_articles=2):
    print(article)

Expected behavior

It installs correctly and I can run all tutorials on Google Colab.

Logs and Stack traces

No response

Screenshots

No response

Additional Context

No response

Environment

Python 3.10.12, CPU-only, Google Colab
MaxDall commented 5 months ago

@alanakbik Thanks for reporting this.

Sadly, I couldn't reproduce the exact error. I get the same message, but Fundus seems to be installed anyway.

Just to be clear: could you run this code snippet after a fresh `pip install fundus`?

from fundus import Crawler, PublisherCollection

async def crawl():
  crawler = Crawler(*PublisherCollection.de)
  async for article in crawler.crawl_async(max_articles=10, only_complete=False):
    print(article)

await crawl()

While investigating this I found several other problems regarding notebooks.

  1. Most of Fundus won't run in the async environment of Jupyter notebooks. (You can get the main crawler running by using the crawler's async API, but not `CCNewsCrawler`.) -> I will hopefully get CCNews to run in a notebook with a PR today. See the sketch after this list for why the synchronous entry point fails there.
  2. `crawl_async`'s default parameter for `only_complete` is bugged and won't crawl.
  3. The Independent is blocking the complete US pipeline -> I will fix this in a PR today.
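
A minimal sketch (plain asyncio, not Fundus code) of why the synchronous entry point fails there: notebook kernels such as Jupyter and Colab already run an event loop in the cell's thread, and asyncio.run() refuses to start a second one.

import asyncio

async def dummy_crawl() -> str:
    # Stand-in for a crawling coroutine; the body is irrelevant here.
    return "article"

try:
    asyncio.get_running_loop()
    loop_running = True  # the notebook/Colab case
except RuntimeError:
    loop_running = False  # the plain-script case

if loop_running:
    # asyncio.run(dummy_crawl()) would raise here:
    # RuntimeError: asyncio.run() cannot be called from a running event loop
    print("Event loop already running; use the async API with `await` instead.")
else:
    print(asyncio.run(dummy_crawl()))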

Update: You don't have to wrap this within an async function. Colab seems to be fine with you doing this:

from fundus import Crawler, PublisherCollection

crawler = Crawler(*PublisherCollection.de)
async for article in crawler.crawl_async(max_articles=10):
  print(article)

Update: I posted the wrong script and updated it.
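
For completeness, outside a notebook there is no running event loop, so the same crawl has to be driven explicitly; a sketch using the same API as above:

import asyncio

from fundus import Crawler, PublisherCollection

async def main():
    crawler = Crawler(*PublisherCollection.de)
    async for article in crawler.crawl_async(max_articles=10):
        print(article)

# A plain script has no running loop, so we start one ourselves.
asyncio.run(main())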

alanakbik commented 5 months ago

Yes, I can confirm that your snippet runs for me!

MaxDall commented 5 months ago

@alanakbik We changed the version restrictions for our dependencies, so installing Fundus within Google Colab should no longer yield an error. The new release may take a while, but the changes are already live on master, so pip install git+https://github.com/flairNLP/fundus should do the trick for now.

alanakbik commented 5 months ago

Thanks, it now installs without a problem!

But both of these snippets still don't work for me:

from fundus import CCNewsCrawler, PublisherCollection

crawler = CCNewsCrawler(*PublisherCollection.de)
for article in crawler.crawl(max_articles=10, only_complete=False):
  print(article)

and

from fundus import PublisherCollection, Crawler

# initialize the crawler for news publishers based in the US
crawler = Crawler(PublisherCollection.us)

# crawl 2 articles and print
for article in crawler.crawl(max_articles=2):
    print(article)

but I think these are handled in different PRs?

MaxDall commented 5 months ago

Regarding the snippets:

  1. I don't quite understand the error Colab is throwing at me, nor why the snippet runs fine in pure Python but not within Colab.
  2. That's expected. Colab is an asynchronous program and thus utilizes a running event loop. To use Fundus within an asynchronous environment you have to stick to `crawl_async`; see the corrected snippet after this list. I updated the code snippet above. Unfortunately, I originally posted the wrong one (the one referencing the `CCNewsCrawler`).
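
As a sketch, snippet 2 rewritten for Colab's running event loop, sticking to the crawl_async API shown earlier (Colab allows top-level await):

from fundus import PublisherCollection, Crawler

# initialize the crawler for news publishers based in the US
crawler = Crawler(PublisherCollection.us)

# crawl 2 articles and print, via the async API instead of crawl()
async for article in crawler.crawl_async(max_articles=2):
    print(article)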
For reference, the traceback from snippet 1:

```python
---------------------------------------------------------------------------
RemoteTraceback                           Traceback (most recent call last)
RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/usr/lib/python3.10/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
  File "/usr/local/lib/python3.10/dist-packages/fundus/scraping/common_crawl/pipeline.py", line 59, in __call__
    return self._deserialize()(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/fundus/scraping/common_crawl/pipeline.py", line 56, in _deserialize
    return dill.loads(self._serialized_target)
  File "/usr/local/lib/python3.10/dist-packages/dill/_dill.py", line 303, in loads
    return load(file, ignore, **kwds)
  File "/usr/local/lib/python3.10/dist-packages/dill/_dill.py", line 289, in load
    return Unpickler(file, ignore=ignore, **kwds).load()
  File "/usr/local/lib/python3.10/dist-packages/dill/_dill.py", line 444, in load
    obj = StockUnpickler.load(self)
  File "/usr/local/lib/python3.10/dist-packages/dill/_dill.py", line 593, in _create_type
    return typeobj(*args)
  File "/usr/lib/python3.10/typing.py", line 348, in __init_subclass__
    raise TypeError("Cannot subclass special typing classes")
TypeError: Cannot subclass special typing classes
"""

The above exception was the direct cause of the following exception:

TypeError                                 Traceback (most recent call last)
in ()
      3 if __name__ == '__main__':
      4     crawler = CCNewsCrawler(*PublisherCollection.de)
----> 5     for article in crawler.crawl(max_articles=10, only_complete=False):
      6         print(article)

/usr/lib/python3.10/multiprocessing/pool.py in get(self, timeout)
    772             return self._value
    773         else:
--> 774             raise self._value
    775
    776     def _set(self, i, obj):

TypeError: Cannot subclass special typing classes
```