flairNLP / fundus

A very simple news crawler with a funny name
MIT License

Cannot run the pip installation #461

Closed BanoMarvey closed 6 months ago

BanoMarvey commented 6 months ago

Describe the bug

When running pip install -e .[dev], I get the error message zsh: no matches found: .[dev]. I do not know whether this is one of the reasons my code is not working, but I have also run into a second problem. I am trying to add DailyMail as a new UK publisher, and I added it to the PublisherEnum class like so

    DailyMail = PublisherSpec(
        name="Daily Mail",
        domain="https://www.dailymail.co.uk/",
        sources=[
            Sitemap("https://www.dailymail.co.uk/google-news-sitemap.xml"),
            NewsMap("https://www.dailymail.co.uk/google-news-sitemap1.xml"),
        ],
        parser=DailyMailParser,
    )

I went on to reproduce the code from step 4 of the tutorial, namely

from fundus import PublisherCollection, Crawler

publisher = PublisherCollection.uk.DailyMail

crawler = Crawler(publisher)

for article in crawler.crawl(max_articles=2, only_complete=False):
    print(article)

And I get the following error message

Traceback (most recent call last):
  File "/fundus/src/fundus/publishers/uk/daily_mail.py", line 22, in <module>
    publisher = PublisherCollection.uk.DailyMail
  File "/fundus/.conda/lib/python3.9/enum.py", line 429, in __getattr__
    raise AttributeError(name) from None
AttributeError: DailyMail

I have tried many different things, and I feel I am missing something important that the system needs in order to recognize a newly added publisher: when I run the same code snippet with one of the existing publishers instead, it works correctly.

Thanks in advance!

How to reproduce

pip install -e .[dev]

Expected behavior.

Installation of Fundus in editable mode and for the system to function as expected when adding a new publisher.

Logs and Stack traces

No response

Screenshots

No response

Additional Context

No response

Environment

MacBook Pro, Apple M1 Pro, macOS 14.2.1 (Sonoma)

MaxDall commented 6 months ago

Hey @BanoMarvey, sorry to hear that you're having trouble using the library. It looks like you haven't installed Fundus in editable mode. You can check by running pip list: if a path is listed after the package version, the package is installed in editable mode (-e). This is necessary in order to work on the package.

Did you clone the repository, and if so, in which directory did you execute pip install -e .[dev]?

Could you post all the steps you took from cloning the repository to installing the package?

BanoMarvey commented 6 months ago

I simply cloned it using VS Code and the link provided on the GitHub page (there is an extension for this in VS Code). Afterwards I activated the conda environment with Python 3.9. I tried the pip installation in the root directory, which did not work, but I still tried to add the newspaper anyway, hoping that it would work. So now I am stuck with the newspaper not being recognized and the installation not working correctly.

MaxDall commented 6 months ago

@BanoMarvey Maybe this can help you.
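
For context on the zsh: no matches found: .[dev] message from the original report: in zsh, square brackets are glob characters, so an unquoted .[dev] is treated as a filename pattern rather than passed to pip literally. The sketch below is only an illustration and uses Python's fnmatch module as a stand-in for shell pattern matching (it is not zsh itself). It shows what the pattern would match and how quoting the argument keeps it intact.

```python
import fnmatch
import shlex

requirement = ".[dev]"

# As a glob, "[dev]" is a character class, so the pattern means
# "a dot followed by exactly one of d, e or v".
matches_dot_d = fnmatch.fnmatchcase(".d", requirement)
matches_itself = fnmatch.fnmatchcase(".[dev]", requirement)
print(matches_dot_d)   # True
print(matches_itself)  # False

# Quoting the argument stops the shell from expanding it.
quoted = shlex.quote(requirement)
print("pip install -e " + quoted)  # pip install -e '.[dev]'
```

With no file in the directory matching the pattern, zsh reports an error by default instead of running the command, which is why quoting the extras specifier is the usual workaround.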

BanoMarvey commented 6 months ago

So I followed the command from the other issue and the installation seems to be okay. When running pip list, I can see a path, which shows that fundus has been installed correctly in editable mode. Still, when I run my script after the pip installation has completed, I get the following error message

line 1, in <module>
    from fundus.parser import ParserProxy, BaseParser
ModuleNotFoundError: No module named 'fundus'
(.conda) (base) m@MacBook-Pro fundus % 

Should I still run pip install fundus, even though the earlier installation was successful?

MaxDall commented 6 months ago

@BanoMarvey No, if pip install -e .[dev] ran correctly, the package should already be installed. Could you paste the result of pip list here?

Also, this line

(.conda) (base) m@MacBook-Pro fundus % 

looks kinda weird to me. I might be missing something, but it looks like you're using both a conda environment and a venv at the same time. If so, I'm pretty sure that's what's causing your problems.
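
A quick way to see which environment is really in use is to ask the interpreter itself. This is a minimal sketch, not part of Fundus: the helper name locate is made up for illustration, and it only reports where a package would be imported from without importing it.

```python
import importlib.util
import sys


def locate(package_name):
    """Return the file a top-level package would be loaded from, or None."""
    spec = importlib.util.find_spec(package_name)
    return None if spec is None else spec.origin


# The interpreter running this script and the environment it belongs to.
print("interpreter:", sys.executable)
print("prefix:     ", sys.prefix)

# None here means this interpreter cannot see the package at all, which
# would explain a ModuleNotFoundError even though pip list shows it.
print("fundus:     ", locate("fundus"))
```

If the printed interpreter path is not inside the environment where pip install -e was run, pip and python are pointing at different environments.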

BanoMarvey commented 6 months ago

After running pip list, this is what I got:

(.conda) (base) marvey@Marveys-MacBook-Pro fundus % pip list

Package               Version         Editable project location
--------------------- --------------- ----------------------------
attrs                 23.2.0
black                 23.1.0
Brotli                1.1.0
certifi               2024.2.2
chardet               5.2.0
charset-normalizer    3.3.2
click                 8.1.7
colorama              0.4.6
cssselect             1.2.0
dill                  0.3.8
exceptiongroup        1.2.1
FastWARC              0.14.6
feedparser            6.0.11
fundus                0.3.0           /Users/marvey/Desktop/fundus
idna                  3.7
iniconfig             2.0.0
isort                 5.12.0
langdetect            1.0.9
lxml                  4.9.4
more-itertools        9.1.0
mypy                  1.9.0
mypy-extensions       1.0.0
packaging             24.0
pathspec              0.12.1
pip                   23.3.1
platformdirs          4.2.1
pluggy                1.5.0
pytest                7.2.2
python-dateutil       2.9.0.post0
requests              2.31.0
setuptools            68.2.2
sgmllib3k             1.0.0
six                   1.16.0
tomli                 2.0.1
tqdm                  4.66.2
types-beautifulsoup4  4.12.0.20240229
types-colorama        0.4.15.20240311
types-html5lib        1.1.11.20240228
types-lxml            2024.4.14
types-python-dateutil 2.9.0.20240316
types-requests        2.31.0.20240406
typing_extensions     4.11.0
urllib3               2.2.1
validators            0.28.1
wheel                 0.41.2

I will try creating a new virtual environment either with venv or with conda and I will see if that fixes anything.

BanoMarvey commented 6 months ago

Okay, so after cloning the repository from scratch and creating a fresh virtual environment with conda, I installed the package in editable mode and got the following error when trying the implementation step.

(nlpcourse) marvey@Marveys-MacBook-Pro uk % python3 daily_mail.py

Traceback (most recent call last):
  File "/Users/marvey/fundus/src/fundus/publishers/uk/daily_mail.py", line 6, in <module>
    from fundus.parser import ArticleBody, BaseParser, ParserProxy, attribute
  File "/Users/marvey/fundus/src/fundus/__init__.py", line 3, in <module>
    from fundus.publishers import PublisherCollection
  File "/Users/marvey/fundus/src/fundus/publishers/__init__.py", line 8, in <module>
    from fundus.publishers.uk import UK
  File "/Users/marvey/fundus/src/fundus/publishers/uk/__init__.py", line 11, in <module>
    from .daily_mail import DailyMailParser
  File "/Users/marvey/fundus/src/fundus/publishers/uk/daily_mail.py", line 20, in <module>
    from fundus import PublisherCollection, Crawler
ImportError: cannot import name 'PublisherCollection' from partially initialized module 'fundus' (most likely due to a circular import) (/Users/marvey/fundus/src/fundus/__init__.py)
(nlpcourse) marvey@Marveys-MacBook-Pro uk % 

The pip list command outputs the following

(nlpcourse) marvey@Marveys-MacBook-Pro uk % pip list
Package               Version         Editable project location
--------------------- --------------- -------------------------
attrs                 23.2.0
black                 23.1.0
Brotli                1.1.0
certifi               2024.2.2
chardet               5.2.0
charset-normalizer    3.3.2
click                 8.1.7
colorama              0.4.6
cssselect             1.2.0
dill                  0.3.8
exceptiongroup        1.2.1
FastWARC              0.14.6
feedparser            6.0.11
fundus                0.3.0           /Users/marvey/fundus
idna                  3.7
iniconfig             2.0.0
isort                 5.12.0
langdetect            1.0.9
lxml                  4.9.4
more-itertools        9.1.0
mypy                  1.9.0
mypy-extensions       1.0.0
packaging             24.0
pathspec              0.12.1
pip                   23.3.1
platformdirs          4.2.1
pluggy                1.5.0
pytest                7.2.2
python-dateutil       2.9.0.post0
requests              2.31.0
setuptools            68.2.2
sgmllib3k             1.0.0
six                   1.16.0
tomli                 2.0.1
tqdm                  4.66.2
types-beautifulsoup4  4.12.0.20240229
types-colorama        0.4.15.20240311
types-html5lib        1.1.11.20240228
types-lxml            2024.4.14
types-python-dateutil 2.9.0.20240316
types-requests        2.31.0.20240406
typing_extensions     4.11.0
urllib3               2.2.1
validators            0.28.1

Is there anything else I can try?

MaxDall commented 6 months ago

@BanoMarvey Yes, remove the line from fundus import Pub... from daily_mail.py. That's a circular import, as stated in the traceback. If you wanna test your progress, the best practice is to use your own script at the repository root.
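
The mechanism behind that traceback can be reproduced in isolation. The sketch below builds a throwaway package in a temporary directory. All names in it (demo_pkg, child, Parser, Collection) are hypothetical and only mirror the shape of the problem: the package's __init__.py imports a submodule, and that submodule imports a name back from the package before the package has finished initializing.

```python
import importlib
import sys
import tempfile
from pathlib import Path

message = None
with tempfile.TemporaryDirectory() as tmp:
    pkg = Path(tmp) / "demo_pkg"
    pkg.mkdir()
    # demo_pkg/__init__.py imports the submodule before defining Collection ...
    (pkg / "__init__.py").write_text(
        "from demo_pkg.child import Parser\nCollection = [Parser]\n"
    )
    # ... and the submodule asks demo_pkg for Collection while it is still loading.
    (pkg / "child.py").write_text(
        "from demo_pkg import Collection\n\nclass Parser:\n    pass\n"
    )

    sys.path.insert(0, tmp)
    try:
        importlib.import_module("demo_pkg")
    except ImportError as err:
        message = str(err)
    finally:
        sys.path.remove(tmp)

# The message names the symbol that could not be imported yet.
print(message)
```

Removing the import back into the package from child.py, and moving any test code into a separate script that imports the finished package, breaks the cycle.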

BanoMarvey commented 6 months ago

Ah yes, thank you. I was so focused on the problem with the packages I forgot that the implementation was wrong. It’s all cleared up now.