flairNLP / fundus

A very simple news crawler with a funny name

[Bug]: url_filter in PublisherSpec not filtering #465

Closed: Benjamin2107 closed this issue 2 weeks ago

Benjamin2107 commented 3 weeks ago

Describe the bug

While working on #464, I had trouble getting a regex-based url_filter in PublisherSpec to filter URLs.

All unit tests pass, but when I ran the crawler myself I noticed that videos and slideshows from my selected newspaper are not filtered out.

Is this a bug or is this my fault?

How to reproduce

from fundus import PublisherCollection, Crawler

publisher = PublisherCollection.de.Kicker

crawler = Crawler(publisher)

for article in crawler.crawl(max_articles=20, only_complete=False):
    print(article)

Expected behavior

Only "*/article*" URLs should be shown. Instead, the output also includes URLs containing "*/video*" or "*/slideshow*" (depending on whether the 20 most recent news items include any videos or slideshows).
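The expected filtering can be illustrated without fundus at all. The snippet below is only a minimal stand-alone sketch: the helper names `make_regex_filter` and `apply_filter`, the example URLs, and the convention that a filter returns True for URLs that should be dropped are all assumptions made for illustration, not the fundus API.

```python
import re
from typing import Callable, List

# Assumption: a URL filter is a callable that returns True when the URL
# should be dropped. This convention is chosen for the sketch and is not
# taken from the fundus source.
URLFilter = Callable[[str], bool]


def make_regex_filter(pattern: str) -> URLFilter:
    """Hypothetical helper: reject every URL in which `pattern` is found."""
    compiled = re.compile(pattern)
    return lambda url: compiled.search(url) is not None


def apply_filter(urls: List[str], url_filter: URLFilter) -> List[str]:
    """Keep only the URLs that the filter does not reject."""
    return [url for url in urls if not url_filter(url)]


# Made-up example URLs, not real Kicker links.
sample_urls = [
    "https://example.com/some-match/article",
    "https://example.com/highlights/video",
    "https://example.com/best-of/slideshow",
]

# Reject anything whose path contains /video or /slideshow.
drop_media = make_regex_filter(r"/(video|slideshow)")
kept = apply_filter(sample_urls, drop_media)
print(kept)
```

With a filter like this in place, only the "/article" URL should remain; if video or slideshow URLs still show up in the crawler output, the filter is either not being applied or its pattern does not match.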

Logs and Stack traces

No response

Screenshots

No response

Additional Context

No response

Environment

OS: Windows 11
Fundus master branch + my new feature (See #464)
Python 3.9

Installed packages:
attrs==23.2.0
black==23.1.0
Brotli==1.1.0
certifi==2024.2.2
chardet==5.2.0
charset-normalizer==3.3.2
click==8.1.7
colorama==0.4.6
cssselect==1.2.0
dill==0.3.8
exceptiongroup==1.2.1
FastWARC==0.14.6
feedparser==6.0.11
-e (fundus)
idna==3.7
iniconfig==2.0.0
isort==5.12.0
langdetect==1.0.9
lxml==4.9.4
more-itertools==9.1.0
mypy==1.9.0
mypy-extensions==1.0.0
packaging==24.0
pathspec==0.12.1
platformdirs==4.2.0
pluggy==1.5.0
pytest==7.2.2
python-dateutil==2.9.0.post0
requests==2.31.0
tqdm==4.66.2
types-beautifulsoup4==4.12.0.20240229
types-colorama==0.4.15.20240311
types-html5lib==1.1.11.20240228
types-lxml==2024.4.14
types-python-dateutil==2.9.0.20240316
types-requests==2.31.0.20240406
typing_extensions==4.11.0
urllib3==2.2.1
validators==0.28.1

MaxDall commented 3 weeks ago

@Benjamin2107 When I run the code snippet above, the URL filter specified in the Kicker publisher enum works as intended, but that was fixed only recently with #459. Could you confirm whether this is also the case for you?

You can enable debug logging messages with:

import logging
from fundus.logging import set_log_level

set_log_level(logging.DEBUG)

Benjamin2107 commented 2 weeks ago

Yes, it is working now. Thanks :)