flairNLP / fundus

A very simple news crawler with a funny name
MIT License

Extends the publisher collection to allow source classes as input #201

Closed Weyaaron closed 11 months ago

Weyaaron commented 1 year ago

This helps with #178: the Sitemap can now be constructed inside the PublisherSpec and use the reverse flag, which improves the parsing behavior for Occupy Democrats.

Update (@MaxDall): I changed a few things, the biggest one being that I moved the iteration logic of the former Sources to the newly added URLSource class. I did this so that the responsibility for URLs is decoupled from the Source itself. This now allows us to use custom sitemap filters to solve problems like #178.

E.g. setting up a URLSource like this in the publisher spec for #178:

from fundus import Sitemap
from fundus.scraping.filter import regex_filter

source = Sitemap(url="https://occupydemocrats.com/sitemap.xml", sitemap_filter=regex_filter(r"-tax-|-misc"))

for url in source:
    print(url)

yields the first article URL after 0.35 seconds, whereas the former URL filter took more than 5 minutes (that's where I stopped) while fetching countless unusable sites.
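For context, a sitemap filter of this kind boils down to a predicate over URLs. The sketch below is a hypothetical reimplementation of such a filter, not fundus' actual `regex_filter` (which may differ in details); it assumes that a match means the URL is skipped:

```python
import re
from typing import Callable

# Hypothetical sketch of a regex-based URL filter; fundus' real
# regex_filter may differ. The returned callable answers True for
# URLs that match the pattern and should therefore be skipped.
def regex_filter(pattern: str) -> Callable[[str], bool]:
    compiled = re.compile(pattern)

    def url_filter(url: str) -> bool:
        # True -> URL matches the pattern and gets filtered out
        return bool(compiled.search(url))

    return url_filter

skip = regex_filter(r"-tax-|-misc")
print(skip("https://occupydemocrats.com/sitemap-misc.xml"))   # True: skipped
print(skip("https://occupydemocrats.com/post-sitemap1.xml"))  # False: kept
```

Skipping whole sitemap branches up front is what makes the difference here: unusable URLs are never fetched at all, instead of being downloaded and discarded one by one.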

MaxDall commented 1 year ago

Hey, thanks for adding this. There is definitely a need to rework sources in the PublisherSpec.

That said, I think it would fit Fundus better to turn the implicit definitions of URLs in the publisher specification into explicit ones. What I would like to have is something like this: set up a class structure to explicitly define types for source URLs


from dataclasses import dataclass

@dataclass
class SourceUrl:
    url: str

@dataclass
class RSSFeed(SourceUrl):
    pass

@dataclass
class Sitemap(SourceUrl):
    recursive: bool = True
    reverse: bool = False

@dataclass
class NewsMap(Sitemap):
    pass
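One benefit of this hierarchy is that downstream code can dispatch on the source type. A minimal self-contained sketch (the `describe` helper is illustrative, not fundus API):

```python
from dataclasses import dataclass

@dataclass
class SourceUrl:
    url: str

@dataclass
class RSSFeed(SourceUrl):
    pass

@dataclass
class Sitemap(SourceUrl):
    recursive: bool = True
    reverse: bool = False

@dataclass
class NewsMap(Sitemap):
    pass

def describe(source: SourceUrl) -> str:
    # Illustrative dispatch on the proposed source types; checks go
    # from most to least specific since NewsMap subclasses Sitemap.
    if isinstance(source, NewsMap):
        return f"news map: {source.url}"
    if isinstance(source, Sitemap):
        return f"sitemap (recursive={source.recursive}): {source.url}"
    if isinstance(source, RSSFeed):
        return f"rss feed: {source.url}"
    return f"unknown source: {source.url}"

print(describe(NewsMap("https://example.com/news.xml")))  # news map: ...
```

Because NewsMap subclasses Sitemap, it inherits the recursive/reverse flags for free while still being distinguishable by type.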

Simplify PublisherSpec

@dataclass(frozen=True)
class PublisherSpec:
    domain: str
    parser: Type[BaseParser]
    sources: List[SourceUrl]
    article_classifier: Optional[ArticleClassifier] = field(default=None)

and then propagate the changes through the code base. I.e. a specific enum entry would then look like this:

    #before
    DieWelt = PublisherSpec(
        domain="https://www.welt.de/",
        rss_feeds=["https://www.welt.de/feeds/latest.rss"],
        sitemaps=["https://www.welt.de/sitemaps/sitemap/sitemap.xml"],
        news_map="https://www.welt.de/sitemaps/newssitemap/newssitemap.xml",
        parser=DieWeltParser,
    )

    #after
    DieWelt = PublisherSpec(
        domain="https://www.welt.de/",
        sources=[RSSFeed("https://www.welt.de/feeds/latest.rss"),
                 Sitemap("https://www.welt.de/sitemaps/sitemap/sitemap.xml"),
                 NewsMap("https://www.welt.de/sitemaps/newssitemap/newssitemap.xml")],
        parser=DieWeltParser,
    )

This would make dealing with sources for publishers very easy and straightforward.
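For instance, selecting all sources of one kind from a spec becomes a simple isinstance check. A self-contained sketch of the idea (parser and article_classifier fields omitted for brevity):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class SourceUrl:
    url: str

@dataclass
class RSSFeed(SourceUrl):
    pass

@dataclass
class Sitemap(SourceUrl):
    recursive: bool = True
    reverse: bool = False

# Trimmed-down PublisherSpec; the real one also carries parser etc.
@dataclass(frozen=True)
class PublisherSpec:
    domain: str
    sources: List[SourceUrl]

spec = PublisherSpec(
    domain="https://www.welt.de/",
    sources=[RSSFeed("https://www.welt.de/feeds/latest.rss"),
             Sitemap("https://www.welt.de/sitemaps/sitemap/sitemap.xml")],
)

# One homogeneous list instead of separate rss_feeds/sitemaps/news_map fields:
rss_feeds = [s for s in spec.sources if isinstance(s, RSSFeed)]
print([s.url for s in rss_feeds])
```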

Weyaaron commented 1 year ago

Will be done.

Weyaaron commented 1 year ago

Has been done, but I don't understand what mypy is complaining about. @MaxDall You may take a look.

MaxDall commented 1 year ago

I finished this one, so maybe @dobbersc should take a look, since both @Weyaaron and I have already worked on it.

dobbersc commented 1 year ago

Sure, I'll review this one.