Closed Weyaaron closed 11 months ago
Hey, thanks for adding this. There is definitely a need to rework sources in the PublisherSpec.
That said, I think it would fit Fundus better to unify the implicit definitions of URLs in the publisher specification into an explicit one. What I would like to have is something like this:
Set up a class structure to explicitly define types for source URLs
@dataclass
class SourceUrl:
    url: str

@dataclass
class RSSFeed(SourceUrl):
    pass

@dataclass
class Sitemap(SourceUrl):
    recursive: bool = True
    reverse: bool = False

@dataclass
class NewsMap(Sitemap):
    pass
Simplify PublisherSpec
@dataclass(frozen=True)
class PublisherSpec:
    domain: str
    parser: Type[BaseParser]
    sources: List[SourceUrl]
    article_classifier: Optional[ArticleClassifier] = field(default=None)
and then propagate the changes through the code base. I.e., a specific enum entry would then look like this:
# before
DieWelt = PublisherSpec(
    domain="https://www.welt.de/",
    rss_feeds=["https://www.welt.de/feeds/latest.rss"],
    sitemaps=["https://www.welt.de/sitemaps/sitemap/sitemap.xml"],
    news_map="https://www.welt.de/sitemaps/newssitemap/newssitemap.xml",
    parser=DieWeltParser,
)

# after
DieWelt = PublisherSpec(
    domain="https://www.welt.de/",
    sources=[
        RSSFeed("https://www.welt.de/feeds/latest.rss"),
        # same sitemap URL as in the "before" entry
        Sitemap("https://www.welt.de/sitemaps/sitemap/sitemap.xml"),
        NewsMap("https://www.welt.de/sitemaps/newssitemap/newssitemap.xml"),
    ],
    parser=DieWeltParser,
)
This would make dealing with sources for publishers very easy and straightforward.
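To make the proposal a bit more concrete, here is a minimal, self-contained sketch of how downstream code might pick sources by type once they live in a single list. It leaves out `parser` and `article_classifier` so that it runs on its own, and `sources_of_type` is a hypothetical helper for illustration, not something that exists in Fundus.

```python
from dataclasses import dataclass, field
from typing import List, Type, TypeVar


@dataclass
class SourceUrl:
    url: str


@dataclass
class RSSFeed(SourceUrl):
    pass


@dataclass
class Sitemap(SourceUrl):
    recursive: bool = True
    reverse: bool = False


@dataclass
class NewsMap(Sitemap):
    pass


@dataclass(frozen=True)
class PublisherSpec:
    # Simplified for this sketch: parser and article_classifier are omitted.
    domain: str
    sources: List[SourceUrl] = field(default_factory=list)


S = TypeVar("S", bound=SourceUrl)


def sources_of_type(spec: PublisherSpec, kind: Type[S], exact: bool = False) -> List[S]:
    """Hypothetical helper: select the sources of a given type.

    With exact=False subclasses match too, so asking for Sitemap
    also returns NewsMap entries (NewsMap inherits from Sitemap).
    """
    if exact:
        return [s for s in spec.sources if type(s) is kind]
    return [s for s in spec.sources if isinstance(s, kind)]


die_welt = PublisherSpec(
    domain="https://www.welt.de/",
    sources=[
        RSSFeed("https://www.welt.de/feeds/latest.rss"),
        Sitemap("https://www.welt.de/sitemaps/sitemap/sitemap.xml"),
        NewsMap("https://www.welt.de/sitemaps/newssitemap/newssitemap.xml"),
    ],
)

rss_feeds = sources_of_type(die_welt, RSSFeed)
plain_sitemaps = sources_of_type(die_welt, Sitemap, exact=True)
```

One thing to keep in mind with this hierarchy: because `NewsMap` subclasses `Sitemap`, a plain `isinstance(source, Sitemap)` check also matches news maps, which is why the sketch offers an `exact` switch.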
Will be done.
Has been done, but I don't understand what mypy is complaining about. @MaxDall You may take a look.
I finished this one, so maybe @dobbersc should take a look, since both @Weyaaron and I have already worked on it.
Sure, I'll review this one.
This helps with #178: the Sitemap can now be constructed inside the PublisherSpec and use the reverse flag, which improves the parsing behavior for Occupy Democrats.
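For readers unfamiliar with the flag, a rough sketch of what a `reverse` option could do when iterating a sitemap: read the `<loc>` entries and yield them back to front. The function name `iter_locs` and the sample XML are assumptions made for illustration; this is not the Fundus implementation, and the idea that reversing helps because some sitemaps list their oldest entries first is likewise an assumption.

```python
import xml.etree.ElementTree as ET
from typing import Iterator

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def iter_locs(sitemap_xml: str, reverse: bool = False) -> Iterator[str]:
    """Yield the <loc> URLs of a sitemap, optionally in reverse document order."""
    root = ET.fromstring(sitemap_xml)
    locs = [el.text.strip() for el in root.iterfind(".//sm:loc", SITEMAP_NS) if el.text]
    yield from (reversed(locs) if reverse else locs)


# Made-up sample data on a placeholder domain.
SAMPLE = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/2019/oldest</loc></url>
  <url><loc>https://example.com/2021/middle</loc></url>
  <url><loc>https://example.com/2023/newest</loc></url>
</urlset>
"""

newest_first = list(iter_locs(SAMPLE, reverse=True))
```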
Update (@MaxDall): I changed a few things, the biggest one probably being that I moved the iteration logic of the former Sources to the newly added URLSource class. I did this so that the responsibility for URLs is decoupled from the Source itself. This allows us to now use custom sitemap filters to solve problems like #178. E.g., setting up a URLSource like this in the publisher spec for #178:
yields the first article URL after 0.35 seconds, while using the former URL filter would take at least >5min (that's where I stopped) and fetch countless unusable sites.
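As a rough illustration of why filtering at the sitemap level can be so much faster than filtering article URLs afterwards, the sketch below walks a nested sitemap index and skips rejected sub-sitemaps before they are ever fetched. This is not the actual URLSource configuration; `walk`, `keep_sitemap`, and the in-memory `FAKE_SITEMAPS` data are all hypothetical stand-ins.

```python
from typing import Callable, Dict, Iterator, List

# Hypothetical in-memory stand-in for the web: sitemap URL -> entries it lists.
FAKE_SITEMAPS: Dict[str, List[str]] = {
    "https://example.com/sitemap.xml": [
        "https://example.com/author-sitemap.xml",
        "https://example.com/post-sitemap.xml",
    ],
    "https://example.com/author-sitemap.xml": ["https://example.com/author/jane"],
    "https://example.com/post-sitemap.xml": ["https://example.com/2023/some-article"],
}


def walk(url: str, keep_sitemap: Callable[[str], bool], fetched: List[str]) -> Iterator[str]:
    """Yield page URLs, descending only into nested sitemaps the filter accepts."""
    fetched.append(url)  # record every sitemap we actually had to "fetch"
    for entry in FAKE_SITEMAPS[url]:
        if entry in FAKE_SITEMAPS:  # the entry is itself a nested sitemap
            if keep_sitemap(entry):
                yield from walk(entry, keep_sitemap, fetched)
        else:
            yield entry


fetched: List[str] = []
articles = list(
    walk("https://example.com/sitemap.xml", lambda u: "post-sitemap" in u, fetched)
)
```

With the filter applied, the author sitemap is never requested at all, whereas a filter on the final URLs would only discard its entries after paying for the download.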