commoncrawl / news-crawl

News crawling with StormCrawler - stores content as WARC
Apache License 2.0
316 stars, 34 forks

Allow to follow news sites not providing RSS/Atom feed or news sitemap #41

Open sebastian-nagel opened 4 years ago

sebastian-nagel commented 4 years ago

The news crawler (as of now) relies exclusively on RSS/Atom feeds and news sitemaps to find links to news articles. However, some news sites do not provide feeds or sitemaps. To follow these sites, the crawler should be able to monitor HTML pages manually marked as seeds and extract links from them.
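
To make the idea concrete, here is a minimal sketch of the link-extraction step in Python, using only the standard library. It is not how news-crawl or StormCrawler implements anything; the names `LinkExtractor`, `extract_links`, and `new_links` are hypothetical, and fetching the seed page, re-visit scheduling, and filtering out non-article links are left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects absolute http(s) URLs from <a href=...> tags (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the seed page URL and drop fragments.
        url = urljoin(self.base_url, href.strip()).split("#", 1)[0]
        # Keep only web links, in document order, without duplicates.
        if urlparse(url).scheme in ("http", "https") and url not in self.links:
            self.links.append(url)


def extract_links(base_url, html):
    """Return the outgoing links found in one seed page."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


def new_links(base_url, html, seen):
    """Return links not seen on earlier visits and record them in `seen`."""
    fresh = [u for u in extract_links(base_url, html) if u not in seen]
    seen.update(fresh)
    return fresh
```

Calling `new_links` on each re-fetch of a seed page, with the same `seen` set, yields only the links that appeared since the previous visit. That is the behavior a feed or news sitemap would otherwise provide.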

vladignatyev commented 8 months ago

Currently I'm working on very similar software. How could I contribute to the project?

wumpus commented 8 months ago

Vlad, this project is not currently a high priority for us. This enhancement is a good idea, and it's an idea that the search engine I founded a long time ago used successfully for our news crawl.