commoncrawl / news-crawl

News crawling with StormCrawler - stores content as WARC
Apache License 2.0

Explore schema.org annotations for seed completions #53

Open sebastian-nagel opened 1 year ago

sebastian-nagel commented 1 year ago

Explore the schema.org `NewsArticle` annotations found in the Common Crawl (CC) main crawls or in Web Data Commons (WDC) extractions to complete the list of news sites/domains used to look for news feeds and sitemaps. The difficulty is not finding seed candidates but selecting only genuine news sites.
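
As a starting point, here is a minimal sketch of how candidate hosts could be counted. It assumes page content is already available as `(url, html)` pairs (for example, read from WARC records) and only looks at JSON-LD blocks, ignoring Microdata and RDFa. The names `JsonLdExtractor`, `has_news_article` and `count_news_hosts` are hypothetical and not part of news-crawl or StormCrawler.

```python
import json
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urlsplit

# Assumption: only the exact type NewsArticle counts; subtypes could be added.
NEWS_TYPES = {"NewsArticle"}


class JsonLdExtractor(HTMLParser):
    """Collects the raw text of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buf = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buf = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            self.blocks.append("".join(self._buf))


def iter_types(node):
    """Yield every @type value found anywhere in a parsed JSON-LD structure."""
    if isinstance(node, dict):
        types = node.get("@type")
        if isinstance(types, str):
            types = [types]
        if isinstance(types, list):
            for t in types:
                if isinstance(t, str):
                    # Reduce "http://schema.org/NewsArticle" to "NewsArticle".
                    yield t.rsplit("/", 1)[-1]
        for value in node.values():
            yield from iter_types(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_types(item)


def has_news_article(html):
    """Return True if any JSON-LD block in the page declares a NewsArticle."""
    parser = JsonLdExtractor()
    parser.feed(html)
    for block in parser.blocks:
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # malformed JSON-LD is common on the web; skip it
        if any(t in NEWS_TYPES for t in iter_types(data)):
            return True
    return False


def count_news_hosts(pages):
    """Count pages with a NewsArticle annotation per host name."""
    counts = Counter()
    for url, html in pages:
        host = urlsplit(url).hostname
        if host and has_news_article(html):
            counts[host] += 1
    return counts
```

Calling `count_news_hosts(pages)` returns a `Counter` keyed by host name. The per-host count is only a crude signal: many non-news sites also mark pages up as `NewsArticle`, so the selection of genuine news sites would still need additional criteria.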