commoncrawl / news-crawl

News crawling with StormCrawler - stores content as WARC
Apache License 2.0

Use wikidata to complete seeds #50

Open sebastian-nagel opened 1 year ago

sebastian-nagel commented 1 year ago

Initially, the news crawler was seeded with URLs of news sites from DMOZ (see #8 for the procedure). DMOZ is no longer updated, but Wikidata could serve as a replacement to complete the seed list.

tfmorris commented 9 months ago

Wikidata-based seed URLs will probably require significant deduplication, filtering, re-ranking, etc., but here is a version of the query that adds the language of the URL, to account for sites that use different base URLs for different languages, such as Blick. It also expands the language list (because * doesn't work), though it could be generalized further. As an example of the kind of filtering needed, the Hubei Daily item has three URLs: a corporate site, an e-paper, and a 404.

SELECT DISTINCT ?item ?itemLabel ?lang ?worklang ?url WHERE {
  ?item (wdt:P31/(wdt:P279*)) wd:Q11032;
    p:P856 ?statement.
  ?statement ps:P856 ?url.
  OPTIONAL {
    ?statement pq:P407 ?worklanguage.
    ?worklanguage wdt:P220 ?worklang.
  }
  OPTIONAL {
    ?item wdt:P407 ?language.
    ?language wdt:P220 ?lang.
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en,de,uk,ru,fr,es,it,ja,zh,ar,hu,pt,be,rus,ce,br,cs,sv,dk,da,he,fi,nb,id,eu,pl,nl,az,mar,lv,hr,am,ba,r". }
}
LIMIT 100
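The deduplication the comment calls for could be sketched roughly as below. This is only an illustration, not part of the thread: the host normalization is deliberately naive (it strips a leading "www." rather than using a public-suffix list), the input rows are made up, and a real pass would also need liveness checks to drop dead links like the Hubei Daily 404.

```python
from urllib.parse import urlsplit

def normalize(url):
    """Lower-case the host and strip a leading 'www.' so URL
    variants of the same site collapse to one key."""
    host = urlsplit(url.strip()).netloc.lower()
    return host[4:] if host.startswith("www.") else host

def dedupe_seeds(rows):
    """Keep one URL per host, preferring rows that carry a language tag.
    `rows` are (url, lang) pairs shaped like the SPARQL query output;
    lang may be None when neither OPTIONAL clause matched."""
    best = {}
    for url, lang in rows:
        host = normalize(url)
        if not host:
            continue
        # replace an earlier entry only if it lacked a language tag
        if host not in best or (lang and not best[host][1]):
            best[host] = (url, lang)
    return sorted(best.values())

# hypothetical rows mimicking the query output
rows = [
    ("https://www.blick.ch/", "deu"),
    ("https://blick.ch/", None),          # same host, dropped
    ("http://epaper.newssite.example/", None),
]
print(dedupe_seeds(rows))                 # one entry per host
```

Ranking the surviving URL per host (e.g. preferring the shortest path, or the one whose language qualifier matches the item's language) would be a natural next refinement.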

As of today, there are 11,177 results. More than 200 languages are represented, plus a couple of thousand sites with no language tag, and the distribution looks about like what you'd expect (the two-letter codes represent TLDs, not language codes, e.g. hk, ru, uk, de, au, cn):

eng 3562
fra 826
spa 586
rus 467
deu 316
ita 177
ara 168
ukr 166
fin 152
zho 146
jpn 145
swe 140
nor 122
hk  112
ru  112
por 108
hun 103
nld 93
uk  90
de  86
kor 86
au  78
cn  78
pol 66
hin 60
bel 59
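Since the valid tags here are three-letter ISO 639-3 codes (via wdt:P220) while the two-letter tags are stray TLDs, the mixed distribution above can be split mechanically for manual review. A minimal sketch, assuming the counts are held in a plain dict (the sample values are taken from the table above):

```python
def split_tags(counts):
    """Partition a tag -> count mapping into three-letter ISO 639-3
    language codes and leftover two-letter tags, which in this data
    set are top-level domains rather than languages."""
    langs, tlds = {}, {}
    for tag, n in counts.items():
        (langs if len(tag) == 3 else tlds)[tag] = n
    return langs, tlds

counts = {"eng": 3562, "fra": 826, "hk": 112, "ru": 112, "deu": 316, "uk": 90}
langs, tlds = split_tags(counts)
print(sorted(tlds))  # two-letter tags flagged for manual review
```

The flagged entries could then either be dropped or mapped back to a language by resolving the site's item again.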