commoncrawl / news-crawl

News crawling with StormCrawler - stores content as WARC
Apache License 2.0

Use wikidata to complete seeds #50

Open sebastian-nagel opened 1 year ago

sebastian-nagel commented 1 year ago

Initially, the news crawler was seeded with URLs of news sites from DMOZ (see #8 for the procedure). DMOZ is no longer updated, but Wikidata could serve as a replacement to complete the seed list.

tfmorris commented 9 months ago

Wikidata-based seed URLs will probably require significant deduplication, filtering, re-ranking, etc., but here is a version of the query that adds the language of the URL, to account for sites that use different base URLs for different languages, such as Blick. It also expands the language list (because * doesn't work), though it could be generalized further. As an example of the kind of filtering needed, the Hubei Daily item has three URLs: a corporate site, an e-paper, and a 404.

SELECT DISTINCT ?item ?itemLabel ?lang ?worklang ?url WHERE {
  ?item (wdt:P31/(wdt:P279*)) wd:Q11032;
    p:P856 ?statement.
  ?statement ps:P856 ?url.
  OPTIONAL {
    ?statement pq:P407 ?worklanguage.
    ?worklanguage wdt:P220 ?worklang.
  }
  OPTIONAL {
    ?item wdt:P407 ?language.
    ?language wdt:P220 ?lang.
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en,de,uk,ru,fr,es,it,ja,zh,ar,hu,pt,be,rus,ce,br,cs,sv,dk,da,he,fi,nb,id,eu,pl,nl,az,mar,lv,hr,am,ba,r". }
}
LIMIT 100
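The deduplication the comment calls for could be sketched roughly as below. This is only an illustration, not part of the thread: the host normalization is deliberately naive (it strips a leading "www." rather than using a public-suffix list), the input rows are made up, and a real pass would also need liveness checks to drop dead links like the Hubei Daily 404.

```python
from urllib.parse import urlsplit

def normalize(url):
    """Lower-case the host and strip a leading 'www.' so URL
    variants of the same site collapse to one key."""
    host = urlsplit(url.strip()).netloc.lower()
    return host[4:] if host.startswith("www.") else host

def dedupe_seeds(rows):
    """Keep one URL per host, preferring rows that carry a language tag.
    `rows` are (url, lang) pairs shaped like the SPARQL query output;
    lang may be None when neither OPTIONAL clause matched."""
    best = {}
    for url, lang in rows:
        host = normalize(url)
        if not host:
            continue
        # replace an earlier entry only if it lacked a language tag
        if host not in best or (lang and not best[host][1]):
            best[host] = (url, lang)
    return sorted(best.values())

# hypothetical rows mimicking the query output
rows = [
    ("https://www.blick.ch/", "deu"),
    ("https://blick.ch/", None),          # same host, dropped
    ("http://epaper.newssite.example/", None),
]
print(dedupe_seeds(rows))                 # one entry per host
```

Ranking the surviving URL per host (e.g. preferring the shortest path, or the one whose language qualifier matches the item's language) would be a natural next refinement.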

As of today, there are 11,177 results. More than 200 languages are represented, plus a couple of thousand sites with no language tag, and the distribution looks about like what you'd expect (the two-letter codes represent TLDs, not language codes, e.g. hk, ru, uk, de, au, cn):

eng 3562
fra 826
spa 586
rus 467
deu 316
ita 177
ara 168
ukr 166
fin 152
zho 146
jpn 145
swe 140
nor 122
hk  112
ru  112
por 108
hun 103
nld 93
uk  90
de  86
kor 86
au  78
cn  78
pol 66
hin 60
bel 59
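Since the valid tags here are three-letter ISO 639-3 codes (via wdt:P220) while the two-letter tags are stray TLDs, the mixed distribution above can be split mechanically for manual review. A minimal sketch, assuming the counts are held in a plain dict (the sample values are taken from the table above):

```python
def split_tags(counts):
    """Partition a tag -> count mapping into three-letter ISO 639-3
    language codes and leftover two-letter tags, which in this data
    set are top-level domains rather than languages."""
    langs, tlds = {}, {}
    for tag, n in counts.items():
        (langs if len(tag) == 3 else tlds)[tag] = n
    return langs, tlds

counts = {"eng": 3562, "fra": 826, "hk": 112, "ru": 112, "deu": 316, "uk": 90}
langs, tlds = split_tags(counts)
print(sorted(tlds))  # two-letter tags flagged for manual review
```

The flagged entries could then either be dropped or mapped back to a language by resolving the site's item again.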