tmtmtmtm opened 6 years ago
Perhaps we should start this with the Area scrapers. https://morph.io/everypolitician-scrapers/us-states-wikidata is running out of memory, and the core query there could be replaced by something like:
```sparql
SELECT ?item
  (GROUP_CONCAT(DISTINCT ?flag_) AS ?flag)
  (GROUP_CONCAT(DISTINCT ?coat_of_arms_) AS ?coat_of_arms)
  (GROUP_CONCAT(DISTINCT ?iso_code_) AS ?iso_code)
  (GROUP_CONCAT(DISTINCT ?start_date_) AS ?start_date)
  (GROUP_CONCAT(DISTINCT ?end_date_) AS ?end_date)
  (GROUP_CONCAT(DISTINCT ?website_) AS ?website)
  (GROUP_CONCAT(DISTINCT ?replaces_) AS ?replaces)
  (GROUP_CONCAT(DISTINCT ?replaced_by_) AS ?replaced_by)
  (GROUP_CONCAT(DISTINCT ?identifier__viaf_) AS ?identifier__viaf)
  (GROUP_CONCAT(DISTINCT ?identifier__gnd_) AS ?identifier__gnd)
  (GROUP_CONCAT(DISTINCT ?identifier__lcauth_) AS ?identifier__lcauth)
  (GROUP_CONCAT(DISTINCT ?identifier__bnf_) AS ?identifier__bnf)
  (GROUP_CONCAT(DISTINCT ?identifier__openstreetmap_) AS ?identifier__openstreetmap)
  (GROUP_CONCAT(DISTINCT ?identifier__freebase_) AS ?identifier__freebase)
  (GROUP_CONCAT(DISTINCT ?identifier__gss_) AS ?identifier__gss)
  (GROUP_CONCAT(DISTINCT ?identifier__fips_) AS ?identifier__fips)
  (GROUP_CONCAT(DISTINCT ?identifier__dmoz_) AS ?identifier__dmoz)
  (GROUP_CONCAT(DISTINCT ?identifier__britannica_) AS ?identifier__britannica)
  (GROUP_CONCAT(DISTINCT ?identifier__geonames_) AS ?identifier__geonames)
  (GROUP_CONCAT(DISTINCT ?identifier__bbc_things_) AS ?identifier__bbc_things)
  (GROUP_CONCAT(DISTINCT ?identifier__tgn_) AS ?identifier__tgn)
  (GROUP_CONCAT(DISTINCT ?identifier__guardian_) AS ?identifier__guardian)
  (GROUP_CONCAT(DISTINCT ?identifier__newyorktimes_) AS ?identifier__newyorktimes)
  (GROUP_CONCAT(DISTINCT ?identifier__quora_) AS ?identifier__quora)
WHERE {
  # Inner variables carry a trailing underscore: SPARQL does not allow an
  # aggregate to be projected onto a variable that is already in scope.
  ?item wdt:P31 wd:Q35657 .                                # instance of: state of the United States
  OPTIONAL { ?item wdt:P41 ?flag_ }                        # flag image
  OPTIONAL { ?item wdt:P94 ?coat_of_arms_ }                # coat of arms image
  OPTIONAL { ?item wdt:P300 ?iso_code_ }                   # ISO 3166-2 code
  # P571 (inception) / P580 (start time) and P576 (dissolved) / P582 (end time)
  # feed the same columns; alternation in a single OPTIONAL avoids re-binding
  # an already-bound variable in a second OPTIONAL, which would drop values.
  OPTIONAL { ?item wdt:P571|wdt:P580 ?start_date_ }
  OPTIONAL { ?item wdt:P576|wdt:P582 ?end_date_ }
  OPTIONAL { ?item wdt:P856 ?website_ }                    # official website
  OPTIONAL { ?item wdt:P1365 ?replaces_ }
  OPTIONAL { ?item wdt:P1366 ?replaced_by_ }
  OPTIONAL { ?item wdt:P214 ?identifier__viaf_ }
  OPTIONAL { ?item wdt:P227 ?identifier__gnd_ }
  OPTIONAL { ?item wdt:P244 ?identifier__lcauth_ }
  OPTIONAL { ?item wdt:P268 ?identifier__bnf_ }
  OPTIONAL { ?item wdt:P402 ?identifier__openstreetmap_ }
  OPTIONAL { ?item wdt:P646 ?identifier__freebase_ }
  OPTIONAL { ?item wdt:P836 ?identifier__gss_ }
  OPTIONAL { ?item wdt:P901 ?identifier__fips_ }
  OPTIONAL { ?item wdt:P998 ?identifier__dmoz_ }
  OPTIONAL { ?item wdt:P1417 ?identifier__britannica_ }
  OPTIONAL { ?item wdt:P1566 ?identifier__geonames_ }
  OPTIONAL { ?item wdt:P1617 ?identifier__bbc_things_ }
  OPTIONAL { ?item wdt:P1667 ?identifier__tgn_ }
  OPTIONAL { ?item wdt:P3106 ?identifier__guardian_ }
  OPTIONAL { ?item wdt:P3221 ?identifier__newyorktimes_ }
  OPTIONAL { ?item wdt:P3417 ?identifier__quora_ }
}
GROUP BY ?item
```
and then building the labels per #628.
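For reference, fetching the labels in a separate pass might look something like the sketch below. This assumes the approach in #628 amounts to a dedicated label query over `rdfs:label`; the exact shape there may differ:

```sparql
# One row per (item, language) pair; the scraper can pivot these into
# per-language name columns afterwards.
SELECT ?item ?label (LANG(?label) AS ?lang)
WHERE {
  ?item wdt:P31 wd:Q35657 .
  ?item rdfs:label ?label .
}
```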
Currently most of the Wikidata scrapers use item-by-item API calls. This is quite slow and, on big datasets, runs out of memory on Morph.
Instead we should migrate as many of them as possible to be built from SPARQL queries.
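As a rough sketch of the shape this migration could take, here is a hypothetical Python skeleton (the names and structure are illustrative, not the real scraper code, which lives on morph.io):

```python
# Sketch: fetch a whole dataset from the Wikidata Query Service in one
# request, instead of making one API call per item.
import requests

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

QUERY = """
SELECT ?item (GROUP_CONCAT(DISTINCT ?iso_code_) AS ?iso_code)
WHERE {
  ?item wdt:P31 wd:Q35657 .
  OPTIONAL { ?item wdt:P300 ?iso_code_ }
}
GROUP BY ?item
"""

def run_query(query):
    # One HTTP request for the whole result set, returned as SPARQL JSON.
    response = requests.get(
        WDQS_ENDPOINT,
        params={"query": query, "format": "json"},
        headers={"User-Agent": "everypolitician-scraper-sketch/0.1"},
    )
    response.raise_for_status()
    return response.json()["results"]["bindings"]

if __name__ == "__main__":
    for row in run_query(QUERY):
        item = row["item"]["value"]
        iso = row.get("iso_code", {}).get("value", "")
        print(item, iso)
```

The point of this shape is that memory use is bounded by the size of the result set, which arrives in a single response, rather than by accumulated per-item API state.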