everypolitician / everypolitician

Data about every national legislature in the world, freely available for you to use
everypolitician.org

Build Wikidata via SPARQL, rather than API #627

Open tmtmtmtm opened 6 years ago

tmtmtmtm commented 6 years ago

Currently, most of the Wikidata scrapers use item-by-item API calls. This is quite slow and, on big datasets, runs out of memory on Morph.

Instead, we should migrate as many of them as possible to be built from SPARQL queries.

tmtmtmtm commented 6 years ago

Perhaps we should start with the Area scrapers. https://morph.io/everypolitician-scrapers/us-states-wikidata is running out of memory, and its core query could be replaced by something like:

  SELECT ?item
      (GROUP_CONCAT(DISTINCT ?flag) AS ?flag)
      (GROUP_CONCAT(DISTINCT ?coat_of_arms) AS ?coat_of_arms)
      (GROUP_CONCAT(DISTINCT ?iso_code) AS ?iso_code)
      (GROUP_CONCAT(DISTINCT ?start_date) AS ?start_date)
      (GROUP_CONCAT(DISTINCT ?end_date) AS ?end_date)
      (GROUP_CONCAT(DISTINCT ?website) AS ?website)
      (GROUP_CONCAT(DISTINCT ?replaces) AS ?replaces)
      (GROUP_CONCAT(DISTINCT ?replaced_by) AS ?replaced_by)
      (GROUP_CONCAT(DISTINCT ?identifier__viaf) AS ?identifier__viaf)
      (GROUP_CONCAT(DISTINCT ?identifier__gnd) AS ?identifier__gnd)
      (GROUP_CONCAT(DISTINCT ?identifier__lcauth) AS ?identifier__lcauth)
      (GROUP_CONCAT(DISTINCT ?identifier__bnf) AS ?identifier__bnf)
      (GROUP_CONCAT(DISTINCT ?identifier__openstreetmap) AS ?identifier__openstreetmap)
      (GROUP_CONCAT(DISTINCT ?identifier__freebase) AS ?identifier__freebase)
      (GROUP_CONCAT(DISTINCT ?identifier__gss) AS ?identifier__gss)
      (GROUP_CONCAT(DISTINCT ?identifier__fips) AS ?identifier__fips)
      (GROUP_CONCAT(DISTINCT ?identifier__dmoz) AS ?identifier__dmoz)
      (GROUP_CONCAT(DISTINCT ?identifier__britannica) AS ?identifier__britannica)
      (GROUP_CONCAT(DISTINCT ?identifier__geonames) AS ?identifier__geonames)
      (GROUP_CONCAT(DISTINCT ?identifier__bbc_things) AS ?identifier__bbc_things)
      (GROUP_CONCAT(DISTINCT ?identifier__tgn) AS ?identifier__tgn)
      (GROUP_CONCAT(DISTINCT ?identifier__guardian) AS ?identifier__guardian)
      (GROUP_CONCAT(DISTINCT ?identifier__newyorktimes) AS ?identifier__newyorktimes)
      (GROUP_CONCAT(DISTINCT ?identifier__quora) AS ?identifier__quora)
  WHERE {
      ?item wdt:P31 wd:Q35657 .
      OPTIONAL { ?item wdt:P41 ?flag }
      OPTIONAL { ?item wdt:P94 ?coat_of_arms }
      OPTIONAL { ?item wdt:P300 ?iso_code }
      OPTIONAL { ?item wdt:P571 ?start_date }
      OPTIONAL { ?item wdt:P576 ?end_date }
      OPTIONAL { ?item wdt:P580 ?start_date }
      OPTIONAL { ?item wdt:P582 ?end_date }
      OPTIONAL { ?item wdt:P856 ?website }
      OPTIONAL { ?item wdt:P1365 ?replaces }
      OPTIONAL { ?item wdt:P1366 ?replaced_by }
      OPTIONAL { ?item wdt:P214 ?identifier__viaf }
      OPTIONAL { ?item wdt:P227 ?identifier__gnd }
      OPTIONAL { ?item wdt:P244 ?identifier__lcauth }
      OPTIONAL { ?item wdt:P268 ?identifier__bnf }
      OPTIONAL { ?item wdt:P402 ?identifier__openstreetmap }
      OPTIONAL { ?item wdt:P646 ?identifier__freebase }
      OPTIONAL { ?item wdt:P836 ?identifier__gss }
      OPTIONAL { ?item wdt:P901 ?identifier__fips }
      OPTIONAL { ?item wdt:P998 ?identifier__dmoz }
      OPTIONAL { ?item wdt:P1417 ?identifier__britannica }
      OPTIONAL { ?item wdt:P1566 ?identifier__geonames }
      OPTIONAL { ?item wdt:P1617 ?identifier__bbc_things }
      OPTIONAL { ?item wdt:P1667 ?identifier__tgn }
      OPTIONAL { ?item wdt:P3106 ?identifier__guardian }
      OPTIONAL { ?item wdt:P3221 ?identifier__newyorktimes }
      OPTIONAL { ?item wdt:P3417 ?identifier__quora }
  }
  GROUP BY ?item

and then building the labels as per #628.
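
For illustration, here is a rough sketch of how a scraper might consume the results of a query like the one above with a single request, rather than one API call per item. It is written in Python using only the standard library, and is not the code the existing scrapers use. The function names `run_query` and `flatten_bindings` and the User-Agent string are hypothetical. The sketch assumes the Wikidata Query Service endpoint at https://query.wikidata.org/sparql returning the standard SPARQL JSON results format, and relies on `GROUP_CONCAT` joining values with a single space by default.

```python
import json
import urllib.parse
import urllib.request

# Public Wikidata Query Service endpoint.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def run_query(sparql, user_agent="example-scraper/0.1 (hypothetical)"):
    """Send one SPARQL query and return the parsed JSON response."""
    params = urllib.parse.urlencode({"query": sparql, "format": "json"})
    request = urllib.request.Request(
        WDQS_ENDPOINT + "?" + params,
        headers={"User-Agent": user_agent},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def flatten_bindings(result):
    """Turn SPARQL JSON bindings into one plain dict per item.

    The ?item URI is reduced to its bare Wikidata ID. Every other
    column is split on the space that GROUP_CONCAT uses as its default
    separator, and empty columns (no matching statements) are dropped.
    """
    rows = []
    for binding in result["results"]["bindings"]:
        row = {}
        for name, cell in binding.items():
            value = cell["value"]
            if name == "item":
                # e.g. ".../entity/Q123" becomes "Q123"
                row["id"] = value.rsplit("/", 1)[-1]
            elif value:
                row[name] = value.split(" ")
        rows.append(row)
    return rows
```

A scraper would then call `flatten_bindings(run_query(query))` once per area type and save the resulting rows, so that memory use scales with the size of the result set rather than with the number of fully loaded items. Splitting on a space is only safe for columns whose values cannot themselves contain spaces; if that is a concern, an explicit `separator` can be given to `GROUP_CONCAT` in the query.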