Objective-Earth / product-design

Apache License 2.0

Importing Data from Wikidata #81

Open p-mohan opened 2 years ago

p-mohan commented 2 years ago

The objective of this task is to collect country, state, and city/town data from a local Wikidata dump. We are using a local instance of Wikidata because this kind of query times out on the public Wikidata query service.

The following SPARQL query returns the expected results.

SELECT DISTINCT ?city ?lat ?lon ?cityLabel ?districtLabel ?stateLabel ?countryLabel WHERE {
  hint:Query hint:optimizer "None".
  VALUES ?country { wd:Q408 }.
  ?country wdt:P31 wd:Q3624078.
  ?state wdt:P131 ?country.
  ?district wdt:P131 ?state.
  ?city wdt:P131 ?district;
        p:P625 ?coordinate.
  ?coordinate ps:P625 ?coord.
  ?coordinate psv:P625 ?coordinate_node.
  ?coordinate_node wikibase:geoLongitude ?lon.
  ?coordinate_node wikibase:geoLatitude ?lat.
  MINUS { ?city wdt:P582 ?endTime }.
  FILTER EXISTS { ?city wdt:P1082 ?pop }.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
} ORDER BY ?country LIMIT 5

The prefix p: links an item to a statement node. From the statement node of a coordinate, the prefix ps: retrieves the full coordinate value, like Point(2.1749 41.3834), and the prefix psv: retrieves the coordinate's value node. On that value node, wikibase:geoLongitude retrieves the longitude and wikibase:geoLatitude retrieves the latitude.
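If only the ps: literal is available, the longitude and latitude can also be recovered by parsing it. The following is a minimal Python sketch, assuming the literal has the Point(longitude latitude) shape shown above; the helper name parse_wkt_point is hypothetical and not part of any Wikidata tooling.

```python
import re
from typing import Tuple

# A signed decimal number, optionally with an exponent.
_NUM = r"[-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?"

# Matches literals shaped like "Point(2.1749 41.3834)".
_POINT_RE = re.compile(
    r"^\s*Point\(\s*(" + _NUM + r")\s+(" + _NUM + r")\s*\)\s*$",
    re.IGNORECASE,
)


def parse_wkt_point(literal: str) -> Tuple[float, float]:
    """Return (longitude, latitude) from a Point(lon lat) literal.

    Hypothetical helper: the longitude comes first in the literal,
    the latitude second.
    """
    match = _POINT_RE.match(literal)
    if match is None:
        raise ValueError(f"not a Point(lon lat) literal: {literal!r}")
    lon, lat = (float(group) for group in match.groups())
    return lon, lat


lon, lat = parse_wkt_point("Point(2.1749 41.3834)")
print(lon, lat)
```

Using the psv: value node as in the query avoids this parsing step, since the two numbers arrive as separate variables.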

This sample query limits the results to country Q408 (Australia). Because the Wikidata hierarchy is somewhat non-trivial, the P131 chain also matches entities such as buildings; to remove them, I am using a filter that keeps a city only if it has a population property (P1082). The query also filters out locations that have an end time (P582), i.e. that have "ended".
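To run the same query for other countries, the country item can be turned into a parameter. The sketch below is one possible way to do that in Python; build_city_query and the placeholder token are hypothetical names, and the query text simply mirrors the sample query above.

```python
import re
from typing import Optional

# Placeholder token replaced by the country item id (hypothetical name).
_COUNTRY_PLACEHOLDER = "__COUNTRY_QID__"

_QUERY_TEMPLATE = """\
SELECT DISTINCT ?city ?lat ?lon ?cityLabel ?districtLabel ?stateLabel ?countryLabel WHERE {
  hint:Query hint:optimizer "None".
  VALUES ?country { wd:__COUNTRY_QID__ }.
  ?country wdt:P31 wd:Q3624078.
  ?state wdt:P131 ?country.
  ?district wdt:P131 ?state.
  ?city wdt:P131 ?district;
        p:P625 ?coordinate.
  ?coordinate psv:P625 ?coordinate_node.
  ?coordinate_node wikibase:geoLongitude ?lon.
  ?coordinate_node wikibase:geoLatitude ?lat.
  MINUS { ?city wdt:P582 ?endTime }.
  FILTER EXISTS { ?city wdt:P1082 ?pop }.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
} ORDER BY ?country"""


def build_city_query(country_qid: str, limit: Optional[int] = None) -> str:
    """Return the sample query for one country item, e.g. "Q408"."""
    # Accept only well-formed item ids so nothing else is spliced into the query.
    if not re.fullmatch(r"Q[1-9][0-9]*", country_qid):
        raise ValueError(f"not a Wikidata item id: {country_qid!r}")
    query = _QUERY_TEMPLATE.replace(_COUNTRY_PLACEHOLDER, country_qid)
    if limit is not None:
        if limit < 1:
            raise ValueError("limit must be a positive integer")
        query += f" LIMIT {limit}"
    return query


print(build_city_query("Q408", limit=5))
```

Leaving limit unset drops the LIMIT clause, which is what a full extraction for one country would need.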

p-mohan commented 2 years ago

It took 34 days to create a usable local copy of the Wikidata database (on a 4-thread, 32 GB instance) from the compressed public Wikidata dump (a download of around 100 GB). The uncompressed database is 790 GB in size. The next task will be to run the SPARQL query above to extract the city data.
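As a rough sketch of that next step, the query could be sent to the local instance over the SPARQL protocol and the JSON results flattened into rows. The endpoint URL below is an assumption (it has to be replaced with whatever address the local instance actually listens on), and run_query and bindings_to_rows are hypothetical helper names.

```python
import json
import urllib.parse
import urllib.request
from typing import Dict, List

# Assumption: address of the local SPARQL endpoint; adjust to the real setup.
ENDPOINT = "http://localhost:9999/bigdata/namespace/wdq/sparql"


def run_query(query: str, endpoint: str = ENDPOINT, timeout: float = 3600.0) -> dict:
    """POST a SPARQL query as a form field and return the decoded JSON results."""
    data = urllib.parse.urlencode({"query": query}).encode("utf-8")
    request = urllib.request.Request(
        endpoint,
        data=data,
        headers={"Accept": "application/sparql-results+json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def bindings_to_rows(results: dict) -> List[Dict[str, str]]:
    """Flatten SPARQL JSON results into one {variable: value} dict per row."""
    rows = []
    for binding in results["results"]["bindings"]:
        # Each cell looks like {"type": ..., "value": ...}; keep only the value.
        rows.append({name: cell["value"] for name, cell in binding.items()})
    return rows
```

A call such as bindings_to_rows(run_query(query)) would then give one dict per city, with keys like city, lat, lon and cityLabel; variables that are unbound in a row are simply absent from that row's dict.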