Speed up queries - Githubissues

blinry commented 5 years ago

We want card deck generation to be fast, but get data which is as complete as possible. Currently, our attempt to do that looks like this:

First, we run a query like

SELECT (COUNT(?item) AS ?count) WHERE { ?item wdt:P31 wd:Q11344. }

to find out how many items are direct instances of the type we're interested in. If this number is over 100, we seem to have enough items and run a query using the sample service to get no more than 3000 items back:

SELECT ?item ?itemLabel ?itemDescription ?image ?property ?propLabel ?valueLabel ?unitLabel ?precision WHERE {
  SERVICE bd:sample {
    ${restriction}
    bd:serviceParam bd:sample.limit 3000 .
    bd:serviceParam bd:sample.sampleType "RANDOM".
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en,de". }
  OPTIONAL { ?item wdt:P18 ?image. }
  ?item ?p ?statement.
  ?statement a wikibase:BestRank.

  ?property rdfs:label ?propLabel.
  ?property wikibase:claim ?p.
  ?property rdf:type wikibase:Property .

  FILTER (lang(?propLabel) = 'en' ).

  {
    ?property wikibase:propertyType wikibase:Quantity.

    ?statement ?psn ?valueNode.    
    ?valueNode wikibase:quantityAmount ?value.
    ?valueNode wikibase:quantityUnit ?unit.

    ?property wikibase:statementValue ?psn.
  } UNION {
    ?property wikibase:propertyType wikibase:Time.

    ?statement ?psn ?valueNode.
    ?valueNode wikibase:timeValue ?value.
    ?valueNode wikibase:timePrecision ?precision.

    ?property wikibase:statementValue ?psn.
  }
}

If the count is smaller than 100, we extend the query to also accept subclasses, and get all of them, in the hope of getting enough items:

SELECT ?item ?itemLabel ?itemDescription ?image ?property ?propLabel ?valueLabel ?unitLabel ?precision WHERE {
  ?item wdt:P31/wdt:P279* wd:Q11344.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en,de". }
  OPTIONAL { ?item wdt:P18 ?image. }
  ?item ?p ?statement.
  ?statement a wikibase:BestRank.

  ?property rdfs:label ?propLabel.
  ?property wikibase:claim ?p.
  ?property rdf:type wikibase:Property .

  FILTER (lang(?propLabel) = 'en' ).

  {
    ?property wikibase:propertyType wikibase:Quantity.

    ?statement ?psn ?valueNode.    
    ?valueNode wikibase:quantityAmount ?value.
    ?valueNode wikibase:quantityUnit ?unit.

    ?property wikibase:statementValue ?psn.
  } UNION {
    ?property wikibase:propertyType wikibase:Time.

    ?statement ?psn ?valueNode.
    ?valueNode wikibase:timeValue ?value.
    ?valueNode wikibase:timePrecision ?precision.

    ?property wikibase:statementValue ?psn.
  }
}

The second type of query gives more complete results, but does not terminate in 60 seconds for huge classes like humans or cities. Can we optimize it somehow? We're flexible in how the resulting table looks like, as long as all the data is there. We're interested in as many items as possible which have a lot of numeric or time-based properties. We could also split the query into individual subqueries, if that helps?

blinry commented 5 years ago

Maybe @lucaswerkmeister has any ideas? <3

lucaswerkmeister commented 5 years ago

If you only want e. g. 3000 items, you can use a subquery to limit the items to that number, avoiding to run the rest of the query on the huge result set:

SELECT ?item ?itemLabel ?itemDescription ?image ?property ?propLabel ?valueLabel ?unitLabel ?precision WITH {
  SELECT ?item WHERE {
    ?item wdt:P31/wdt:P279* wd:Q11344.
  }
  ORDER BY MD5(CONCAT(STR(NOW()), STR(?item)))
  LIMIT 3000
} AS %items WHERE {
  INCLUDE %items.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en,de". }
  OPTIONAL { ?item wdt:P18 ?image. }
  ?item ?p ?statement.
  ?statement a wikibase:BestRank.

  ?property rdfs:label ?propLabel.
  ?property wikibase:claim ?p.
  ?property rdf:type wikibase:Property .

  FILTER (lang(?propLabel) = 'en' ).

  {
    ?property wikibase:propertyType wikibase:Quantity.

    ?statement ?psn ?valueNode.    
    ?valueNode wikibase:quantityAmount ?value.
    ?valueNode wikibase:quantityUnit ?unit.

    ?property wikibase:statementValue ?psn.
  } UNION {
    ?property wikibase:propertyType wikibase:Time.

    ?statement ?psn ?valueNode.
    ?valueNode wikibase:timeValue ?value.
    ?valueNode wikibase:timePrecision ?precision.

    ?property wikibase:statementValue ?psn.
  }
}

blinry commented 5 years ago

That helps a lot, thanks!

lucaswerkmeister commented 5 years ago

I was hoping to optimize the query a bit more, but didn’t get anywhere :/ but I’ve just dumped my two attempts in this gist, perhaps you can still find something useful there (I couldn’t find a way to avoid timeouts).

blinry commented 5 years ago

One approach that seems to work in many cases is to order the items in the full item set by how many statements they have, and take the (say) top 100. Then do the rest on the query on those. This seems to yield high-quality results in many cases, and works for mountains and rivers, but not for humans. :P

SELECT ?item ?itemLabel ?itemDescription ?image ?property ?propLabel ?valueLabel ?unitLabel ?precision WITH {
  SELECT DISTINCT ?item WHERE {
    ?item (wdt:P31/wdt:P279*) wd:Q8502.
    ?item wikibase:statements ?statements.
  }
  ORDER BY DESC(?statements)
  LIMIT 100
} AS %items
WHERE {
  INCLUDE %items.

  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en,de". }

  OPTIONAL { ?item wdt:P18 ?image. }

  ?item ?p ?statement.
  ?statement a wikibase:BestRank.

  ?property rdfs:label ?propLabel.
  ?property wikibase:claim ?p.
  ?property rdf:type wikibase:Property .

  FILTER (lang(?propLabel) = 'en' ).
  FILTER (?propLabel != "inception"@en).

  {
    ?property wikibase:propertyType wikibase:Quantity.

    ?statement ?psn ?valueNode.
    ?valueNode wikibase:quantityAmount ?value.
    ?valueNode wikibase:quantityUnit ?unit.

    ?property wikibase:statementValue ?psn.
  } UNION {
    ?property wikibase:propertyType wikibase:Time.

    ?statement ?psn ?valueNode.
    ?valueNode wikibase:timeValue ?value.
    ?valueNode wikibase:timePrecision ?precision.

    ?property wikibase:statementValue ?psn.
  }
}

blinry / wikidata-cardgame

Speed up queries #2