Swirrl / ook

Structural search engine
https://search-prototype.gss-data.org.uk/
Eclipse Public License 1.0

Observation pipeline select-page with order-by times out #95

Closed Robsteranium closed 3 years ago

Robsteranium commented 3 years ago

We need the ORDER BY to ensure that the LIMIT/OFFSET gets contiguous pages of URIs (#69).

Adding an ORDER BY to observation-select.sparql causes the query to time out (currently with 28m observations on IDP beta).

We could spill all the selected URIs to disk (in a single query with no limit/offset/order by) and page through that instead.
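A minimal sketch of the paging half of that idea, assuming the spilled file ends up with one observation URI per line (the real response format depends on what the endpoint returns, so a header row or quoting may need stripping first). The `pages` helper and the page size are hypothetical names for illustration, not anything in ook:

```python
from itertools import islice
from typing import Iterable, Iterator, List


def pages(lines: Iterable[str], page_size: int) -> Iterator[List[str]]:
    """Yield consecutive, non-overlapping pages of URIs from an iterable of lines.

    Assumption: each non-blank line holds exactly one URI.
    """
    # Lazily strip whitespace and skip blank lines so the whole file
    # never has to be held in memory at once.
    uris = (line.strip() for line in lines if line.strip())
    while True:
        page = list(islice(uris, page_size))
        if not page:
            return
        yield page


# Usage (hypothetical file name and page size):
# with open("observations.txt") as f:
#     for page in pages(f, 50000):
#         ...  # build the next pipeline query from this page of URIs
```

Because the file is read once, front to back, the pages are contiguous by construction and no ORDER BY is needed on the server side.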

Downloading all the observation URIs takes about a minute and comes to about 600M on disk:

curl 'https://beta.gss-data.org.uk/sparql' -d 'query=PREFIX%20qb%3A%20%3Chttp%3A%2F%2Fpurl.org%2Flinked-data%2Fcube%23%3E%0A%0ASELECT%20%3Fobservation%0AWHERE%20%7B%20%0A%20%20%3Fobservation%20qb%3AdataSet%20%3Fcube%20.%0A%7D' > observations.txt
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  568M    0  568M  100   178  9064k      2  0:01:29  0:01:04  0:00:25  9.7M
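For readability, the URL-encoded query in the curl command can be decoded with the standard library. This is only a convenience sketch; the `ENCODED` constant is copied from the command above and nothing here contacts the endpoint:

```python
from urllib.parse import unquote

# The value of the `query` form parameter from the curl command above.
ENCODED = (
    "PREFIX%20qb%3A%20%3Chttp%3A%2F%2Fpurl.org%2Flinked-data%2Fcube%23%3E"
    "%0A%0ASELECT%20%3Fobservation%0AWHERE%20%7B%20%0A"
    "%20%20%3Fobservation%20qb%3AdataSet%20%3Fcube%20.%0A%7D"
)

# Percent-decode back to plain SPARQL text.
query = unquote(ENCODED)
print(query)
```

The decoded text is a plain SELECT of `?observation` matched on `qb:dataSet`, with no ORDER BY, LIMIT, or OFFSET.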