Swirrl / ook

Structural search engine
https://search-prototype.gss-data.org.uk/
Eclipse Public License 1.0

Missing values for code-used field #69

Closed. Robsteranium closed this issue 3 years ago.

Robsteranium commented 3 years ago

For some reason the "used" field is missing on some codes. It ought to be either "true" or "false".

If I run the following query:

{
  "size": 0,
  "aggregations": {
    "used": {
      "terms": {
        "field": "used"
      }
    }
  }
}

Then across the buckets I see a total of 49,441 values despite having 50,337 codes.
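
The comparison above can be scripted. The following is a minimal sketch, assuming the response has the usual Elasticsearch terms-aggregation shape (`aggregations.<name>.buckets[].doc_count`, plus `sum_other_doc_count`); the helper name `count_missing` and the toy response are hypothetical and only illustrate the arithmetic.

```python
def count_missing(agg_response, agg_name, total_docs):
    """Return how many documents are not counted in any bucket.

    Assumes a standard terms-aggregation response; `total_docs` is the
    number of documents the aggregation was expected to cover.
    """
    agg = agg_response["aggregations"][agg_name]
    counted = sum(bucket["doc_count"] for bucket in agg["buckets"])
    # Documents in buckets that were cut off by the aggregation's size limit.
    counted += agg.get("sum_other_doc_count", 0)
    return total_docs - counted


# Toy response (illustrative numbers, not taken from the real index).
toy_response = {
    "aggregations": {
        "used": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
                {"key": "true", "doc_count": 3},
                {"key": "false", "doc_count": 2},
            ],
        }
    }
}

missing = count_missing(toy_response, "used", total_docs=6)
print(missing)
```

With the figures reported in this issue, the same subtraction is 50,337 - 49,441 = 896.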

Indeed, the following query matches the remaining 896 docs where this field is missing.

{
  "query": {
    "bool": {
      "must_not": [
        {
          "exists": {
            "field": "used"
          }
        }
      ]
    }
  }
}
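
For checking other fields in the same way, the query above can be generated rather than written by hand. This is a rough sketch: `missing_field_query` and `lacks_value` are hypothetical helpers, and `lacks_value` is only a simplified local approximation of what `exists` treats as missing (an absent field, `null`, or an empty array).

```python
import json


def missing_field_query(field):
    """Build the must_not/exists query body for an arbitrary field name."""
    return {"query": {"bool": {"must_not": [{"exists": {"field": field}}]}}}


def lacks_value(doc, field):
    """Approximate the query locally: absent, None and [] count as missing."""
    value = doc.get(field)
    return value is None or value == []


# Toy documents (hypothetical), only to exercise the helpers.
toy_docs = [
    {"id": "a", "used": True},
    {"id": "b", "used": False},
    {"id": "c"},
    {"id": "d", "used": None},
]

missing_ids = [doc["id"] for doc in toy_docs if lacks_value(doc, "used")]
print(json.dumps(missing_field_query("used"), indent=2))
print(missing_ids)
```

Note that a value of false still counts as present, so codes explicitly marked as unused are not reported as missing.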

It's not clear if this is to do with select-pagination or upserts.

Robsteranium commented 3 years ago

This appears to be caused by the select queries used for paging. The results for successive pages (with increasing limit/offset) aren't contiguous, even when there are no writes in the meantime. Adding an ORDER BY clause (e.g. ORDER BY ?uri) seems to resolve this.
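
The effect can be reproduced without a triplestore. The sketch below is a simulation under stated assumptions, not the pipeline's actual code: `select_page` and `fetch_all` are hypothetical, and the shuffle deliberately exaggerates what a real store does (in practice the order is often mostly stable, but SPARQL gives no guarantee about it when ORDER BY is absent).

```python
import random


def select_page(rows, limit, offset, order_key=None, rng=random):
    """Emulate one SELECT ... LIMIT/OFFSET query.

    Without an order key the result order is arbitrary and may differ
    between queries, which is modelled here by shuffling.
    """
    result = list(rows)
    if order_key is None:
        rng.shuffle(result)
    else:
        result.sort(key=order_key)
    return result[offset:offset + limit]


def fetch_all(rows, page_size, order_key=None, seed=0):
    """Page through the rows until an empty page is returned."""
    rng = random.Random(seed)
    fetched, offset = [], 0
    while True:
        page = select_page(rows, page_size, offset, order_key, rng)
        if not page:
            break
        fetched.extend(page)
        offset += page_size
    return fetched


# Hypothetical code URIs, for illustration only.
codes = [f"http://example.org/code/{i}" for i in range(100)]

unordered = fetch_all(codes, page_size=10)
ordered = fetch_all(codes, page_size=10, order_key=lambda uri: uri)

# Rows fetched versus distinct rows fetched, for each strategy.
print(len(unordered), len(set(unordered)))
print(len(ordered), len(set(ordered)))
```

Both runs fetch the same number of rows, but only the ordered run is guaranteed to see every code exactly once. The unordered run repeats some codes and never sees others, which is consistent with some docs ending up without a value for the field.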

The same problem will plague the other pipelines. This solution doesn't work for the observation pager, though, as the ORDER BY clause causes the query to time out (at least on idp-beta, with 28 million observations).