Swirrl / ook

Structural search engine
https://search-prototype.gss-data.org.uk/
Eclipse Public License 1.0

Enumerate all codelists when matching datasets to facet selections #52

Closed. Robsteranium closed this issue 3 years ago.

Robsteranium commented 3 years ago

We need to show, for each dataset and facet, which of the selected codelists are in use, and to provide some example codes for context. The draft query isn't guaranteed to return examples from every codelist (indeed we only want some examples). Worse, it could give the impression that a selected code was not present when in fact it was: the collapsed hits (e.g. the top 3 per dataset) could all be about the first code.

We can enumerate the dimension values using aggregations instead of collapse. This returns, for each dataset and each dimension, the count of observations by dimension-value; typically only one dimension per dataset would have results. Although this will be better at enumerating codes than the collapse version (since it's grouping), it still won't guarantee to find example codes in all of the codelists: if we take e.g. the top 10 hits per dimension, these could all come from the first codelist. We could instead set the size to an exhaustively high value and take e.g. the first 3 codes per codelist in Clojure.

{
  "size": 0,
  "query": {
    "bool": {
      "should": [
        {
          "terms": {
            "data/gss_data/trade/hmrc-alcohol-bulletin/alcohol-bulletin-clearances#dimension/period.@id": [
              "http://reference.data.gov.uk/id/year/2019"
            ]
          }
        },
        {
          "terms": {
            "data/gss_data/trade/hmrc-alcohol-bulletin/alcohol-bulletin-duty-receipts#dimension/period.@id": [
              "http://reference.data.gov.uk/id/year/2019"
            ]
          }
        },
        {
          "exists": {
            "field": "data/gss_data/trade/hmrc-alcohol-bulletin/alcohol-bulletin-production#dimension/period.@id"
          }
        }
      ]
    }
  },
  "aggregations": {
    "datasets": {
      "terms": {
        "field": "qb:dataSet.@id"
      },
      "aggregations": {
        "data/gss_data/trade/hmrc-alcohol-bulletin/alcohol-bulletin-clearances#dimension/period.@id": {
          "terms": {
            "field": "data/gss_data/trade/hmrc-alcohol-bulletin/alcohol-bulletin-clearances#dimension/period.@id"
          }
        },
        "data/gss_data/trade/hmrc-alcohol-bulletin/alcohol-bulletin-duty-receipts#dimension/period.@id": {
          "terms": {
            "field": "data/gss_data/trade/hmrc-alcohol-bulletin/alcohol-bulletin-duty-receipts#dimension/period.@id"
          }
        },
        "data/gss_data/trade/hmrc-alcohol-bulletin/alcohol-bulletin-production#dimension/period.@id": {
          "terms": {
            "field": "data/gss_data/trade/hmrc-alcohol-bulletin/alcohol-bulletin-production#dimension/period.@id"
          }
        }
      }
    }
  }
}

This could be quite slow.
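The "take the first N codes per codelist" post-processing step could be sketched as follows. This is a hypothetical illustration (the real implementation would be in Clojure): `code_to_codelist` is an assumed lookup from code URI to its codelist, and the bucket shapes match a standard Elasticsearch terms-aggregation response, which arrives sorted by `doc_count` descending.

```python
from collections import defaultdict

def top_codes_by_codelist(buckets, code_to_codelist, limit=3):
    """Group dimension-value buckets by their codelist and keep the first
    `limit` codes per codelist (buckets arrive sorted by doc_count desc)."""
    grouped = defaultdict(list)
    for bucket in buckets:
        scheme = code_to_codelist.get(bucket["key"])
        if scheme is not None and len(grouped[scheme]) < limit:
            grouped[scheme].append({"code": bucket["key"],
                                    "count": bucket["doc_count"]})
    return dict(grouped)

# Hypothetical code->codelist lookup and a response fragment shaped like
# the terms aggregation above.
code_to_codelist = {
    "http://reference.data.gov.uk/id/year/2019": "calendar-years",
    "http://reference.data.gov.uk/id/quarter/2019-Q1": "calendar-quarters",
}
buckets = [
    {"key": "http://reference.data.gov.uk/id/year/2019", "doc_count": 120},
    {"key": "http://reference.data.gov.uk/id/quarter/2019-Q1", "doc_count": 30},
]
print(top_codes_by_codelist(buckets, code_to_codelist))
```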

Alternatively we could denormalise the codelist onto the dimension-values in the observation index. This would allow us to add a second level of collapse (on the inner_hits of the dataset collapse), e.g. "field": "<dimension>.scheme", to yield the top X codes by codelist by dataset.
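A sketch of what that two-level collapse query might look like, built as a Python dict for readability. The `.scheme` sub-field is the assumed denormalised codelist URI (it does not exist in the index today), and the dimension name is just one of the examples from the query above.

```python
# One of the dimension fields from the draft query above.
dimension = ("data/gss_data/trade/hmrc-alcohol-bulletin/"
             "alcohol-bulletin-clearances#dimension/period")

# Collapse on dataset; within each dataset's inner hits, collapse again
# on the (hypothetical) denormalised codelist field "<dimension>.scheme"
# to surface the top codes per codelist per dataset.
query = {
    "query": {"exists": {"field": dimension + ".@id"}},
    "collapse": {
        "field": "qb:dataSet.@id",
        "inner_hits": {
            "name": "codes",
            "size": 3,  # top X codes per dataset
            "collapse": {"field": dimension + ".scheme"},
        },
    },
}
```

Elasticsearch does support a second-level collapse inside `inner_hits` like this, though it cannot be nested any deeper.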

Robsteranium commented 3 years ago

Initial experiments with aggregation aren't showing any obvious slowdown.

Robsteranium commented 3 years ago

Resolved with #68.