Swirrl / ook

Structural search engine
https://search-prototype.gss-data.org.uk/
Eclipse Public License 1.0

Introduce curated facets #22

Closed Robsteranium closed 3 years ago

Robsteranium commented 3 years ago

Ultimately we want to be able to let users filter the list of datasets by code, codelist or dimension.

This is quite complex for the user and involves a lot of potential choices.

The facets provide a set of partially-configured searches, locking a dimension (or dimensions) or codelist.

Building on #21, each applied facet gets a column in the results table.

Facets have a name (to encapsulate the configuration). This name is displayed in the control that introduces the facet ("+ Product") and as the header on the column that's displayed when the facet is applied.

We'll need to create some examples of the configuration.

A trivial implementation would be to simply restrict the code-search results to ones that could appear on a given dimension (via its codelists) or that are contained in a given codelist.
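
As a rough Python sketch of that trivial implementation (the project itself is Clojure, so this is only illustrative): the code search gets an extra filter on the codelists that are in scope for the facet. The "scheme" field is the one used in the codes index; the "label" field and the function name are assumptions.

```python
from typing import Iterable


def restricted_code_search(text: str, schemes: Iterable[str]) -> dict:
    """Build a search body matching codes by label within the given codelists."""
    return {
        "query": {
            "bool": {
                # "label" is an assumed name for the searchable text field
                "must": [{"match": {"label": text}}],
                # "scheme" holds the URI of the codelist each code belongs to
                "filter": [{"terms": {"scheme": list(schemes)}}],
            }
        }
    }
```

The list of schemes would come from the facet: either a locked codelist, or the codelists of its locked dimension(s).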

What we actually want to support is the capacity to also find datasets by choosing dimensions or codelists (without needing to choose specific codes).

The (currently) proposed UI for this is to show a table of results when creating filters. Only the elements that haven't been locked by the facet would be displayed. Locked elements serve to restrict the choices presented. It may thus make sense to build the custom-facet interface first, then remove parts of it for curated facets.

The table shows columns for dimension, codelist and code. The dimension and codelist cells can include row-spans (or similar DOM structure) to show where e.g. one dimension has two codelists or one codelist has more than one matching code.

To begin with we should present controls to select dimensions, codelists or codes.

To extend this, we can look at transitive changes. Selecting a dimension should hide/ disable controls to select codelists or codes (the implication is that any can apply). Selecting a codelist should hide/ disable controls to select individual codes.
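
The transitive rules could be captured in something as small as this (a Python sketch; the function and key names are made up for illustration):

```python
from typing import Dict


def control_states(dimension_selected: bool, codelist_selected: bool) -> Dict[str, bool]:
    """Say which selection controls stay enabled within one dimension's rows."""
    return {
        # the dimension control is always available
        "dimension": True,
        # picking the dimension implies any of its codelists can apply
        "codelist": not dimension_selected,
        # picking the dimension or the codelist implies any code can apply
        "codes": not (dimension_selected or codelist_selected),
    }
```

For example, `control_states(False, True)` leaves the dimension and codelist controls enabled but disables the individual codes.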

Robsteranium commented 3 years ago

Ideally we'd have a facet configuration something like this:

{:ook.search.facets
 [{:name "Product"
   :parent_dimension "http://gss-data.org.uk/def/trade/property/dimension/product"}
  {:name "Industry" ;; should this be further down the list?
   :parent_dimension "http://gss-data.org.uk/def/trade/property/dimension/industry"}
  {:name "Reporter Geography" ;; should this be joined with http://gss-data.org.uk/def/trade/property/dimension/geography or sdmxd:refArea
   :parent_dimension "http://gss-data.org.uk/def/trade/property/dimension/reporter-geography"}
  {:name "Partner Geography"
   :parent_dimension "http://gss-data.org.uk/def/trade/property/dimension/partner-geography"}
  {:name "Flow"
   :parent_dimension "http://gss-data.org.uk/def/trade/property/dimension/flow-directions"}
  {:name "Date"
   :parent_dimension "http://purl.org/linked-data/sdmx/2009/dimension#refPeriod"}
  {:name "Measure"
   :parent_dimension "http://purl.org/linked-data/cube#measureType"}
  {:name "Basis"
   :parent_dimension "http://gss-data.org.uk/def/trade/property/dimension/international-trade-basis"}
  {:name "Seasonality"
   :parent_dimension "http://gss-data.org.uk/def/trade/property/dimension/seasonal-adjustments"}]}

You can find all the dimensions that are children of the :parent_dimension in the components index using the "subPropertyOf" property. This config ought to have URIs that match the prefixes described in the @context of resources/etl/component-frame.json.

You can find the codelists that apply for each component under the "codelist" property in the same index. We don't need to specify this in the configuration.
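
A possible shape for that lookup, sketched in Python. It assumes "subPropertyOf" is indexed as an exact-match field holding the parent's URI in the same form as the config, that each hit's _id is the dimension URI, and that "codelist" is stored as either a single URI or a list of URIs. None of this has been checked against the actual mapping.

```python
from typing import Dict, List


def child_dimensions_query(parent_dimension: str) -> dict:
    """Search body for the components index: children of one parent dimension."""
    return {
        "query": {"term": {"subPropertyOf": parent_dimension}},
        "_source": ["codelist"],
        "size": 100,  # arbitrary cap; the default would only return 10 hits
    }


def codelists_by_dimension(response: dict) -> Dict[str, List[str]]:
    """Map each dimension in the response to its (possibly empty) codelists."""
    result: Dict[str, List[str]] = {}
    for hit in response["hits"]["hits"]:
        value = hit["_source"].get("codelist", [])
        # assumption: a single codelist may be stored as a bare string
        result[hit["_id"]] = [value] if isinstance(value, str) else list(value)
    return result
```

Dimensions that come back with an empty list are the ones where there's no codelist to offer yet.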

The data isn't available to drive this so we'll have to be a little less succinct. This alternate, workaround configuration ought to work with the data that is available now. It's worth double-checking whether this is still needed by the time you come to implement it!

{:ook.search.facets
 [{:name "Product"
   :dimensions ["http://gss-data.org.uk/def/trade/property/dimension/product"
                "http://gss-data.org.uk/def/trade/property/dimension/sitc-4"
                "http://gss-data.org.uk/data/gss_data/trade/ons_pinkbook#dimension/pink-book-services"
                "http://gss-data.org.uk/data/gss_data/trade/ons_cpa#dimension/cpa-2008"
                "http://gss-data.org.uk/data/gss_data/trade/ons_mrets#dimension/mrets-product"]}
  {:name "Industry"
   :dimensions ["http://gss-data.org.uk/data/gss_data/trade/ons-uk-trade-in-services-by-industry-country-and-service-type-exports/pcn#dimension/industry"
                "http://gss-data.org.uk/def/trade/property/dimension/industry-section"
                "http://gss-data.org.uk/data/gss_data/trade/dcms-sectors-economic-estimates-year-trade-in-services#dimension/sector"
                "http://gss-data.org.uk/data/gss_data/trade/dcms-sectors-economic-estimates-year-trade-in-services#dimension/subsector"
                "http://gss-data.org.uk/def/trade/property/dimension/sic-2007"
                "http://gss-data.org.uk/def/trade/property/dimension/ons-functional-category"
                "http://gss-data.org.uk/data/gss_data/trade/ons-uk-trade-in-services-by-industry-country-and-service-type-exports/pcn#dimension/service-account"]}
  {:name "Geography"
   :dimensions ["http://gss-data.org.uk/data/gss_data/trade/ons-uk-trade-in-services-by-industry-country-and-service-type-exports/pcn#dimension/country"
                "http://gss-data.org.uk/data/gss_data/trade/ons-uk-sa-trade-in-goods#dimension/ons-partner-geography"
                "http://gss-data.org.uk/data/gss_data/trade/ons-reuk-service#dimension/nuts-geography"
                "http://gss-data.org.uk/data/gss_data/trade/ons_cpa#dimension/ons-partner-geography"
                "http://gss-data.org.uk/def/trade/property/dimension/country-of-ownership"
                "http://gss-data.org.uk/data/gss_data/trade/ons-quarterly-country-and-regional-gdp#dimension/reference-area"
                "http://gss-data.org.uk/data/gss_data/trade/dcms-sectors-economic-estimates-year-trade-in-services#dimension/country"
                "http://gss-data.org.uk/data/gss_data/trade/hmrc-regional-trade-statistics#dimension/uk-region"
                "http://gss-data.org.uk/data/gss_data/trade/hmrc-regional-trade-statistics#dimension/country"
                "http://gss-data.org.uk/def/trade/property/dimension/country-area"
                "http://gss-data.org.uk/data/gss_data/trade/ons_mrets#dimension/trade-area"]}
  {:name "Flow"
   :parent_dimension "http://gss-data.org.uk/def/trade/property/dimension/flow-directions"}
  {:name "Date"
   :parent_dimension "http://purl.org/linked-data/sdmx/2009/dimension#refPeriod"}
  {:name "Basis"
   :parent_dimension "http://gss-data.org.uk/def/trade/property/dimension/international-trade-basis"}
  {:name "Seasonality"
   :parent_dimension "http://gss-data.org.uk/def/trade/property/dimension/seasonal-adjustments"}]}

In some cases we've had to replace the :parent_dimension with a :dimensions vector (which enumerates the values that will ultimately be involved). This is brittle (the config won't update automatically as new datasets/ dimensions are added) but it will give us something to work with for the time being.
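
Whatever reads the config could hide the difference between the two forms. A minimal Python sketch, where `lookup_children` is a hypothetical function that returns the child dimensions of a parent (e.g. via the components index), and where including the parent itself in the result is an assumption:

```python
from typing import Callable, Dict, Iterable, List


def facet_dimensions(facet: Dict, lookup_children: Callable[[str], Iterable[str]]) -> List[str]:
    """Resolve a facet's config entry to the list of dimensions it covers."""
    if "dimensions" in facet:
        # workaround form: the dimensions are enumerated in the config
        return list(facet["dimensions"])
    # preferred form: derive the dimensions from the parent
    parent = facet["parent_dimension"]
    return [parent, *lookup_children(parent)]
```

Once the data supports it, the :dimensions entries can be swapped back to :parent_dimension without touching the callers.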

Note that the existing data is also missing a lot of codelists. There's not much we can do to work around this so we'll just need to show some sort of error message, e.g. "there are no codelists available for this facet yet, sorry".

Robsteranium commented 3 years ago

Progress to date

37 introduces facets - the configuration, control and results columns.

This only works for codelists; the next stage will be to extend this to allow code selection (and search).

Evolving requirements

The dimension selection problem has been rendered irrelevant/ unsolvable by the data. We've basically got no dimension re-use so they don't help us to compare datasets. Instead we need to use parent-dimensions - there are so few of those that there seems little point in having custom-facets at this stage.

This simplifies the UI requirements somewhat - the facets only need to help users select codelists or codes across codelists.

Instead of presenting a table, we have been thinking about a tree view that could support the following requirements:

Candidate solutions

Codelist == Top-level Code

We could use top-level codes to represent “anything in this codelist”.

This would obviate the need to distinguish between code selections and codelist selections - the facet form is only ever supporting you to select codes.

This also means we can match observations (provide filtered-dataset links) as we’re working with specific codes, not the whole dimension (via the codelist).

The facet form could thus start by showing each of the codelists with the top-level code selected. You could then disclose a breakdown into each scheme and search over all of them at once (like the concept browser nested-view).

This would require that:

These data requirements won’t be met in the next few months.

Codelist == "Any" Code

Here's a wireframe of one idea that makes codelists the first level in a tree, with affordances for selecting "any" as a proxy for selecting a codelist.

[Wireframe: multi-codelist-tree]

Clicking "any" on the codelists would select the codelist-uri in the facet, set the checkboxes of children to "indeterminate" and collapse them all. If you then uncollapsed the children and checked one, it would set the others to unchecked.

Clicking "all" on a parent selects all the children. We might also have a "none" button.
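
The checkbox behaviour described above might be modelled like this (a Python sketch of the state transitions only; the class and function names are invented, and the tree is flattened to one level of children to keep it short):

```python
from dataclasses import dataclass, field
from typing import Dict

CHECKED, UNCHECKED, INDETERMINATE = "checked", "unchecked", "indeterminate"


@dataclass
class CodelistNode:
    uri: str
    children: Dict[str, str] = field(default_factory=dict)  # code uri -> checkbox state
    any_selected: bool = False
    collapsed: bool = True


def click_any(node: CodelistNode) -> None:
    """Select the codelist itself: children go indeterminate and collapse."""
    node.any_selected = True
    node.collapsed = True
    for code in node.children:
        node.children[code] = INDETERMINATE


def check_child(node: CodelistNode, code: str) -> None:
    """Check one code; if "any" was active the other children become unchecked."""
    if node.any_selected:
        node.any_selected = False
        for other in node.children:
            node.children[other] = UNCHECKED
    node.children[code] = CHECKED


def click_all(node: CodelistNode) -> None:
    """Select every child explicitly."""
    node.any_selected = False
    for code in node.children:
        node.children[code] = CHECKED
```

A "none" button would be the same loop as `click_all` with `UNCHECKED`.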

Alternatively we could put the all/any/none affordance as a select-box against each codelist and parent code, like on Nomis:

[Image: Nomis select-box example]

Robsteranium commented 3 years ago

This comment provides a few ideas for queries

Create code(list) tree

Find roots (top concepts) of a codelist:

{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "scheme": "http://business.data.gov.uk/companies/def/sic-2007/scheme"
          }
        }
      ],
      "must_not": [
        {
          "exists": {
            "field": "broader"
          }
        }
      ]
    }
  }
}

Find children of a code:

{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "scheme": "http://business.data.gov.uk/companies/def/sic-2007/scheme"
          }
        },
        {
          "term": {
            "broader": "http://business.data.gov.uk/companies/def/sic-2007/A"
          }
        }
      ]
    }
  }
}
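
These two queries are enough to walk a whole codelist. A Python sketch, where `search` is a hypothetical function that runs a query body against the codes index and returns the matching code URIs; it assumes the broader/narrower hierarchy has no cycles and issues one query per node, which is fine for illustration but probably too chatty for real use:

```python
from typing import Callable, Dict, List


def roots_query(scheme: str) -> dict:
    """Codes in the scheme that have no broader code."""
    return {"query": {"bool": {
        "must": [{"term": {"scheme": scheme}}],
        "must_not": [{"exists": {"field": "broader"}}],
    }}}


def children_query(scheme: str, parent: str) -> dict:
    """Codes in the scheme whose broader code is the given parent."""
    return {"query": {"bool": {
        "must": [{"term": {"scheme": scheme}}, {"term": {"broader": parent}}],
    }}}


def build_tree(scheme: str, search: Callable[[dict], List[str]]) -> List[Dict]:
    """Expand the codelist from its roots down, one query per node."""
    def expand(uri: str) -> Dict:
        children = search(children_query(scheme, uri))
        return {"uri": uri, "children": [expand(child) for child in children]}

    return [expand(root) for root in search(roots_query(scheme))]
```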

Ought to add the following to the codes index:

We might want to explore other mappings #49

Count observations by dataset that match code(list) selections

The monster below will find observations matching codes in any dimension, collapsing results to the first observation (identifying the relevant dimension for the links) and counting by dataset. We'd run one of these per facet:

{
  "collapse": {
    "field": "qb:dataSet.@id"
  },
  "fields": [
    "data/gss_data/trade/hmrc-alcohol-bulletin/alcohol-bulletin-clearances#dimension/period.@id",
    "data/gss_data/trade/hmrc-alcohol-bulletin/alcohol-bulletin-duty-receipts#dimension/period.@id",
    "data/gss_data/trade/hmrc-alcohol-bulletin/alcohol-bulletin-production#dimension/period.@id"
  ],
  "_source": false,
  "query": {
    "bool": {
      "should": [
        {
          "terms": {
            "data/gss_data/trade/hmrc-alcohol-bulletin/alcohol-bulletin-clearances#dimension/period.@id": [
              "http://reference.data.gov.uk/id/year/2019"
            ]
          }
        },
        {
          "terms": {
            "data/gss_data/trade/hmrc-alcohol-bulletin/alcohol-bulletin-duty-receipts#dimension/period.@id": [
              "http://reference.data.gov.uk/id/year/2019"
            ]
          }
        },
        {
          "exists": {
            "field": "data/gss_data/trade/hmrc-alcohol-bulletin/alcohol-bulletin-production#dimension/period.@id"
          }
        }
      ]
    }
  },
  "aggregations": {
    "ds_obs_count": {
      "terms": {
        "field": "qb:dataSet.@id"
      }
    }
  }
}

We might be able to extend this with more collapsing/ nesting to enumerate more codes and maybe with joins to get e.g. the codelist labels.

Alternatively we can merge the results with a code query in clojure.

Robsteranium commented 3 years ago

The above dataset-finding/ -describing query has some flaws but I think it's good enough for a start. I've created a separate issue for improvements #52.

Robsteranium commented 3 years ago

The interface asks users to select codes or codelists. We can encode the selections as a map from facet to a map from codelist to a (possibly empty) vector of codes:

{"Product" {"cl1" [], "cl2" ["c1","c2"]}
 "Geography" {"cl3" ["c3"]}}

We need to distinguish the facets for two reasons. First, the codelists could overlap between facets (e.g. trade partner/ reporter could use the same country list). Second, we ought to be able to rebuild the UI settings from the facet selections.

This doesn't contain any metadata - we only know the resource type by its URI's position in the map. This makes it harder to make backwards-compatible changes later e.g. to incorporate dimension selections (not impossible as the URI sets are disjoint, but we'd need to first determine the URI type by looking them up in each index!). A more explicit version that could express dimension or codelist selections would map from facet to an array of maps with keys for either :dimension or :codelist and optionally :codes

{"Product" [{ :codelist "cl1" }, { :codelist "cl2", :codes ["c1","c2"] }]
 "Geography" [{ :dimension "dim3", :codes ["c3"]}]}

I suspect this is premature generalisation.

The main dataset-finding/ -describing query needs to convert this into dimensions. We can infer the dimensions from the codelists (via ?dim qb:codeList ?cl). This could bring more of the facet's dimensions into scope than if we'd specified the dimension directly (as 2+ dimensions can use the same codelist).
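
For a single facet, that inference might look like the following Python sketch. `dimensions_by_codelist` stands in for the result of the qb:codeList lookup and `facet_dimensions` for the dimensions the facet covers; both names are assumptions.

```python
from typing import Dict, Iterable, List


def dimensions_in_scope(
    facet_selection: Dict[str, List[str]],
    facet_dimensions: Iterable[str],
    dimensions_by_codelist: Dict[str, List[str]],
) -> Dict[str, List[str]]:
    """Map every dimension of the facet that uses a selected codelist to the chosen codes."""
    allowed = set(facet_dimensions)
    in_scope: Dict[str, List[str]] = {}
    for codelist, codes in facet_selection.items():
        for dimension in dimensions_by_codelist.get(codelist, []):
            # only dimensions that belong to this facet are brought into scope
            if dimension in allowed:
                in_scope[dimension] = list(codes)
    return in_scope
```

If two of the facet's dimensions share a codelist, a single codelist selection brings both of them into scope, which is the widening effect described above.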

For parameterising the main query, given that dimensions shouldn't overlap between facets and a dimension shouldn't be specified with and without values, we can possibly just have a map from dimension to a (possibly empty) vector of codes (that merges across all the facets). For example:

{ "dim1" [], "dim2" ["c1" "c2"], "dim3" ["c3"] }
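
A Python sketch of how that map could be built and turned into the dataset-finding query body. It assumes the map keys are the observation index's dimension field names without the ".@id" suffix; an empty vector becomes an "exists" clause and a non-empty one becomes a "terms" clause. The function names are invented.

```python
from typing import Dict, List


def merge_facets(per_facet: Dict[str, Dict[str, List[str]]]) -> Dict[str, List[str]]:
    """Flatten facet -> dimension -> codes into dimension -> codes."""
    merged: Dict[str, List[str]] = {}
    for dimensions in per_facet.values():
        for dimension, codes in dimensions.items():
            if dimension in merged:
                # dimensions shouldn't overlap between facets
                raise ValueError(f"{dimension} appears in more than one facet")
            merged[dimension] = list(codes)
    return merged


def dataset_query(dimension_codes: Dict[str, List[str]]) -> dict:
    """Build the collapse-and-count query body from a dimension -> codes map."""
    should = []
    for dimension, codes in dimension_codes.items():
        field = f"{dimension}.@id"
        if codes:
            should.append({"terms": {field: list(codes)}})
        else:
            # no codes chosen: any value on this dimension will do
            should.append({"exists": {"field": field}})
    return {
        "collapse": {"field": "qb:dataSet.@id"},
        "fields": [f"{dimension}.@id" for dimension in dimension_codes],
        "_source": False,
        "query": {"bool": {"should": should}},
        "aggregations": {"ds_obs_count": {"terms": {"field": "qb:dataSet.@id"}}},
    }
```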

The response can be interpreted to distinguish the facets by looking at the collapsed results (i.e. looking at the first result(s) by dataset to find examples). We'd end up with something like this for each dataset and facet:

[{:ook/uri "data-by-area",
  :facets [{:name "location",
            :dimensions [{:ook/uri "area",
                          :codelist {:ook/uri "areas",
                                     :label "Areas"
                                     :examples [{:ook/uri "manchester"
                                                 :label "Manchester"}]}}]}]}]

The view can merge across dataset such that the cells look like "Areas (e.g. Manchester)".
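
A Python sketch of that cell formatting, using plain dicts with string keys in place of the keyword maps above (so "label" for :label etc.); the function name is invented:

```python
from typing import Dict, List


def facet_cell(dataset: Dict, facet_name: str) -> str:
    """Render the table cell for one dataset and facet."""
    parts: List[str] = []
    for facet in dataset.get("facets", []):
        if facet["name"] != facet_name:
            continue
        for dimension in facet.get("dimensions", []):
            codelist = dimension["codelist"]
            examples = ", ".join(e["label"] for e in codelist.get("examples", []))
            text = f'{codelist["label"]} (e.g. {examples})' if examples else codelist["label"]
            # several dimensions can share a codelist; show each one once
            if text not in parts:
                parts.append(text)
    return ", ".join(parts)
```

A dataset with no match for the facet gets an empty cell.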