Swirrl / ook

Structural search engine
https://search-prototype.gss-data.org.uk/
Eclipse Public License 1.0

Investigate why some cells have no values #58

Closed Robsteranium closed 3 years ago

Robsteranium commented 3 years ago

With all geo codelists selected, some rows (2 and 5) have no cell values in the geo column.

Not sure how this could be. Maybe their dimension points to a codelist but they don't use any of its values...

Robsteranium commented 3 years ago

This is being caused by some codes being in more than one scheme, e.g.

{
  "@id": "http://data.europa.eu/nuts/code/UKC",
  "label": "NORTH EAST (ENGLAND)",
  "scheme": [
    "http://data.europa.eu/nuts/scheme/2010",
    "http://data.europa.eu/nuts/scheme/2016",
    "http://data.europa.eu/nuts/scheme/2013",
    "data/gss_data/trade/ons-international-trade-in-services-by-subnational-areas-of-the-uk#scheme/location",
    "data/gss_data/trade/ons-international-exports-of-services-from-subnational-areas-of-the-uk#scheme/service-origin-geography",
    "data/gss_data/trade/ons-quarterly-country-and-regional-gdp#scheme/reference-area"
  ]
}

This breaks the assumption that we can group codes by codelist in the cells.

We could of course still do this, but then the same UKC code would appear 6 times (once under each scheme). Indeed we can already see that the same code is used in other dataset-specific schemes in other rows - that's the very purpose of the table!
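To make the duplication concrete, here is a minimal Python sketch (not the project's own code) that groups the example code by scheme. The record is copied from the JSON above; `group_by_scheme` is a hypothetical helper introduced only for illustration.

```python
from collections import defaultdict

# Code record copied from the example above (only the fields needed here).
code = {
    "@id": "http://data.europa.eu/nuts/code/UKC",
    "label": "NORTH EAST (ENGLAND)",
    "scheme": [
        "http://data.europa.eu/nuts/scheme/2010",
        "http://data.europa.eu/nuts/scheme/2016",
        "http://data.europa.eu/nuts/scheme/2013",
        "data/gss_data/trade/ons-international-trade-in-services-by-subnational-areas-of-the-uk#scheme/location",
        "data/gss_data/trade/ons-international-exports-of-services-from-subnational-areas-of-the-uk#scheme/service-origin-geography",
        "data/gss_data/trade/ons-quarterly-country-and-regional-gdp#scheme/reference-area",
    ],
}


def group_by_scheme(codes):
    """Hypothetical helper: map each scheme URI to the labels of its codes."""
    groups = defaultdict(list)
    for c in codes:
        for scheme in c["scheme"]:
            groups[scheme].append(c["label"])
    return dict(groups)


groups = group_by_scheme([code])
# Count how often the single UKC label shows up across all scheme groups.
appearances = sum(labels.count(code["label"]) for labels in groups.values())
print(appearances)  # 6
```

A single code ends up in six groups, so a cell laid out by codelist would repeat it six times.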

We could try to filter the list of schemes to those relevant, e.g. removing dataset-specific schemes that belong to other datasets. Even if we could easily determine this, we would still have the multiple harmonised schemes (here one per NUTS version). This might be useful information, but it's not particularly relevant to the dataset search/comparison because the filters themselves already express all the user cares about regarding codelist versions (whether their code of interest is present).

We might just need to remove the codelist grouping altogether. This grouping is less important given that mixing schemes within datasets will be rarer than between them. We could still possibly provide this information (e.g. with a popover) but not use it to structure the layout. Instead we'd just show an ellipsised list of codes.
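As a rough sketch of what an ellipsised list of codes could look like, assuming a cell only needs the code labels and a display limit. The function name `ellipsise`, the default limit of 3 and the placeholder labels other than the first are all assumptions for illustration.

```python
def ellipsise(labels, limit=3):
    """Join up to `limit` labels; append an ellipsis if any were left out."""
    shown = ", ".join(labels[:limit])
    if len(labels) > limit:
        return shown + ", ..."
    return shown


# Only the first label comes from the example above; the rest are placeholders.
labels = ["NORTH EAST (ENGLAND)", "label-2", "label-3", "label-4"]
print(ellipsise(labels))
print(ellipsise(labels[:1]))
```

The scheme information is not used for layout here; it could still be attached to each code for a popover.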

The facet match would then have the codelist level removed, looking instead like:

{:facets
  ({:name "Geography",
    :dimensions
    ({:ook/uri
      "data/gss_data/trade/ons-quarterly-country-and-regional-gdp#dimension/reference-area",
      :codes
      ({:ook/uri "http://data.europa.eu/nuts/code/UKC",
        :ook/type "skos:Concept",
        :priority ["2" "6"],
        :label "NORTH EAST (ENGLAND)",
        :narrower
        ["http://data.europa.eu/nuts/code/UKC1"
         "http://data.europa.eu/nuts/code/UKC2"],
        :broader
        ["http://data.europa.eu/nuts/code/UK"
         "data/gss_data/trade/international-trade-in-services-by-subnational-areas-of-the-uk#concept-scheme/location/nuts"],
        :notation "UKC",
        :scheme
        ["http://data.europa.eu/nuts/scheme/2010"
         "http://data.europa.eu/nuts/scheme/2016"
         "http://data.europa.eu/nuts/scheme/2013"
         "data/gss_data/trade/ons-international-trade-in-services-by-subnational-areas-of-the-uk#scheme/location"
         "data/gss_data/trade/ons-international-exports-of-services-from-subnational-areas-of-the-uk#scheme/service-origin-geography"
         "data/gss_data/trade/ons-quarterly-country-and-regional-gdp#scheme/reference-area"],
        :used "false"})})})}

In fact we might like to enrich this with codelist labels if we're going to show them in a popover.

Robsteranium commented 3 years ago

Ok, working this through... it gets confusing because schemes can be mixed within a facet even with a 1:1 dimension:codelist mapping, since the facet combines dimensions. We can distinguish these by using dimensions as the grouping variable (rather than codelists as originally planned).
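A minimal Python sketch of grouping by dimension, assuming matches arrive as (dimension URI, code URI) pairs. The pair shape, the helper name `group_codes_by_dimension` and the shortened dimension URIs are hypothetical; only the UKC code URI is taken from the example above.

```python
from collections import defaultdict


def group_codes_by_dimension(matches):
    """Group code URIs under the dimension they were matched on, without repeats."""
    groups = defaultdict(list)
    for dimension, code in matches:
        if code not in groups[dimension]:
            groups[dimension].append(code)
    return dict(groups)


UKC = "http://data.europa.eu/nuts/code/UKC"

# Hypothetical dimension URIs for two datasets sharing one Geography facet.
matches = [
    ("dataset-a#dimension/reference-area", UKC),
    ("dataset-b#dimension/location", UKC),
    ("dataset-a#dimension/reference-area", UKC),  # repeated match, ignored
]

groups = group_codes_by_dimension(matches)
for dimension, codes in groups.items():
    print(dimension, codes)
```

Each code appears once per dimension regardless of how many schemes it belongs to, which is what keeps the cells unambiguous.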

Robsteranium commented 3 years ago

We've now used dimension as a grouping variable and lifted the query size limits. This seems to fill most of the blanks but some remain.

e.g. this search for Germany doesn't seem to include an example code for the "ONS UK total trade" dataset. The count is correct (filters observations for Germany) but the cell is blank and the link is wrong.

Robsteranium commented 3 years ago

This can sometimes be caused by sparsity, e.g. this search shows a dataset which does include "BOP Services" and "Exports", but the first-matched observation for "BOP Services: Net financial transactions" doesn't match "Flow: Exports".

There may sometimes be no single observation that matches both, or it might be that the collapse just doesn't happen to find one that does (which might be solved by #52).
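The sparsity effect can be illustrated with a toy Python model. This is not the actual query or collapse logic: the field names, the "Imports" value and the helper `cells_from` are assumptions made up for the sketch.

```python
# Toy observations for one dataset; the "Imports" row is invented for illustration.
observations = [
    {"bop_services": "Net financial transactions", "flow": "Imports"},
    {"bop_services": "Net financial transactions", "flow": "Exports"},
]

# One set of wanted values per facet, mirroring the two filters described above.
filters = {
    "bop_services": {"Net financial transactions"},
    "flow": {"Exports"},
}


def cells_from(observation, filters):
    """Per facet, return the observation's value if it passes the filter, else None."""
    return {
        facet: observation[facet] if observation[facet] in wanted else None
        for facet, wanted in filters.items()
    }


def satisfied(observation, filters):
    """Count how many facet filters this observation passes."""
    return sum(observation[facet] in wanted for facet, wanted in filters.items())


# Taking the first observation that passes any filter leaves the flow cell blank.
first = next(o for o in observations if satisfied(o, filters) > 0)
print(cells_from(first, filters))

# Preferring the observation that passes the most filters fills both cells,
# but only when such an observation exists in the data at all.
best = max(observations, key=lambda o: satisfied(o, filters))
print(cells_from(best, filters))
```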

Robsteranium commented 3 years ago

I've got a draft implementation for #52 which doesn't appear to solve either of the above two cases :frowning_face:

Robsteranium commented 3 years ago

I've recreated the example from above with all geo codelists selected using the latest data from the beta environment. Now all the cells are populated.

Robsteranium commented 3 years ago

Redoing the above example for Germany with the new data confirms this is still a problem.

Robsteranium commented 3 years ago

Each of the previous examples is now solved on #68 (this mostly consists of increasing the default query size from 10).

One example was due to the child-dimension not being tied to the facet's parent dimension via rdfs:subPropertyOf.
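As a sketch of why the missing link matters, assume the facet finds its dimensions by walking rdfs:subPropertyOf from each child dimension up to the facet's parent dimension. The URIs below, the single-parent mapping and the helper `is_under` are hypothetical, and the real lookup may work differently.

```python
def is_under(dimension, parent, sub_property_of):
    """Follow child -> parent links and report whether `parent` is reached."""
    seen = set()
    while dimension is not None and dimension not in seen:
        if dimension == parent:
            return True
        seen.add(dimension)  # guard against cycles in the data
        dimension = sub_property_of.get(dimension)
    return False


# Hypothetical URIs: one child dimension is linked to the parent, one is not.
PARENT = "example#dimension/geography"
sub_property_of = {
    "dataset-a#dimension/reference-area": PARENT,
    # "dataset-b#dimension/location" has no rdfs:subPropertyOf entry
}

print(is_under("dataset-a#dimension/reference-area", PARENT, sub_property_of))
print(is_under("dataset-b#dimension/location", PARENT, sub_property_of))
```

A dimension without the link is never associated with the facet, so its codes cannot populate the cell even though the observations exist.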

Closing for now but we can re-open if new examples appear.