#25 would give us semantic search - i.e. code search within facets via sentence embeddings.

We could build upon these embeddings to do entity linking across all codes. This would ignore the dimension-property configuration we currently have for each facet - instead the recognised entities would define the set of dimensions involved in a given search.

The UI would start by presenting an open-text search, much like Google.

Any entities we're able to recognise (and link to resources in our database) would form columns (like the facets we have now), labelled as per the user's query.

[Screenshot: Screen Shot 2021-05-06 at 16 19 20]

The above shows each column also describing the dimension (i.e. that Germany has been interpreted with the Partner Geography dimension). We've since included the dimension in each cell. In any case, we might want to allow the user to see/customise the interpretation in the edit dialogue.

This would require an advanced natural-language-understanding pipeline, but would obviate the need to curate Q&A forms (#82) or a facet configuration (thus it would automatically work across all families).

Robsteranium commented 3 years ago

We don't necessarily need to have embeddings or entity-recognition to try this.

We could start with a simple NLP pipeline that would:

- receive an open-text query as input e.g. "import cars from Germany 2019"
- tokenise it e.g. ["import" "cars" "from" "Germany" "2019"] (possibly with downcasing)
- remove stop words e.g. ["import" "cars" "Germany" "2019"]
- for each token, find codes whose labels match
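
The steps above can be sketched in a few lines of Python. This is only an illustration under assumptions: the stop-word list, the code identifiers and labels, and the function names are all hypothetical, and prefix matching stands in for whatever label matching we'd really use.

```python
import re

# Hypothetical stop-word list and code labels, for illustration only.
STOP_WORDS = {"from", "of", "the", "to", "in", "and"}

CODE_LABELS = {
    "flow/imports": "Imports",
    "product/cars": "Cars",
    "geography/germany": "Germany",
    "year/2019": "2019",
}


def tokenise(query):
    """Split the query into word tokens and downcase them."""
    return [token.lower() for token in re.findall(r"\w+", query)]


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [token for token in tokens if token not in STOP_WORDS]


def find_codes(token, labels=CODE_LABELS):
    """Return ids of codes with a label word that starts with the token."""
    return sorted(
        code
        for code, label in labels.items()
        if any(word.startswith(token) for word in tokenise(label))
    )


def search(query):
    """Map each remaining query token to the codes it matched."""
    return {token: find_codes(token) for token in remove_stop_words(tokenise(query))}


result = search("import cars from Germany 2019")
```

With the sample data, `result` maps "import", "cars", "germany" and "2019" to one code each, and "from" is dropped as a stop word.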

We could then derive facet-selections from the output (identifying via traversal code --> scheme --> dimension --> parent --> facet).

Ultimately we could do away with the facet configuration by creating facets on the fly based upon the parent (or orphan) dimensions entailed by the query (so you could edit the results with the matched parent-dimensions or expand the range of parent-dimensions with a further query).

In the meantime it would be useful to track which matches didn't fit into facets - not least for improving the property hierarchy.
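
As a rough sketch of the traversal (code --> scheme --> dimension --> parent --> facet) and of tracking the matches that don't land in a facet: the lookup tables, identifiers and function names below are hypothetical stand-ins for links that would really come from the database.

```python
# Hypothetical lookup tables, for illustration only.
CODE_TO_SCHEME = {
    "geography/germany": "scheme/geography",
    "product/cars": "scheme/products",
    "unit/tonnes": "scheme/units",
}
SCHEME_TO_DIMENSION = {
    "scheme/geography": "dim/partner-geography",
    "scheme/products": "dim/product",
    "scheme/units": "dim/unit",
}
DIMENSION_TO_PARENT = {
    "dim/partner-geography": "dim/area",
    "dim/product": "dim/product",
}
PARENT_TO_FACET = {"dim/area": "facet/place"}

TRAVERSAL = (CODE_TO_SCHEME, SCHEME_TO_DIMENSION, DIMENSION_TO_PARENT, PARENT_TO_FACET)


def facet_for(code):
    """Follow code -> scheme -> dimension -> parent -> facet; None if a link is missing."""
    node = code
    for table in TRAVERSAL:
        node = table.get(node)
        if node is None:
            return None
    return node


def derive_selections(codes):
    """Group matched codes by facet and collect those that fit no facet."""
    selections, unplaced = {}, []
    for code in codes:
        facet = facet_for(code)
        if facet is None:
            unplaced.append(code)
        else:
            selections.setdefault(facet, []).append(code)
    return selections, unplaced


selections, unplaced = derive_selections(
    ["geography/germany", "product/cars", "unit/tonnes"]
)
```

Here only the Germany code reaches a facet; the other two end up in `unplaced`, which is the list we'd want to log when reviewing the property hierarchy.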

Robsteranium commented 3 years ago

Indeed we might be able to implement that pipeline as an ES Analyser - then it'd just be a single match query against the codes index.
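
A sketch of what that could look like, written as Python dicts for the request bodies. The analyser name `code_query` and the `label` field are assumptions; the `standard` tokenizer and the `lowercase` and `stop` token filters are built into Elasticsearch (which spells the setting `analyzer`).

```python
import json

# Body for creating the codes index. "code_query" and "label" are hypothetical names.
INDEX_SETTINGS = {
    "settings": {
        "analysis": {
            "analyzer": {
                "code_query": {
                    "type": "custom",
                    "tokenizer": "standard",          # tokenise
                    "filter": ["lowercase", "stop"],  # downcase, then remove stop words
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "label": {"type": "text", "analyzer": "code_query"},
        }
    },
}


def match_query(text):
    """Build a single match query against the code label field."""
    return {"query": {"match": {"label": {"query": text}}}}


if __name__ == "__main__":
    print(json.dumps(match_query("import cars from Germany 2019"), indent=2))
```

Because the analyser is set on the field mapping, the same tokenising, downcasing and stop-word removal would apply to both the indexed labels and the query text.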

Robsteranium commented 3 years ago

I've created https://github.com/Swirrl/cogs-issues/issues/289 for moving this forward.

Robsteranium commented 2 years ago

We've now agreed to go with the Google-style UI. We might return to a faceted comparison later.

I suggest we create a new view for now (instead of deleting lots of code).

We basically want a single search box with one cube per result. The rich snippet beneath each shows the dimensions and a sample of values that match.

We could extend this to show all dimensions present in the cube (which could play a role in the decision between cubes). We could use a visual cue (like a background colour) to highlight those that matched. Alternatively we could only show those dimensions that matched.

We could extend highlighting even further to show matches within code labels using the highlighting feature from Elasticsearch.
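
For reference, a highlighted search request and the way fragments come back could look roughly like this. The `label` field name, the `<mark>` tags and the sample response are assumptions written by hand for illustration; they are not output from our index.

```python
def highlighted_match(text, field="label"):
    """Build a match query that also asks Elasticsearch to highlight the field."""
    return {
        "query": {"match": {field: text}},
        "highlight": {
            "fields": {field: {}},
            "pre_tags": ["<mark>"],
            "post_tags": ["</mark>"],
        },
    }


def highlighted_labels(response, field="label"):
    """Collect highlight fragments per hit, falling back to the plain label."""
    labels = []
    for hit in response["hits"]["hits"]:
        fragments = hit.get("highlight", {}).get(field)
        labels.append(fragments[0] if fragments else hit["_source"][field])
    return labels


# Hand-written response in the shape of a search response (illustrative only).
sample_response = {
    "hits": {
        "hits": [
            {"_source": {"label": "Germany"}, "highlight": {"label": ["<mark>Germany</mark>"]}},
            {"_source": {"label": "France"}},
        ]
    }
}

labels = highlighted_labels(sample_response)
```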

Note that this UI doesn't need facets necessarily. Options:

- Retain facets and facet configuration, show facets on rich snippets. This would mean more consistency across cubes: each would show the same set of dimensions with the same facet label (instead of dataset-specific variations).
- Drop facets and show parent dimensions on snippets. This would mean cubes could show a different set of dimensions - being less consistent but allowing us to include any cubes and not just those for which facets had been configured. The labelling would at least be consistent (as we're looking at parent dimensions).
- Drop facets and show dataset-specific dimensions. This requires no harmonisation (of parent dimensions) or facet configuration. The UI would show all the different labels used in different datasets.

We might also look to extend the search to match against non-cube-structural elements like the dataset description etc.

Robsteranium commented 2 years ago

I've begun work on a global-search branch for this.

At the moment this only goes query -> codes -> codelists -> dimensions -> datasets. Initial testing shows that this isn't going to work, as shared codelists lead to false-positive code matches against each dataset. This wouldn't be too bad if we were only suggesting cubes that might have data (and later turned out not to), but it also breaks the PMD links (leading to zero observations and a dead-end state where you can't unlock filters).

We will need to revise the logic to go: query -> codes -> observations -> datasets as per the original logic with facets.
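
A toy illustration of the difference between the two routes, with the dimension hop left out for brevity. The datasets, codelist and observations are made up; the point is only that a shared codelist makes every dataset using it look like a match.

```python
# Made-up data: both datasets share one geography codelist,
# but only "trade" has observations for Germany.
CODELIST_CODES = {"codelist/geography": {"germany", "france"}}
DATASET_CODELISTS = {
    "trade": {"codelist/geography"},
    "tourism": {"codelist/geography"},
}
OBSERVATIONS = [
    {"dataset": "trade", "codes": {"germany"}},
    {"dataset": "trade", "codes": {"france"}},
    {"dataset": "tourism", "codes": {"france"}},
]


def datasets_via_codelists(code):
    """code -> codelists -> datasets: matches any dataset that uses the codelist."""
    codelists = {cl for cl, codes in CODELIST_CODES.items() if code in codes}
    return sorted(ds for ds, used in DATASET_CODELISTS.items() if used & codelists)


def datasets_via_observations(code):
    """code -> observations -> datasets: matches only datasets with data for the code."""
    return sorted({obs["dataset"] for obs in OBSERVATIONS if code in obs["codes"]})


via_codelists = datasets_via_codelists("germany")
via_observations = datasets_via_observations("germany")
```

The codelist route returns both datasets for "germany" (a false positive for tourism), whereas the observation route returns only trade.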

I think we need to tackle the following before we can release it:

[x] empty state - difference between no search and no results!
[x] pmd links
[x] remove stop words from query before searching ("of" etc)
[x] result ranking
  - datasets with more dimensions with matches should come higher
  - more codes is better than fewer but +1 dimension is better than +1 code
  - could look at code scores provided by ES
[x] rewrite search to use observations instead of just codelists
[x] include counts on results view
[x] visual hierarchy - snippet ought to be closer to title
[x] deploy/ run etl/ switch LB
[ ] extend search to match title, description, publisher etc
[ ] highlight match of terms within codes
[ ] pagination
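
One way to read the result-ranking notes in the list above (an extra matched dimension always beats an extra matched code) is a sort on the pair (matched dimensions, matched codes). This is a sketch only: the input shape and names are assumptions, and it ignores the code scores provided by ES.

```python
def rank(results):
    """Order datasets by matched dimensions, then matched codes, then name.

    `results` maps a dataset name to {dimension: [matched codes]} (hypothetical shape).
    """
    def sort_key(item):
        name, matches = item
        n_dimensions = sum(1 for codes in matches.values() if codes)
        n_codes = sum(len(codes) for codes in matches.values())
        # Negate the counts so that larger counts sort first.
        return (-n_dimensions, -n_codes, name)

    return [name for name, _ in sorted(results.items(), key=sort_key)]


ranking = rank({
    "a": {"geography": ["germany", "france", "italy"]},
    "b": {"geography": ["germany"], "year": ["2019"]},
    "c": {"geography": ["germany"]},
})
```

With this sample input, "b" comes first because it matches two dimensions, even though "a" matches more codes.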

Swirrl / ook

Natural-language search #84
