Swirrl / ook

Structural search engine
https://search-prototype.gss-data.org.uk/
Eclipse Public License 1.0

Natural-language search #84

Open Robsteranium opened 3 years ago

Robsteranium commented 3 years ago

This extends the ideas from the original mockups.

25 would give us semantic search - i.e. code search within facets via sentence embeddings.

We could build upon these embeddings to do entity linking across all codes. This would ignore the dimension-property configuration we currently have for each facet - instead the recognised entities would define the set of dimensions involved for a given search.

The UI would start by presenting an open-text search, much like Google.

Any entities we're able to recognise (and link to resources in our database) would form columns (like the facets we have now), labelled as per the user's query.

[Screenshot: Screen Shot 2021-05-06 at 16 19 20]

The above shows each column also describing the dimension (i.e. that Germany has been interpreted with the Partner Geography dimension). We've since included the dimension in each cell. In any case we might want to allow the user to see/customise the interpretation in the edit dialogue.

This would require an advanced natural-language-understanding pipeline, but would obviate the need to curate Q&A forms (#82) or a facet configuration (thus it would automatically work across all families).

Robsteranium commented 3 years ago

We don't necessarily need to have embeddings or entity-recognition to try this.

We could start with a simple NLP pipeline that would:

We could then derive facet-selections from the output (identifying via traversal code --> scheme --> dimension --> parent --> facet).

Ultimately we could do away with the facet configuration by creating facets on the fly based upon the parent (or orphan) dimensions entailed by the query (so you could edit the results with the matched parent-dimensions or expand the range of parent-dimensions with a further query).

In the meantime it would be useful to track which matches didn't fit into facets - not least for improving the property hierarchy.
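A minimal sketch of that traversal, including the tracking of matches that don't fit into any facet. The lookup tables, identifiers and the `facet_selections` function are all hypothetical stand-ins for whatever the database actually holds:

```python
# Hypothetical lookup tables standing in for the real database/indices.
CODE_TO_SCHEME = {"code/germany": "scheme/countries"}
SCHEME_TO_DIMENSIONS = {"scheme/countries": ["dim/partner-geography"]}
DIMENSION_TO_PARENT = {"dim/partner-geography": "dim/geography"}
PARENT_TO_FACET = {"dim/geography": "Geography"}


def facet_selections(matched_codes):
    """Walk code -> scheme -> dimension -> parent -> facet for each code.

    Returns (selections, unmatched): selections maps a facet name to the
    codes that landed in it; unmatched lists codes that reached no facet.
    """
    selections = {}
    unmatched = []
    for code in matched_codes:
        scheme = CODE_TO_SCHEME.get(code)
        facets = set()
        for dimension in SCHEME_TO_DIMENSIONS.get(scheme, []):
            # An orphan dimension (no parent) is treated as its own parent.
            parent = DIMENSION_TO_PARENT.get(dimension, dimension)
            facet = PARENT_TO_FACET.get(parent)
            if facet is not None:
                facets.add(facet)
        if facets:
            for facet in sorted(facets):
                selections.setdefault(facet, []).append(code)
        else:
            # Worth logging: these point at gaps in the property hierarchy.
            unmatched.append(code)
    return selections, unmatched
```

Calling `facet_selections(["code/germany", "code/unknown"])` puts the first code under the `Geography` facet and reports the second as unmatched.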

Robsteranium commented 3 years ago

Indeed we might be able to implement that pipeline as an ES Analyser - then it'd just be a single match query against the codes index.
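As a rough sketch of what that could look like, here are the request bodies built as plain Python dicts. The analyser name `code_label`, the field name `label` and the choice of token filters are assumptions for illustration only, not the pipeline steps actually intended:

```python
# Index settings: a custom analyser applied to the (hypothetical) label field.
# The filter chain below is illustrative; swap in whatever steps the
# pipeline really needs.
CODES_INDEX_SETTINGS = {
    "settings": {
        "analysis": {
            "analyzer": {
                "code_label": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "stop", "porter_stem"],
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "label": {"type": "text", "analyzer": "code_label"}
        }
    },
}


def match_query(user_query):
    """Build a single match query; ES analyses the query text with the
    same analyser as the field unless a search analyser is configured."""
    return {"query": {"match": {"label": {"query": user_query}}}}
```

With the analysis done inside Elasticsearch, the application side reduces to sending `match_query("imports from germany")` to the codes index.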

Robsteranium commented 3 years ago

I've created https://github.com/Swirrl/cogs-issues/issues/289 for moving this forward.

Robsteranium commented 2 years ago

We've now agreed to go with the Google-style UI. We might return to a faceted comparison later.

I suggest we create a new view for now (instead of deleting lots of code).

We basically want a single search box with one cube per result. The rich snippet beneath each shows the dimensions and a sample of values that match.

We could extend this to show all dimensions present in the cube (which could play a role in the decision between cubes). We could use a visual cue (like a background colour) to highlight those that matched. Alternatively we could only show those dimensions that matched.

We could extend highlighting even further to show matches within code labels using the highlighting feature from Elasticsearch.
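A sketch of how the highlighting request and the handling of its response might look. The field name `label` is an assumption, and `SAMPLE_RESPONSE` is a hand-written fragment in the shape Elasticsearch returns rather than real output:

```python
def highlighted_search(user_query):
    """Match query plus a highlight section for the (hypothetical) label field."""
    return {
        "query": {"match": {"label": {"query": user_query}}},
        "highlight": {"fields": {"label": {}}},
    }


def highlighted_labels(response):
    """Collect highlight fragments from each hit, skipping hits without any."""
    return [
        fragment
        for hit in response["hits"]["hits"]
        for fragment in hit.get("highlight", {}).get("label", [])
    ]


# Hand-written illustration: by default ES wraps matched terms in <em> tags.
SAMPLE_RESPONSE = {
    "hits": {
        "hits": [
            {
                "_source": {"label": "Germany"},
                "highlight": {"label": ["<em>Germany</em>"]},
            },
            {"_source": {"label": "France"}},
        ]
    }
}
```

The fragments could then be rendered directly in the rich snippet beneath each cube.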

Note that this UI doesn't need facets necessarily. Options:

We might also look to extend the search to match against non-cube-structural elements like the dataset description etc.

Robsteranium commented 2 years ago

I've begun work on a global-search branch for this.

At the moment this only goes query -> codes -> codelists -> dimensions -> datasets. Initial testing demonstrates that this isn't going to work as shared-codelists lead to false positive code matches against each dataset. This wouldn't be too bad if we were suggesting cubes that might have data (which later turned out not to) but it also breaks the PMD-links (leading to zero observations and a dead-end state where you can't unlock filters).

We will need to revise the logic to go: query -> codes -> observations -> datasets as per the original logic with facets.
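To illustrate the difference, here is a toy example with made-up data in which two datasets share one codelist but only one of them actually has observations for the matched code. All identifiers and both functions are hypothetical:

```python
# Made-up data: both datasets use a dimension backed by the same codelist.
CODELIST_CODES = {"codelist/countries": {"germany", "france"}}
DIMENSION_CODELIST = {"dim/partner-geography": "codelist/countries"}
DATASET_DIMENSIONS = {
    "dataset/a": {"dim/partner-geography"},
    "dataset/b": {"dim/partner-geography"},
}
# Only dataset/a has observations for Germany.
OBSERVATIONS = [
    {"dataset": "dataset/a", "codes": {"germany"}},
    {"dataset": "dataset/b", "codes": {"france"}},
]


def datasets_via_codelists(matched_codes):
    """query -> codes -> codelists -> dimensions -> datasets."""
    codelists = {
        codelist
        for codelist, codes in CODELIST_CODES.items()
        if codes & matched_codes
    }
    dimensions = {
        dimension
        for dimension, codelist in DIMENSION_CODELIST.items()
        if codelist in codelists
    }
    return {
        dataset
        for dataset, dims in DATASET_DIMENSIONS.items()
        if dims & dimensions
    }


def datasets_via_observations(matched_codes):
    """query -> codes -> observations -> datasets."""
    return {
        observation["dataset"]
        for observation in OBSERVATIONS
        if observation["codes"] & matched_codes
    }
```

For a match on `germany`, the codelist route returns both datasets (a false positive for `dataset/b`), whereas the observation route returns only `dataset/a`.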

I think we need to tackle the following before we can release it: