Swirrl / ook

Structural search engine
https://search-prototype.gss-data.org.uk/
Eclipse Public License 1.0

Harmonised codelists #81

Open Robsteranium opened 3 years ago

Robsteranium commented 3 years ago

The range of codelists makes it hard to pick codes when browsing or searching.

Where there's a formal, third-party taxonomy - like SITC vs CPA - it will be important to show these distinctions even if some users will only really care about finding a code for their purpose (regardless of which scheme it comes from).

In other cases the distinctions are essentially incidental: they exist only because the work needed to harmonise and formalise the codes hasn't been done. These distinctions aren't of any interest or use to anyone.

In both cases, it might be useful to have a set of per-facet harmonised codelists. We could use these to present one set of options to the user (the harmonised codes) and interpret these with multiple equivalents (the original codes) behind the scenes. Thus the facet controls would use a different set of codes to those in the results table and PMD links.
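
To make the idea concrete, here's a minimal sketch of how a harmonised code could be expanded into its original equivalents before querying. Everything in it is hypothetical: the `HARMONISED` mapping, the function names and the placeholder code identifiers are made up for illustration and aren't taken from ook or from any real scheme.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical mapping for one facet: each harmonised code (shown in the facet
# controls) points at the original codes (shown in results and PMD links).
# All identifiers here are invented placeholders, not real scheme codes.
HARMONISED: Dict[str, Set[str]] = {
    "harmonised:vehicles": {"scheme-a:veh", "scheme-b:motor"},
    "harmonised:food": {"scheme-a:food", "scheme-b:agri"},
}


def expand_selection(selected: Iterable[str]) -> Set[str]:
    """Translate harmonised codes picked in a facet into original codes."""
    originals: Set[str] = set()
    for code in selected:
        # Unknown harmonised codes simply contribute nothing.
        originals |= HARMONISED.get(code, set())
    return originals


def matching_datasets(
    selected: Iterable[str], dataset_codes: Dict[str, Set[str]]
) -> List[str]:
    """Return ids of datasets that use at least one equivalent original code."""
    wanted = expand_selection(selected)
    return sorted(ds for ds, codes in dataset_codes.items() if codes & wanted)


# Hypothetical datasets, each described by the original codes it uses.
datasets = {
    "trade-by-commodity": {"scheme-a:veh", "scheme-a:food"},
    "output-by-product": {"scheme-b:motor"},
    "farm-statistics": {"scheme-b:agri"},
}
print(matching_datasets(["harmonised:vehicles"], datasets))
```

The user only ever picks `harmonised:vehicles`; the datasets are still matched, and would still be displayed, by their original codes.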

We would lose any guarantee of MECE-ness (the codes being mutually exclusive and collectively exhaustive), but this is less of a concern for discovery than for e.g. aggregation.

It would be frustrating to find specific codes relevant to your interests that appear to be common across datasets but actually aren't. As such, we might want to err on the side of having fewer, broader harmonised codes, so that we present fewer choices and each choice is quite permissive in the datasets it could match.

There are quite a few technologies we could explore for matching, not least splink, but it might be interesting to see whether sentence embeddings could support this (instead of token comparison) #25.
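
For reference, the token-comparison baseline could be as simple as the sketch below: Jaccard overlap between the word tokens of two code labels, used to propose candidate equivalents across two codelists. This is only an assumed baseline written with the standard library; it isn't splink's method or an embedding approach, and the function names, labels and `threshold` default are all hypothetical.

```python
import re
from typing import Dict, List, Set, Tuple


def tokens(label: str) -> Set[str]:
    """Lower-case a label and split it into alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", label.lower()))


def jaccard(a: str, b: str) -> float:
    """Share of tokens two labels have in common (0.0 if either is empty)."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def candidate_matches(
    labels_a: Dict[str, str], labels_b: Dict[str, str], threshold: float = 0.5
) -> List[Tuple[str, str, float]]:
    """Pair up codes from two lists whose labels overlap enough, best first."""
    pairs = []
    for code_a, label_a in labels_a.items():
        for code_b, label_b in labels_b.items():
            score = jaccard(label_a, label_b)
            if score >= threshold:
                pairs.append((code_a, code_b, score))
    return sorted(pairs, key=lambda pair: -pair[2])


# Made-up labels for two hypothetical codelists.
list_a = {"a:1": "Road vehicles", "a:2": "Dairy products"}
list_b = {"b:1": "Road vehicles", "b:2": "Cereals"}
print(candidate_matches(list_a, list_b))
```

Token overlap misses labels that mean the same thing in different words, which is the gap sentence embeddings might fill.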