Support search in vocabulary metadata

nichtich commented 3 years ago

Both the /suggest and /search endpoints search in concepts, but for BARTOC we want to search in vocabularies. How about:

- add a text index for vocabulary metadata
- add a parameter type to /suggest and /search, with default value http://www.w3.org/2004/02/skos/core#Concept, to select which item types to search in (concepts or vocabularies, the latter being http://www.w3.org/2004/02/skos/core#ConceptScheme)
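
To make the second point concrete, here is a minimal sketch of how the proposed type parameter could be resolved. The function name `resolveSearchTarget` and the return values are assumptions made for illustration, not existing jskos-server code, and the sketch does not cover how this would interact with any existing use of type.

```javascript
// Sketch only: resolveSearchTarget and its return values are hypothetical.
const CONCEPT = "http://www.w3.org/2004/02/skos/core#Concept"
const CONCEPT_SCHEME = "http://www.w3.org/2004/02/skos/core#ConceptScheme"

// Decides which kind of item a /search or /suggest request should look in.
// `query` stands for the parsed query string of the request.
function resolveSearchTarget(query = {}) {
  // Default to concepts so that existing requests keep their behaviour.
  const type = query.type || CONCEPT
  if (type === CONCEPT) {
    return "concepts"
  }
  if (type === CONCEPT_SCHEME) {
    return "vocabularies"
  }
  // Assumption: anything else is rejected rather than silently ignored.
  throw new Error(`Unsupported type: ${type}`)
}
```

With this, a request without type would still search concepts, and only `type=http://www.w3.org/2004/02/skos/core#ConceptScheme` would switch to vocabularies.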

stefandesu commented 3 years ago

As long as this doesn't cause problems for the already existing type parameter for /suggest and /search (used for example in DANTE and the GND API), we could implement this.

stefandesu commented 3 years ago

Alternatively, we could add /voc/search and /voc/suggest instead.

stefandesu commented 3 years ago

@nichtich Which fields should be searchable? Because JSKOS fields use one key per language, we can't simply add an index on those fields. Instead, every time a vocabulary is added or changed, we have to create an extra field and put an index on that field (like we do for concepts).

If possible, we should generalize the way we do it for concepts and apply it to schemes as well. What we're currently doing there is:

- notation is searchable by prefixes (e.g. searching for "123" will show anything starting with "123")
- prefLabel and altLabel are searchable by prefixes and suffixes (e.g. searching for "Pädag" will return both "Pädagogische Soziologie" and "Sozialpädagogik")
- all values in creator, definition, scopeNote, and editorialNote are combined in one array, put into a normal MongoDB text index, and matched on whole words only (e.g. if a concept has an editorialNote with content "Bankbetriebslehre s. QK 300 Kapitalflussrechnung s. QP 828 Liquiditätstheorie s. QC 320", searching for "Kapitalfluss" will not return it, but searching for "Kapitalflussrechnung" will)
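
The three rules above could be sketched roughly as follows. This is an illustration under assumptions, not the actual implementation: the field names `notationKeywords`, `labelKeywords`, and `otherTexts` are made up, matching is done in memory instead of through MongoDB indexes, and creator is left out to keep the example short.

```javascript
// Flattens a JSKOS language map ({ de: "x" } or { de: ["x", "y"] })
// into a plain list of non-empty strings.
function languageMapValues(map = {}) {
  return Object.values(map).flat().filter(v => typeof v === "string" && v.length > 0)
}

// All suffixes of a lowercased word: "abc" -> ["abc", "bc", "c"].
function suffixes(word) {
  const lower = word.toLowerCase()
  return Array.from({ length: lower.length }, (_, i) => lower.slice(i))
}

// Builds the derived search fields for one item (hypothetical field names).
function buildSearchFields(item) {
  const labels = [...languageMapValues(item.prefLabel), ...languageMapValues(item.altLabel)]
  const words = labels.flatMap(label => label.split(/\s+/)).filter(Boolean)
  return {
    // Notations are stored as-is (lowercased) and matched by prefix.
    notationKeywords: (item.notation || []).map(n => n.toLowerCase()),
    // Storing every suffix of every word turns "prefix or suffix" search
    // into a plain prefix match, at the cost of extra storage.
    labelKeywords: [...new Set(words.flatMap(suffixes))],
    // These strings would feed a regular text index (whole-word matching).
    otherTexts: ["definition", "scopeNote", "editorialNote"]
      .flatMap(key => languageMapValues(item[key])),
  }
}

// True if any stored keyword starts with the (lowercased) query.
function matchesPrefix(keywords, query) {
  const q = query.toLowerCase()
  return keywords.some(keyword => keyword.startsWith(q))
}
```

Because the language maps are flattened before the derived fields are built, the same function would work for concepts and schemes alike, as long as both carry the same JSKOS properties. The fields would have to be rebuilt whenever an item is added or changed.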

Then, after having a set of results, those results will go through a custom scoring algorithm to determine the best order. This works very well in my opinion, but gets slow if the Mongo result has too many matches (can happen in RVK for example). If we only have a few thousand schemes anyway, this shouldn't be an issue though.

Should I try to apply the same search implementation from concepts to schemes as well?

nichtich commented 3 years ago

This sounds good. The most important thing is to be able to find a concept scheme by words in its name or abstract. Ranking will not be perfect, but that is OK; better ranking would require a text retrieval engine with support for more sophisticated search features such as drilldown.

stefandesu commented 3 years ago

I added a first implementation and some tests. Since a lot of files changed, it would be good if you could take a look, @nichtich. If you're trying it out with BARTOC, don't forget to reimport the schemes and rebuild the indexes (./bin/import.js --indexes). The indexes need to be part of the new import script (#101), and maybe we should have endpoints to create indexes as well. 🤔

stefandesu commented 3 years ago

Also, is there a need to indicate via /status that /voc/search and /voc/suggest exist? DANTE does not have these endpoints, so maybe having a way to determine this would be good.

nichtich commented 3 years ago

> Also, is there a need to indicate via /status that /voc/search and /voc/suggest exist?

Yes, all endpoints are explicitly listed via /status (which might later be extended to Swagger #23)

stefandesu commented 3 years ago

> Yes, all endpoints are explicitly listed via /status (which might later be extended to Swagger #23)

Question: How should these endpoints be listed there? Currently, the properties do not necessarily represent the endpoint path (i.e. property schemes -> /voc, property top -> /voc/top).

stefandesu commented 3 years ago

> Question: How should these endpoints be listed there? Currently, the properties do not necessarily represent the endpoint path (i.e. property schemes -> /voc, property top -> /voc/top).

Still an open question @nichtich.

nichtich commented 3 years ago

How about voc-search and voc-suggest?
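
For illustration, the suggested property names could sit next to the existing ones like this. The shape of the object and the helper `supportsVocabularySearch` are assumptions; the real /status response may use different values (for example full URLs).

```javascript
// Hypothetical excerpt of a /status response: property name -> endpoint path.
const exampleStatus = {
  schemes: "/voc",
  top: "/voc/top",
  "voc-search": "/voc/search",
  "voc-suggest": "/voc/suggest",
}

// Lets a client check whether a server offers vocabulary search at all,
// e.g. to skip services that do not have these endpoints.
function supportsVocabularySearch(status) {
  return Boolean(status["voc-search"]) && Boolean(status["voc-suggest"])
}
```

A client could call `supportsVocabularySearch` on the parsed /status response before offering vocabulary search for a given service.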

gbv / jskos-server

Support search in vocabulary metadata #121