NASA-PDS / registry-api

Web API service for the PDS Registry, providing the implementation of the PDS Search API (https://github.com/nasa-pds/pds-api) for the PDS Registry.
https://nasa-pds.github.io/pds-api
Apache License 2.0

As an API user, I want to be able to use the API for free text search #460

Closed jordanpadams closed 3 years ago

jordanpadams commented 3 years ago

Motivation

...so that I can perform a keyword or Google-like search on the registry and get reasonable results back.

Additional Details

This feature will later be developed into a full natural language search. See NASA-PDS/pds-api#93

The premise of this task is to come up with the API design in order to enable keyword search. The actual result set will be refined per NASA-PDS/pds-api#93.

Acceptance Criteria

Given: some data products for a particular mission (e.g. insight) or targeting a particular planet (e.g. mars)
When I perform: an API search for something like keyword=insight or keyword=mars
Then I expect: it returns products that provide some fuzzy match for the keyword terms searched

Engineering Details

This ticket requires updating the web API specification to accept free-text search criteria rather than only {pds4 field attribute}=value criteria.

For this ticket, we want, at minimum, to allow free-text search using the ES default weighting. We will then want to investigate which fields should be weighted more strongly to enable more robust search results. NASA-PDS/pds-api#49
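A minimal sketch of what such a free-text query could look like as an Elasticsearch request body, assuming a `multi_match` query is used. The function name `build_keyword_query`, the field names `description` and `title`, and the boost value are all hypothetical and only illustrate the idea; they are not the registry's actual implementation. With no boosts, every field keeps the default weighting.

```python
import json


def build_keyword_query(keyword, fields=("description",), boosts=None):
    """Build an Elasticsearch request body for a free-text search.

    Hypothetical helper: field names and boosts are illustrative assumptions.
    """
    boosts = boosts or {}
    # "field^N" is Elasticsearch's query-time syntax for boosting a field.
    weighted = [f"{name}^{boosts[name]}" if name in boosts else name
                for name in fields]
    return {
        "query": {
            "multi_match": {
                "query": keyword,
                "fields": weighted,
                # Tolerate small typos in the keyword (fuzzy matching).
                "fuzziness": "AUTO",
            }
        }
    }


if __name__ == "__main__":
    # Default weighting on a single (assumed) field.
    print(json.dumps(build_keyword_query("insight"), indent=2))
    # Later refinement: weight an (assumed) title field more strongly.
    print(json.dumps(
        build_keyword_query("mars", fields=("title", "description"),
                            boosts={"title": 2}),
        indent=2))
```

Starting with no boosts and adding them later keeps the first version simple while leaving room for the weighting investigation.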

jordanpadams commented 3 years ago

@tdddblog this is next on the list. we can meet to chat about this some more if needed. the acceptance criteria is not super detailed, but hopefully it provides some basic insight into what we are looking for

tloubrieu-jpl commented 3 years ago

The free-text search criteria will be available on the existing data endpoints (/bundles, /collections, /products) through a keyword query parameter.
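As an illustration, a client could build such a request URL as sketched below. The base URL `http://localhost:8080/` and the helper name `keyword_search_url` are assumptions for this example; only the endpoint names and the `keyword` parameter come from the comment above.

```python
from urllib.parse import urlencode, urljoin

# Assumption: a locally deployed API; replace with the real base URL.
BASE_URL = "http://localhost:8080/"

# Endpoints named in this thread as accepting the keyword parameter.
ENDPOINTS = ("bundles", "collections", "products")


def keyword_search_url(endpoint, keyword, base_url=BASE_URL):
    """Return the URL for a free-text search on one of the data endpoints."""
    if endpoint not in ENDPOINTS:
        raise ValueError(f"unsupported endpoint: {endpoint!r}")
    # urlencode takes care of escaping the keyword value.
    return urljoin(base_url, endpoint) + "?" + urlencode({"keyword": keyword})


if __name__ == "__main__":
    print(keyword_search_url("products", "insight"))
    print(keyword_search_url("bundles", "mars"))
```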

tdddblog commented 3 years ago

In the few schemas I have in my registry instance, there are 128 description fields. I assume we have to search all of them. We'll have to change Harvest to automatically merge them into a custom "description" field. A few examples:

pds:Bundle/pds:description
pds:Collection/pds:description
pds:Document/pds:description
pds:Array/pds:description
geom:Geometry_Lander/geom:description
img:Brightness_Correction_File/img:description
img:Subframe/img:description

jordanpadams commented 3 years ago

@tdddblog I am starting to see a lot of this metadata cleanup coming now and down the road. Rather than require this at ingestion time, do we think some sort of post-processing tool should run in the background for the registries to perform this kind of "metadata curation" and update the records? I am just thinking as our natural language search capabilities evolve, it will be difficult to get everyone to re-ingest all their data. just a thought...

also, would this have any impact on weighting the returned results?

tdddblog commented 3 years ago

@jordanpadams Updating every document in Elasticsearch is a very expensive operation: Elasticsearch would have to reindex every document. We can also try using "copy_to" fields, but then Registry Manager has to be updated, and it will only work with newly created fields and documents. Old documents indexed before "copy_to" was added would have to be reindexed.
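A rough sketch of the "copy_to" idea, assuming each description field is mapped as `text` and copied into a combined "description" field. The helper name `description_mapping` and the `text` type are assumptions; the actual mappings created by Registry Manager may differ.

```python
import json


def description_mapping(field_names, target="description"):
    """Build an Elasticsearch mapping body that copies each field into `target`.

    Hypothetical helper; the field types are assumed, not taken from the registry.
    """
    properties = {
        # copy_to makes Elasticsearch index the value under `target` as well.
        name: {"type": "text", "copy_to": target}
        for name in field_names
    }
    # The combined field that a free-text query would search.
    properties[target] = {"type": "text"}
    return {"properties": properties}


if __name__ == "__main__":
    fields = [
        "pds:Bundle/pds:description",
        "pds:Collection/pds:description",
    ]
    print(json.dumps(description_mapping(fields), indent=2))
```

As noted above, "copy_to" only takes effect when a document is indexed, so documents that already exist would still need to be reindexed before they appear in the combined field.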

tloubrieu-jpl commented 3 years ago

For this ticket, @tdddblog will only consider the 'description' fields for free text search.
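One way to picture the restriction to description fields is the sketch below, which gathers every field whose name ends in ":description" into a single combined field at ingestion time. This is only an illustration in Python: the function name `merge_descriptions` and the flat dictionary layout of a document are assumptions, and it does not reflect how Harvest is actually implemented.

```python
def merge_descriptions(doc, target="description"):
    """Return a copy of `doc` with all *:description values merged into `target`.

    Assumes a flat document whose keys look like "pds:Bundle/pds:description".
    """
    values = []
    for name, value in doc.items():
        if name != target and name.endswith(":description"):
            # A field may hold a single value or a list of values.
            values.extend(value if isinstance(value, list) else [value])
    merged = dict(doc)
    if values:
        merged[target] = values
    return merged


if __name__ == "__main__":
    # Made-up document for illustration only.
    doc = {
        "lid": "urn:nasa:pds:example",
        "pds:Bundle/pds:description": "Example bundle description",
        "img:Subframe/img:description": ["First note", "Second note"],
    }
    print(merge_descriptions(doc)["description"])
```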

jordanpadams commented 3 years ago

done per NASA-PDS/registry-api-service#60