NASA-PDS / registry-api

Web API service for the PDS Registry, providing the implementation of the PDS Search API (https://github.com/nasa-pds/pds-api) for the PDS Registry.
https://nasa-pds.github.io/pds-api
Apache License 2.0

like operator does not work in q parameter #170

Open tloubrieu-jpl opened 2 years ago

tloubrieu-jpl commented 2 years ago

🐛 Describe the bug

Don't forget to remove the known bug from the documentation; see the warning at https://nasa-pds.github.io/pds-api/guides/search/endpoints.html#query-string-syntax

📜 To Reproduce

See request

curl --request GET 'http://localhost:8080/products?limit=10&q=pds:Identification_Area.pds:title like "InSight HP3 Rad*"&fields=pds:Identification_Area.pds:title' \
--header 'Accept: text/csv'
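
The request above puts spaces and double quotes directly in the URL. As a minimal sketch (not part of the original report), the same query string can be built with percent-encoding using only the Python standard library. The host, port, and parameter values are copied from the curl command; everything else is illustrative.

```python
from urllib.parse import urlencode

# Host and port are the ones used in the curl command above.
BASE_URL = "http://localhost:8080/products"

params = {
    "limit": 10,
    "q": 'pds:Identification_Area.pds:title like "InSight HP3 Rad*"',
    "fields": "pds:Identification_Area.pds:title",
}

# urlencode escapes spaces (as '+'), double quotes, and other reserved
# characters, so the q expression reaches the server unmodified.
url = f"{BASE_URL}?{urlencode(params)}"
print(url)
```

This only rules out URL-encoding problems as a cause; it does not change how the server interprets the like expression.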

🕵️ Expected behavior

Should not return an empty result for the test bundle loaded in the registry.

📚 Version of Software Used

all versions

🩺 Test Data / Additional context

🏞 Screenshots

🖥 System Info


🦄 Related requirements

⚙️ Engineering Details

al-niessner commented 2 years ago

@tloubrieu-jpl

As stated in #134, like does work, just not with the currently documented syntax, which is to be fixed with #159. To do like "InSight HP3 Rad*", the correct syntax is `like "InSight +HP3 +Rad"`. This makes it more like Google and less like Unix globbing.

No promises, but like "InSight HP3 Rad" may work equally well. Again, see the search syntax for the best way.

tloubrieu-jpl commented 1 year ago

What needs to work for build 13.1 is a proper way to handle the like syntax for text search (possibly with the keyword parameter), followed by a documentation update that clarifies the use cases for users.

msbentley commented 10 months ago

Hi @tloubrieu-jpl just to understand, when is the likely fix for this in production? Thanks!

al-niessner commented 9 months ago

@alexdunnjpl @jordanpadams @tloubrieu-jpl

Thank you for this one. It took quite a while for me to figure it out. First, my data is different, but the idea is the same, as I am searching the same field just with different data. It had me so stumped, which is what made it fun. Sorry, but this is going to be a really long message, as the answer is pretty complex. Suffice it to say, there is nothing wrong with the code, just with us.

Here is the same kind of query, which should return 11000+ answers but returns nothing:

$ curl --get 'http://localhost:8080/products'   --data-urlencode 'limit=10' --data-urlencode 'fields=pds:Identification_Area.pds:title'   --data-urlencode 'start=0'   -H 'accept: application/kvp+json' --data-urlencode 'q=pds:Identification_Area.pds:comment like "Mars2020"'
{
  "summary":{"q":"pds:Identification_Area.pds:comment like \"Mars2020\"","hits":0,"took":26,"search_after":[],"limit":10,"sort":[],"properties":[]},
  "data":[
  ]
}

However, if we treat Mars2020 as a prefix and not a token, then it works just fine:

$ curl --get 'http://localhost:8080/products'   --data-urlencode 'limit=10' --data-urlencode 'fields=pds:Identification_Area.pds:title'   --data-urlencode 'start=0'   -H 'accept: application/kvp+json' --data-urlencode 'q=pds:Identification_Area.pds:title like "Mars2020*"'
{
  "summary":{"q":"pds:Identification_Area.pds:title like \"Mars2020*\"","hits":11126,"took":34,"search_after":[],"limit":10,"sort":[],"properties":["pds:Identification_Area.pds:title"]},
  "data":[    {
      "pds:Identification_Area.pds:title":"Mars2020 PIXL_ENG E08 Observational Product - pe__0018_0668545793_000e08__00305780044569690000___j.CSV"    },
><...snip...><

Now, I know our instinct is to say the '*' is a wildcard, but no, it just means prefix. I spent a while beating my head on this, then tried to say not Mars2020, and it surprised me by working. Fudge.

$ curl --get 'http://localhost:8080/products'   --data-urlencode 'limit=10' --data-urlencode 'fields=pds:Identification_Area.pds:title'   --data-urlencode 'start=0'   -H 'accept: application/kvp+json' --data-urlencode 'q=pds:Identification_Area.pds:title like "-Mars2020"'
{
  "summary":{"q":"pds:Identification_Area.pds:title like \"-Mars2020\"","hits":11180,"took":21,"search_after":[],"limit":10,"sort":[],"properties":["pds:Identification_Area.pds:title"]},
  "data":[    {
      "pds:Identification_Area.pds:title":"Mars2020 PIXL_ENG E08 Observational Product - pe__0018_0668545793_000e08__00305780044569690000___j.CSV"    },
><...snip...><

I spent more time increasing the damage to the wall. Then it dawned on me: the field is not text. From the registry:

        "pds:Identification_Area/pds:title" : {
          "type" : "keyword"
        },

But this one is:

        "pds:Internal_Reference/pds:comment" : {
          "type" : "text"
        },

It behaves as it should with like and all the rest:

$ curl --get 'http://localhost:8080/products'   --data-urlencode 'limit=10' --data-urlencode 'fields=pds:Internal_Reference.pds:comment'   --data-urlencode 'start=0'   -H 'accept: application/kvp+json' --data-urlencode 'q=pds:Internal_Reference.pds:comment like "Mars2020"'
{
  "summary":{"q":"pds:Internal_Reference.pds:comment like \"Mars2020\"","hits":11023,"took":18,"search_after":[],"limit":10,"sort":[],"properties":["pds:Internal_Reference.pds:comment"]},
  "data":[    {
      "pds:Internal_Reference.pds:comment":"This is the PDS4 logical identifier for the Mars2020 Mission.|This is the PDS4 logical identifier for the Mars 2020 spacecraft.|This is the PDS4 logical identifier for the Pixl Spectrometer onboard the Mars 2020 spacecraft.|This is the PDS4 logical identifier for the planet - mars."    },
><...snip...><

and you can then narrow the results to just some of them (pay attention to the total hits):

$ curl --get 'http://localhost:8080/products'   --data-urlencode 'limit=10' --data-urlencode 'fields=pds:Internal_Reference.pds:comment'   --data-urlencode 'start=0'   -H 'accept: application/kvp+json' --data-urlencode 'q=pds:Internal_Reference.pds:comment like "Mars2020 +Spectrometer"'
{
  "summary":{"q":"pds:Internal_Reference.pds:comment like \"Mars2020 +Spectrometer\"","hits":1119,"took":20,"search_after":[],"limit":10,"sort":[],"properties":["pds:Internal_Reference.pds:comment"]},
  "data":[    {
      "pds:Internal_Reference.pds:comment":"This is the PDS4 logical identifier for the Mars2020 Mission.|This is the PDS4 logical identifier for the Mars 2020 spacecraft.|This is the PDS4 logical identifier for the Pixl Spectrometer onboard the Mars 2020 spacecraft.|This is the PDS4 logical identifier for the planet - mars."    },
><...snip...><

So, it is the DB design that precludes any of this from working the way you expect, hence us being wrong (bad expectation). It has nothing to do with the code. There is no way to search keyword fields for token X in OpenSearch; that can only be done with text fields. Why did we choose keywords for text fields?
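
To make the difference concrete, here is a toy model of the behavior seen in the queries above. It is only a sketch: the tokenizer is a crude stand-in for a real analyzer, the function names are made up for illustration, and it does not reproduce how OpenSearch or the registry-api actually evaluates like. It assumes a keyword field stores the whole value as one term, while a text field stores lowercased tokens.

```python
import re


def index_terms(value: str, field_type: str) -> list[str]:
    """Return the terms a toy index would store for one field value."""
    if field_type == "keyword":
        # keyword: the whole value is stored verbatim as a single term
        return [value]
    # text: toy analyzer that lowercases and splits on non-alphanumerics
    return [t for t in re.split(r"[^0-9a-z]+", value.lower()) if t]


def like(value: str, pattern: str, field_type: str) -> bool:
    """Toy matcher: a trailing '*' means prefix match, otherwise exact term match."""
    terms = index_terms(value, field_type)
    if field_type == "text":
        pattern = pattern.lower()  # the query is analyzed like the indexed text
    if pattern.endswith("*"):
        return any(t.startswith(pattern[:-1]) for t in terms)
    return pattern in terms


TITLE = "Mars2020 PIXL_ENG E08 Observational Product"

print(like(TITLE, "Mars2020", "keyword"))   # whole value is one term, so no match
print(like(TITLE, "Mars2020*", "keyword"))  # prefix of that single term matches
print(like(TITLE, "Mars2020", "text"))      # "mars2020" is one of the analyzed terms
```

Under these assumptions, the token search fails on the keyword version of the title, the prefix search succeeds, and the token search succeeds once the same value is treated as text, which is the pattern in the hit counts above.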

alexdunnjpl commented 9 months ago

Why did we choose keywords for text fields?

Can't speak to the decision itself, but text is for freeform text upon which you expect to perform partial matching, and keyword is for "atomic" values upon which you expect to match term queries (which saves write-time analysis overhead... not sure if there are other performance benefits, like at query-time). So keyword makes sense for many of the properties, but as you say, title ain't one of them.

al-niessner commented 9 months ago

That is the point of the question. The field had "title" in it. KeyWORD?

On Thu, Dec 14, 2023, 12:23 Alex Dunn @.***> wrote:

Why did we choose keywords for text fields?

Can't speak to the decision itself, but text is for freeform text upon which you expect to perform partial matching, and keyword is for "atomic" values upon which you expect to match term queries. So keyword makes sense for many of the properties, but as you say, title ain't one of them.


alexdunnjpl commented 9 months ago

Like I said, I agree title doesn't make sense, but it's worth noting that keyword isn't just for single words - multi-word strings are fine, provided they're atomic from a use-case perspective (i.e. you only ever intend to match them verbatim, not analyse them).

jordanpadams commented 9 months ago

@alexdunnjpl @al-niessner hmmm. Looking at the docs, it seems like we definitely need to support keyword for faceting/aggregation purposes (from my quick Google, OpenSearch does not support aggregations on text fields), but text for supporting like queries. How do we decide between the two? Not sure...

From the documentation here, it sounds like we could potentially support both using the fields parameter when we create/update the schema:

To index the same string in several ways (for example, as a keyword and text), provide the fields parameter. You can specify one version of the field to be used for search and another to be used for sorting and aggregations.

But then how do we decide which fields should support both, versus which should just be text? Thoughts?
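
As a rough illustration of the multi-field idea quoted above, a mapping for the title field might look like the sketch below. The field name comes from the registry excerpt earlier in this thread; the sub-field name `text` and the choice to keep keyword as the parent type are assumptions for illustration, not a decided schema.

```python
import json

# Field name as it appears in the registry mapping shown earlier in the thread.
FIELD = "pds:Identification_Area/pds:title"

# Keep the existing keyword type for sorting/aggregations and add an
# analyzed sub-field for like-style queries.
mapping = {
    "properties": {
        FIELD: {
            "type": "keyword",
            "fields": {
                "text": {"type": "text"},  # hypothetical sub-field name
            },
        }
    }
}

# Sub-fields are addressed as "<parent>.<sub-field>".
SEARCH_FIELD = f"{FIELD}.text"  # would be used for like queries
FACET_FIELD = FIELD             # would be used for faceting/aggregation

print(json.dumps(mapping, indent=2))
```

Which attributes get the extra sub-field is still the open question raised here; the sketch only shows the shape of the mapping.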

jordanpadams commented 9 months ago

@al-niessner @alexdunnjpl maybe we use the Information Model types for this. https://pds.nasa.gov/datastandards/documents/im/v1/index_1L00.html#19.37%C2%A0%C2%A0class_pds_character_data_type

From here, I think any attributes with the following types should be text only:

Unfortunately, we cannot include some of the other ASCII_String* types (e.g. ASCII_Short_String_Collapsed), because those are often used for string attributes that we may want to facet on.

jordanpadams commented 9 months ago

@al-niessner @alexdunnjpl I will create a new task for us to update the schema to support this. This will most likely be a blocker for enabling faceting through the API on some fields.