Open tloubrieu-jpl opened 2 years ago
@tloubrieu-jpl
As stated in #134, like does work, just not with the syntax currently documented, which is to be fixed with #159. To do like "InSight HP3 Rad*"
the correct syntax is `like "InSight +HP3 +Rad"`. This makes it more like Google and less like Unix globbing.
No promises but like "InSight HP3 Rad"
may work equally as well. Again see the search syntax for best way.
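For anyone scripting this rather than typing it, here is a minimal sketch of building such a `like` request URL in Python. The base URL, the `/products` path and the field name are assumptions borrowed from the local test queries further down this thread, and `build_like_url` is a hypothetical helper, not part of the registry API. The main point is that a literal `+` inside the `like` string must be percent-encoded, otherwise it is decoded as a space.

```python
from urllib.parse import urlencode, quote, urlsplit, parse_qs


def build_like_url(base_url, field, terms, limit=10):
    """Build a search URL whose q parameter is a `like` expression.

    base_url and field are placeholders; adjust them to your deployment.
    """
    q = f'{field} like "{terms}"'
    params = {"q": q, "limit": limit}
    # quote (instead of the default quote_plus) encodes spaces as %20
    # and a literal '+' as %2B, so the '+' operator survives decoding.
    return f"{base_url}/products?{urlencode(params, quote_via=quote)}"


url = build_like_url(
    "http://localhost:8080",               # assumed local test server
    "pds:Internal_Reference.pds:comment",  # assumed text-typed field
    "InSight +HP3 +Rad",
)
print(url)

# Decoding the query string gives back the original expression.
decoded = parse_qs(urlsplit(url).query)["q"][0]
print(decoded)
```

This is the same job `--data-urlencode` does for curl; it only builds the URL and does not contact any server.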
What needs to work for build 13.1 is a proper way to handle like
syntax for text search (could be with the keyword parameter), and then an update to the documentation to clarify the use cases for users.
Hi @tloubrieu-jpl just to understand, when is the likely fix for this in production? Thanks!
@alexdunnjpl @jordanpadams @tloubrieu-jpl
Thank you for this one. It took quite a while for me to figure it out. First, my data is different, but the idea is the same, as I am searching the same field just with different data. It had me so stumped, which is what made it fun. Sorry, but this is going to be a really long message, as the answer is pretty complex. Suffice it to say, there is nothing wrong with the code, just with us.
Here is the search that should return 11000+ answers but returns nothing:
$ curl --get 'http://localhost:8080/products' --data-urlencode 'limit=10' --data-urlencode 'fields=pds:Identification_Area.pds:title' --data-urlencode 'start=0' -H 'accept: application/kvp+json' --data-urlencode 'q=pds:Identification_Area.pds:comment like "Mars2020"'
{
"summary":{"q":"pds:Identification_Area.pds:comment like \"Mars2020\"","hits":0,"took":26,"search_after":[],"limit":10,"sort":[],"properties":[]},
"data":[
]
}
However, if we treat Mars2020 as a prefix and not a token, then it works just fine:
$ curl --get 'http://localhost:8080/products' --data-urlencode 'limit=10' --data-urlencode 'fields=pds:Identification_Area.pds:title' --data-urlencode 'start=0' -H 'accept: application/kvp+json' --data-urlencode 'q=pds:Identification_Area.pds:title like "Mars2020*"'
{
"summary":{"q":"pds:Identification_Area.pds:title like \"Mars2020*\"","hits":11126,"took":34,"search_after":[],"limit":10,"sort":[],"properties":["pds:Identification_Area.pds:title"]},
"data":[ {
"pds:Identification_Area.pds:title":"Mars2020 PIXL_ENG E08 Observational Product - pe__0018_0668545793_000e08__00305780044569690000___j.CSV" },
><...snip...><
Now I know our instinct is to say, well, the '*' is a wildcard, but no, it just means prefix. I spent a while beating my head on this, then tried to say not Mars2020, and it surprised me by working. Fudge.
$ curl --get 'http://localhost:8080/products' --data-urlencode 'limit=10' --data-urlencode 'fields=pds:Identification_Area.pds:title' --data-urlencode 'start=0' -H 'accept: application/kvp+json' --data-urlencode 'q=pds:Identification_Area.pds:title like "-Mars2020"'
{
"summary":{"q":"pds:Identification_Area.pds:title like \"-Mars2020\"","hits":11180,"took":21,"search_after":[],"limit":10,"sort":[],"properties":["pds:Identification_Area.pds:title"]},
"data":[ {
"pds:Identification_Area.pds:title":"Mars2020 PIXL_ENG E08 Observational Product - pe__0018_0668545793_000e08__00305780044569690000___j.CSV" },
><...snip...><
Spent more time increasing the damage to the wall. Then it dawned on me. The field is not text. From the registry:
"pds:Identification_Area/pds:title" : {
"type" : "keyword"
},
But this one is:
"pds:Internal_Reference/pds:comment" : {
"type" : "text"
},
It behaves as it should with like and all the rest:
$ curl --get 'http://localhost:8080/products' --data-urlencode 'limit=10' --data-urlencode 'fields=pds:Internal_Reference.pds:comment' --data-urlencode 'start=0' -H 'accept: application/kvp+json' --data-urlencode 'q=pds:Internal_Reference.pds:comment like "Mars2020"'
{
"summary":{"q":"pds:Internal_Reference.pds:comment like \"Mars2020\"","hits":11023,"took":18,"search_after":[],"limit":10,"sort":[],"properties":["pds:Internal_Reference.pds:comment"]},
"data":[ {
"pds:Internal_Reference.pds:comment":"This is the PDS4 logical identifier for the Mars2020 Mission.|This is the PDS4 logical identifier for the Mars 2020 spacecraft.|This is the PDS4 logical identifier for the Pixl Spectrometer onboard the Mars 2020 spacecraft.|This is the PDS4 logical identifier for the planet - mars." },
><...snip...><
And then you can filter down to just some of them with (pay attention to the total hits):
$ curl --get 'http://localhost:8080/products' --data-urlencode 'limit=10' --data-urlencode 'fields=pds:Internal_Reference.pds:comment' --data-urlencode 'start=0' -H 'accept: application/kvp+json' --data-urlencode 'q=pds:Internal_Reference.pds:comment like "Mars2020 +Spectrometer"'
{
"summary":{"q":"pds:Internal_Reference.pds:comment like \"Mars2020 +Spectrometer\"","hits":1119,"took":20,"search_after":[],"limit":10,"sort":[],"properties":["pds:Internal_Reference.pds:comment"]},
"data":[ {
"pds:Internal_Reference.pds:comment":"This is the PDS4 logical identifier for the Mars2020 Mission.|This is the PDS4 logical identifier for the Mars 2020 spacecraft.|This is the PDS4 logical identifier for the Pixl Spectrometer onboard the Mars 2020 spacecraft.|This is the PDS4 logical identifier for the planet - mars." },
><...snip...><
So, it is the DB design that precludes any of this from working like you expect, hence us being wrong (bad expectation). It has nothing to do with the code. There is no way to search keyword fields for token X in OpenSearch; it can only be done with text fields. Why did we choose keywords for text fields?
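To make the three results above less mysterious, here is a toy model in Python of how the two field types behave. It is only a sketch under simplifying assumptions: `keyword_terms`, `text_terms` and `matches` are made-up helpers, and the tokenizer is a crude stand-in for a real analyzer, not OpenSearch's actual implementation. It does reproduce the pattern seen in the queries: a bare token misses a keyword field, a prefix hits it, and a negated token matches nearly everything.

```python
import re


def keyword_terms(value):
    """keyword field: the whole value is stored as one un-analyzed term."""
    return {value}


def text_terms(value):
    """text field: rough analyzer stand-in (lowercase, split on non-alphanumerics)."""
    return {tok for tok in re.split(r"[^0-9a-z]+", value.lower()) if tok}


def matches(terms, query, analyzed):
    """Evaluate a one-term query: 'tok', 'tok*' (prefix) or '-tok' (negation)."""
    negate = query.startswith("-")
    q = query.lstrip("-")
    if analyzed:
        q = q.lower()  # the query goes through the same analysis as the field
    if q.endswith("*"):
        hit = any(term.startswith(q[:-1]) for term in terms)
    else:
        hit = q in terms
    return hit != negate


title = "Mars2020 PIXL_ENG E08 Observational Product"
comment = "This is the PDS4 logical identifier for the Mars2020 Mission."

# keyword field: only the full string is a term
print(matches(keyword_terms(title), "Mars2020", analyzed=False))   # token
print(matches(keyword_terms(title), "Mars2020*", analyzed=False))  # prefix
print(matches(keyword_terms(title), "-Mars2020", analyzed=False))  # negation

# text field: individual tokens are terms
print(matches(text_terms(comment), "Mars2020", analyzed=True))
```

In this model the negated query matches because no title is exactly the string "Mars2020", which is consistent with the 11180 hits seen above.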
Why did we choose keywords for text fields?
Can't speak to the decision itself, but `text` is for freeform text upon which you expect to perform partial matching, and `keyword` is for "atomic" values upon which you expect to match term queries (which saves write-time analysis overhead... not sure if there are other performance benefits, like at query time). So `keyword` makes sense for many of the properties, but as you say, `title` ain't one of them.
That is the point of the question. The field had "title" in it. KeyWORD?
Like I said, I agree `title` doesn't make sense, but it's worth noting that `keyword` isn't just for single words. Multi-word strings are fine, provided they're atomic from a use-case perspective (i.e. you only ever intend to match them verbatim, not analyse them).
@alexdunnjpl @al-niessner hmmm. Looking at the docs, it seems like we definitely need to support `keyword` for faceting/aggregation purposes (from my quick Google, OpenSearch does not support aggregations on `text` fields), but `text` for supporting `like` queries. How do we distinguish between the two? Not sure...
From the documentation here, it sounds like we could potentially support both using the `fields` parameter when we create/update the schema:
To index the same string in several ways (for example, as a keyword and text), provide the fields parameter. You can specify one version of the field to be used for search and another to be used for sorting and aggregations.
But then how do we decide which fields support both, versus which are just `text`? Thoughts?
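For reference, a multi-field mapping along the lines of the quoted documentation might look like the sketch below, written as a Python dict that serializes to the JSON mapping body. The sub-field name `text` and the helper `multi_field_mapping` are assumptions for illustration only, and nothing here reflects how the registry schema is actually generated.

```python
import json


def multi_field_mapping(field_name, subfield="text"):
    """Index one string two ways: keyword at the top level, analyzed text below it.

    The top-level keyword stays usable for sorting and aggregations, while
    `<field_name>.<subfield>` would be the target for token searches.
    """
    return {
        "properties": {
            field_name: {
                "type": "keyword",
                "fields": {
                    subfield: {"type": "text"},  # hypothetical sub-field name
                },
            }
        }
    }


mapping = multi_field_mapping("pds:Identification_Area/pds:title")
print(json.dumps(mapping, indent=2))
```

With a mapping like this, the API would presumably need to route `like` queries to the sub-field and everything else to the keyword, which is part of the open question above.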
@al-niessner @alexdunnjpl maybe we use the Information Model types for this. https://pds.nasa.gov/datastandards/documents/im/v1/index_1L00.html#19.37%C2%A0%C2%A0class_pds_character_data_type
From here, I think any attributes with the following types should be text only:
Unfortunately, we cannot include some of the other ASCII_String* types (e.g. ASCII_Short_String_Collapsed), because those are often used for string attributes that we may want to facet on.
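One way to picture the rule being proposed is a small lookup from Information Model data type to index mapping, sketched here in Python. The set of text-only types is deliberately a placeholder (`EXAMPLE_FREE_TEXT_TYPE` is not a real type name), and giving every other type a keyword plus a `text` sub-field is just one possible reading of the discussion, not a decision.

```python
# Placeholder only: substitute the agreed list of text-only IM data types.
TEXT_ONLY_TYPES = {"EXAMPLE_FREE_TEXT_TYPE"}


def mapping_for_type(im_type, text_only_types):
    """Pick an index mapping for an attribute based on its IM data type."""
    if im_type in text_only_types:
        # Free-form text: searchable with `like`, no faceting needed.
        return {"type": "text"}
    # Everything else stays facetable, with an assumed text sub-field
    # so that token searches remain possible.
    return {"type": "keyword", "fields": {"text": {"type": "text"}}}


print(mapping_for_type("EXAMPLE_FREE_TEXT_TYPE", TEXT_ONLY_TYPES))
print(mapping_for_type("ASCII_Short_String_Collapsed", TEXT_ONLY_TYPES))
```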
@al-niessner @alexdunnjpl I will create a new task for us to update the schema to support this. This will most likely be a blocker for enabling faceting through the API on some fields.
Describe the bug
Don't forget to remove the known bug in the documentation; see the warning in https://nasa-pds.github.io/pds-api/guides/search/endpoints.html#query-string-syntax
To Reproduce
See request
Expected behavior
Should not return an empty result for the test bundle loaded in the registry.
Version of Software Used
all versions
Test Data / Additional context
Screenshots
System Info
Related requirements
Engineering Details