NASA-PDS / registry-api

Web API service for the PDS Registry, providing the implementation of the PDS Search API (https://github.com/nasa-pds/pds-api) for the PDS Registry.
https://nasa-pds.github.io/pds-api
Apache License 2.0
2 stars 5 forks source link

As an API user, I want to perform a search using wildcards #457

Closed jordanpadams closed 3 years ago

jordanpadams commented 3 years ago

For more information on how to populate this new feature request, see the PDS Wiki on User Story Development:

https://github.com/NASA-PDS/nasa-pds.github.io/wiki/Issue-Tracking#user-story-development

Motivation

...so that I can more easily search for a product without knowing the entire LID

Additional Details

As stated in the PDS API Spec:

The PDS Search API also supports wild cards ? and . A search with no q parameter specified will default to q= (search for all possible records).

Must also use like operator to search for the wildcard.

See specification https://docs.google.com/document/d/16d0MVh48bFLvWsa5-B_Hy-cby1rGWdnNojWOJpUcOvA/edit#heading=h.dihrtzltiwag

Acceptance Criteria

Given There is a product with some LID like urn:nasa:pds:my_bundle:my_collection:my_product When I perform A query against the the products endpoint like GET /products?q=lid like "*my_bundle*" Then I expect the API results to include, at minimum the product metadata for urn:nasa:pds:my_bundle:my_collection:my_product (and any other products that contain my_bundle in their lid.

Given There is a product with some title like this is my product foobar When I perform A query against the the products endpoint like GET /products?q=title like "*foobar*" Then I expect the API results to include, at minimum the product metadata for the product with the title this is my product foobar (and any other products that contain foobar in their title

jordanpadams commented 3 years ago

@tloubrieu-jpl FYI, i descoped NASA-PDS/pds-api#48 for this capability since this is important for both the user community to play around with, but it is also one workaround for https://github.com/NASA-PDS/pds-deep-archive/issues/7

al-niessner commented 3 years ago

@jordanpadams @tloubrieu-jpl

What repo does this api-search-query-lexer-0.0.1-SNAPSHOT.jar come from? I guess this can be abstracted into given a dependency, how do I find the repo? For the wildcard, death occurs in this jarfile because of the token parsing of the query lid eq *text* from a seemingly infinite loop as the JVM runs out of memory while growing an Array. I need into the code to see why the code does not just get 3 tokens.

Found it. github search did not find it well for some reason.

tloubrieu-jpl commented 3 years ago

@al-niessner @jordanpadams sorry I missed the comments on that earlier.

@al-niessner I can help here since I developped the antlr parser. I would suggest first to test hardcoded wildcard in elasticsearch before we figure out how to parse the API query and create the related elasticsearch request.

al-niessner commented 3 years ago

@tloubrieu-jpl @jordanpadams

Moving into a holding pattern. Would be nice if issue 4 was completed and merged into master.

Have a bad work-around in the registry-api-service to handle quotes. Noticed that gov.nasa.pds.api.engineering.elasticsearch.Antlr4SearchListener strips the quotes as well. This means that removing them requires a big of run/debug to find all the places that are already trying to manage this problem.

Lastly, successfully have the query being processed in the registry-api-service, but it returns nothing when I am pretty sure it should. I need to spend time with ES to see if the search works in general and if not why.

Here is the search JSON:


SearchRequest{searchType=QUERY_THEN_FETCH, indices=[registry], indicesOptions=IndicesOptions[ignore_unavailable=false, allow_no_indices=true, expand_wildcards_open=true, expand_wildcards_closed=false, expand_wildcards_hidden=false, allow_aliases_to_multiple_indices=true, forbid_closed_indices=true, ignore_aliases=false, ignore_throttled=true], types=[], routing='null', preference='null', requestCache=null, scroll=null, maxConcurrentShardRequests=0, batchedReduceSize=512, preFilterShardSize=null, allowPartialSearchResults=null, localClusterAlias=null, getOrCreateAbsoluteStartMillis=-1, ccsMinimizeRoundtrips=true, source={"from":0,"size":100,"timeout":"60s","query":{"bool":{"must":[{"bool":{"must":[{"bool":{"must":[{"term":

{"lid":{"value":"*pdart14_meap*","boost":1.0}}

}],"adjust_pure_negative":true,"boost":1.0}}],"adjust_pure_negative":true,"boost":1.0}}],"adjust_pure_negative":true,"boost":1.0}}}}
tloubrieu-jpl commented 3 years ago

@al-niessner I have created a pull request for new development (improved pagination, field exists criteria). The branch was based on issue_4 on which I corrected a bug.

You can review it

al-niessner commented 3 years ago

@tloubrieu-jpl @jordanpadams add whoever else you think needs to read this.

The wildcard search is not working as this ticket suggests. Some tweaks were needed just to get the wildcard request from the input to the code that translates the URL language to elastic search JSON. The idea is to find the query in the acceptance criteria above of GET /products?q=lid eq "*my_bundle*" which IRL testing has to look like this GET /products?q="lid%20eq%20%22*pdart14_meap*%22". Sorry about %22 but they are currently necessary and will be removed in the future and probably because of responses to this issue.

Given the IRL testing query, ElasticSearchRegistrySearchRequestBuilder.getSearchProductsRequest() produces this elastic search JSON query (sorry about the size but is how the query language works):

{
    "from":0,
    "size":100,
    "timeout":"60s",
    "query":
    {
        "bool":
        {
            "must":[
                {
                    "bool":
                    {
                        "must":[
                            {
                                "bool":
                                {
                                    "must":[
                                        {
                                            "term":
                                            {
                                                "lid":
                                                {
                                                    "value":"*pdart14_meap*",
                                                    "boost":1.0
                                                }
                                            }
                                        }],
                                    "adjust_pure_negative":true,
                                    "boost":1.0
                                }
                            }],
                        "adjust_pure_negative":true,
                        "boost":1.0
                    }
                }],
            "adjust_pure_negative":true,
            "boost":1.0
        }
    }
}

The problem I have is that it does not work. The shard is a success for all that is worth and there are no failures, which adds no value, because the hits are 0. Digging into elastic search, it turns out the correct JSON query for the IRL testing case that returns 17 hits is:

{
    "from":0,
    "size":100,
    "timeout":"60s",
    "query":
    {
        "wildcard":
        {
            "lid":
            {
                "value":"*pdart14_meap*",
                "boost":1.0
            }
        }
    }
}

It is looking like gov.nasa.pds.api.engineering.elasticsearch.Antlr4SearchListener needs a complete overhaul. Before the overhaul (complete rewrite) begins on April 7, 2021 at 1000 hrs PDT, as it will take a week or two, did I miss something or is there a different direction you would like to go?

al-niessner commented 3 years ago

@jordanpadams @tloubrieu-jpl

Need further clarification on what is need for acceptance criteria. This list will grow as new processing is developed (remember the two wildcard choices are ? and *:

  1. x eq *wildcard? should work and covered in acceptance criteria already
  2. x ne *wildcard? should work or fail with error? Does not look like ES directly support this but may be able to fake it with a mustNot instead of must.
  3. x gt *wildcard? should fail with what error? Is 10 > 3* or 10 > 3? when there is a 3 and a 30? Besides, does not look like it is supported by ES.
  4. x ge *wildcard? should fail with what error?
  5. x lt *wildcard? should fail with what error?
  6. x le *wildcard? should fail with what error?

What about ge,gt,le,lt for strings rather than numbers? What about ge,gt,le,lt for booleans?

tloubrieu-jpl commented 3 years ago

Hi Al,

Thanks for raising that. Actually ge, gt on values with wildcard does not make a lot of sense to me.

Given the current query syntax, that would make more sense to introduce a 'like' operator ? What do you think, @al-niessner ? @jordanpadams ?

Would that work more easily for translation into elasticsearch query ?

jordanpadams commented 3 years ago

@al-niessner @tloubrieu-jpl interesting.

Given the current query syntax, that would make more sense to introduce a 'like' operator ?

i'm not sure if I like that or not (pun intended) 🤷 . i feel like it is simpler just to overload eq for now unless we get a specific request to add more operators?

x ne *wildcard? should work or fail with error? Does not look like ES directly support this but may be able to fake it with a mustNot instead of must.

if we can kludge it in, that would be great.

for the others, I agree with Thomas. maybe throw and error that wildcards do not work with this operator (or something like that).

al-niessner commented 3 years ago

@jordanpadams @tloubrieu-jpl

I am looking for acceptance criteria (table) -- what error. 500 for internal server error makes it seem like it was not thought of. I am not sure if you want a 412 or 422 or something else.

There is no need for 'like' as equal is obvious enough.

tloubrieu-jpl commented 3 years ago

@al-niessner @jordanpadams

Ok 'eq' is ok to be overloaded to support wildcards. I am just possibly worrying about 'Antlr4SearchListener needs a complete overhaul'. Does that still apply ? I believe we would just need to add a specific STRINGVAL_WITH_WILDCARD to the NUMBER or STRINGVAL entities already available in the grammar (https://github.com/NASA-PDS/api-search-query-lexer/blob/master/src/main/antlr4/gov/nasa/pds/api/engineering/lexer/Search.g4). We would accept this entity only with eq and ne operator in the grammar. The lexer code should still be generated automatically as for now. Of course the Antlr4SearchListener will need update but hopefully not being completely overhauled unless I missed something.

For 'ne' operator, it already translates in ES query into NOT (x eq y) so that should not be too difficult to add the wildcard support for it as well.

At last, when wildcard are not supported, I think 400 error (bad request) is good enough for what we want to report. I think that would be handled as-is if the grammar is updated properly.

@al-niessner we can have a quick call is that helps.

Thanks,

Thomas

al-niessner commented 3 years ago

@tloubrieu-jpl

Rewrite is under way. All the booleans are wrong as are the terms (should be match). If you want to find a value in a list (term) then we should add an operator IN.

tloubrieu-jpl commented 3 years ago

@al-niessner can I call you ?

I am thinking the current behavior is ok when the value to compare with does not contains wildcards.

If you update the grammar with a new value type having wildcard, you just have to add a new case in the Antlr4SearchListener to use 'match' instead of what we have for standard values (which works fine). I am worrying about having to retest the full query listener...

al-niessner commented 3 years ago

@tloubrieu-jpl

No need to update grammar at this point. No, current behavior of listener is not good. Need to be using WildcardQueryBuilder, RangeQueryBuilder, and RangeQueryBuilder. Maybe at 1330 hrs.

tloubrieu-jpl commented 3 years ago

@al-niessner that sounds good. I think I just need a good explanation. I'll setup a webex at 13:30.