Closed jordanpadams closed 3 years ago
@tloubrieu-jpl FYI, i descoped NASA-PDS/pds-api#48 for this capability since this is important for both the user community to play around with, but it is also one workaround for https://github.com/NASA-PDS/pds-deep-archive/issues/7
@jordanpadams @tloubrieu-jpl
What repo does this api-search-query-lexer-0.0.1-SNAPSHOT.jar
come from? I guess this can be abstracted into given a dependency, how do I find the repo? For the wildcard, death occurs in this jarfile because of the token parsing of the query lid eq *text*
from a seemingly infinite loop as the JVM runs out of memory while growing an Array. I need into the code to see why the code does not just get 3 tokens.
Found it. github search did not find it well for some reason.
@al-niessner @jordanpadams sorry I missed the comments on that earlier.
@al-niessner I can help here since I developped the antlr parser. I would suggest first to test hardcoded wildcard in elasticsearch before we figure out how to parse the API query and create the related elasticsearch request.
@tloubrieu-jpl @jordanpadams
Moving into a holding pattern. Would be nice if issue 4 was completed and merged into master.
Have a bad work-around in the registry-api-service to handle quotes. Noticed that gov.nasa.pds.api.engineering.elasticsearch.Antlr4SearchListener strips the quotes as well. This means that removing them requires a big of run/debug to find all the places that are already trying to manage this problem.
Lastly, successfully have the query being processed in the registry-api-service, but it returns nothing when I am pretty sure it should. I need to spend time with ES to see if the search works in general and if not why.
Here is the search JSON:
SearchRequest{searchType=QUERY_THEN_FETCH, indices=[registry], indicesOptions=IndicesOptions[ignore_unavailable=false, allow_no_indices=true, expand_wildcards_open=true, expand_wildcards_closed=false, expand_wildcards_hidden=false, allow_aliases_to_multiple_indices=true, forbid_closed_indices=true, ignore_aliases=false, ignore_throttled=true], types=[], routing='null', preference='null', requestCache=null, scroll=null, maxConcurrentShardRequests=0, batchedReduceSize=512, preFilterShardSize=null, allowPartialSearchResults=null, localClusterAlias=null, getOrCreateAbsoluteStartMillis=-1, ccsMinimizeRoundtrips=true, source={"from":0,"size":100,"timeout":"60s","query":{"bool":{"must":[{"bool":{"must":[{"bool":{"must":[{"term":
{"lid":{"value":"*pdart14_meap*","boost":1.0}}
}],"adjust_pure_negative":true,"boost":1.0}}],"adjust_pure_negative":true,"boost":1.0}}],"adjust_pure_negative":true,"boost":1.0}}}}
@al-niessner I have created a pull request for new development (improved pagination, field exists criteria). The branch was based on issue_4 on which I corrected a bug.
You can review it
@tloubrieu-jpl @jordanpadams add whoever else you think needs to read this.
The wildcard search is not working as this ticket suggests. Some tweaks were needed just to get the wildcard request from the input to the code that translates the URL language to elastic search JSON. The idea is to find the query in the acceptance criteria above of GET /products?q=lid eq "*my_bundle*"
which IRL testing has to look like this GET /products?q="lid%20eq%20%22*pdart14_meap*%22"
. Sorry about %22
but they are currently necessary and will be removed in the future and probably because of responses to this issue.
Given the IRL testing query, ElasticSearchRegistrySearchRequestBuilder.getSearchProductsRequest() produces this elastic search JSON query (sorry about the size but is how the query language works):
{
"from":0,
"size":100,
"timeout":"60s",
"query":
{
"bool":
{
"must":[
{
"bool":
{
"must":[
{
"bool":
{
"must":[
{
"term":
{
"lid":
{
"value":"*pdart14_meap*",
"boost":1.0
}
}
}],
"adjust_pure_negative":true,
"boost":1.0
}
}],
"adjust_pure_negative":true,
"boost":1.0
}
}],
"adjust_pure_negative":true,
"boost":1.0
}
}
}
The problem I have is that it does not work. The shard is a success for all that is worth and there are no failures, which adds no value, because the hits are 0. Digging into elastic search, it turns out the correct JSON query for the IRL testing case that returns 17 hits is:
{
"from":0,
"size":100,
"timeout":"60s",
"query":
{
"wildcard":
{
"lid":
{
"value":"*pdart14_meap*",
"boost":1.0
}
}
}
}
It is looking like gov.nasa.pds.api.engineering.elasticsearch.Antlr4SearchListener needs a complete overhaul. Before the overhaul (complete rewrite) begins on April 7, 2021 at 1000 hrs PDT, as it will take a week or two, did I miss something or is there a different direction you would like to go?
@jordanpadams @tloubrieu-jpl
Need further clarification on what is need for acceptance criteria. This list will grow as new processing is developed (remember the two wildcard choices are ?
and *
:
x eq *wildcard?
should work and covered in acceptance criteria alreadyx ne *wildcard?
should work or fail with error? Does not look like ES directly support this but may be able to fake it with a mustNot
instead of must
.x gt *wildcard?
should fail with what error? Is 10 > 3* or 10 > 3? when there is a 3 and a 30? Besides, does not look like it is supported by ES.x ge *wildcard?
should fail with what error?x lt *wildcard?
should fail with what error?x le *wildcard?
should fail with what error?What about ge,gt,le,lt for strings rather than numbers? What about ge,gt,le,lt for booleans?
Hi Al,
Thanks for raising that. Actually ge, gt on values with wildcard does not make a lot of sense to me.
Given the current query syntax, that would make more sense to introduce a 'like' operator ? What do you think, @al-niessner ? @jordanpadams ?
Would that work more easily for translation into elasticsearch query ?
@al-niessner @tloubrieu-jpl interesting.
Given the current query syntax, that would make more sense to introduce a 'like' operator ?
i'm not sure if I like that or not (pun intended) 🤷 . i feel like it is simpler just to overload eq
for now unless we get a specific request to add more operators?
x ne *wildcard? should work or fail with error? Does not look like ES directly support this but may be able to fake it with a mustNot instead of must.
if we can kludge it in, that would be great.
for the others, I agree with Thomas. maybe throw and error that wildcards do not work with this operator (or something like that).
@jordanpadams @tloubrieu-jpl
I am looking for acceptance criteria (table) -- what error. 500 for internal server error makes it seem like it was not thought of. I am not sure if you want a 412 or 422 or something else.
There is no need for 'like' as equal is obvious enough.
@al-niessner @jordanpadams
Ok 'eq' is ok to be overloaded to support wildcards. I am just possibly worrying about 'Antlr4SearchListener needs a complete overhaul'. Does that still apply ? I believe we would just need to add a specific STRINGVAL_WITH_WILDCARD to the NUMBER or STRINGVAL entities already available in the grammar (https://github.com/NASA-PDS/api-search-query-lexer/blob/master/src/main/antlr4/gov/nasa/pds/api/engineering/lexer/Search.g4). We would accept this entity only with eq and ne operator in the grammar. The lexer code should still be generated automatically as for now. Of course the Antlr4SearchListener will need update but hopefully not being completely overhauled unless I missed something.
For 'ne' operator, it already translates in ES query into NOT (x eq y) so that should not be too difficult to add the wildcard support for it as well.
At last, when wildcard are not supported, I think 400 error (bad request) is good enough for what we want to report. I think that would be handled as-is if the grammar is updated properly.
@al-niessner we can have a quick call is that helps.
Thanks,
Thomas
@tloubrieu-jpl
Rewrite is under way. All the booleans are wrong as are the terms (should be match). If you want to find a value in a list (term) then we should add an operator IN.
@al-niessner can I call you ?
I am thinking the current behavior is ok when the value to compare with does not contains wildcards.
If you update the grammar with a new value type having wildcard, you just have to add a new case in the Antlr4SearchListener to use 'match' instead of what we have for standard values (which works fine). I am worrying about having to retest the full query listener...
@tloubrieu-jpl
No need to update grammar at this point. No, current behavior of listener is not good. Need to be using WildcardQueryBuilder, RangeQueryBuilder, and RangeQueryBuilder. Maybe at 1330 hrs.
@al-niessner that sounds good. I think I just need a good explanation. I'll setup a webex at 13:30.
For more information on how to populate this new feature request, see the PDS Wiki on User Story Development:
https://github.com/NASA-PDS/nasa-pds.github.io/wiki/Issue-Tracking#user-story-development
Motivation
...so that I can more easily search for a product without knowing the entire LID
Additional Details
As stated in the PDS API Spec:
Must also use
like
operator to search for the wildcard.See specification https://docs.google.com/document/d/16d0MVh48bFLvWsa5-B_Hy-cby1rGWdnNojWOJpUcOvA/edit#heading=h.dihrtzltiwag
Acceptance Criteria
Given There is a product with some LID like
urn:nasa:pds:my_bundle:my_collection:my_product
When I perform A query against the the products endpoint likeGET /products?q=lid like "*my_bundle*"
Then I expect the API results to include, at minimum the product metadata forurn:nasa:pds:my_bundle:my_collection:my_product
(and any other products that containmy_bundle
in their lid.Given There is a product with some title like
this is my product foobar
When I perform A query against the the products endpoint likeGET /products?q=title like "*foobar*"
Then I expect the API results to include, at minimum the product metadata for the product with the titlethis is my product foobar
(and any other products that containfoobar
in their title