Open eubinecto opened 4 years ago
What query do I want to construct?
Priority: a query that can do 1, 2, and 3. As for the rest, think about how to do them later.
`Intervals` query (but with one problem)
`intervals` query: A full text query that allows fine-grained control of the ordering and proximity of matching terms.
So yeah, with the `intervals` query, I can do 1 and 2.
Here is an example request:
GET general_idx/_search
{
  "query": {
    "intervals": {
      "context": {
        "match": {
          "query": "have one's cake and eat it too",
          "max_gaps": 3,
          "ordered": true
        }
      }
    }
  }
}
The request is quite straightforward: the `max_gaps` parameter controls proximity, and the `ordered` parameter controls ordering.
But one problem with this is that it does not allow partial matches, yet partial matching is one of the three must-haves.
Here is an example that illustrates this problem:
no partial match for `intervals` query | partial match supported by `match` query |
---|---|
With the `intervals` query, it's great that I can have fine control over the proximity and order of terms, but I need a query for which `have one's cake and eat it too` would also match, for example, `have` (whatever terms) `cake` (whatever terms) `too`. That is, a document matches as long as some percentage of the query terms exists in the document.
`intervals` and `match`?
`intervals` can do 1 and 2. `match` can do 3. Then could I simply join them with `must` to get the best of both worlds?
No, that wouldn't work, because the results of `match` would be ranked regardless of the proximity and order of terms.
`minimum_should_match` in `intervals` query?
Okay, since the `intervals` query already does 1 and 2, it is more plausible to come up with a way to do 3 with the `intervals` query than to figure out how to do 1 and 2 with the `match` query.
What rules do we have for the `intervals` query? We have `match`, `prefix`, `wildcard`, `fuzzy`, `all_of`, `any_of`.
Let's see if any of these rules (or a combination of these rules) could do something similar to `minimum_should_match`.
This is what Prof. Nenadic suggested to me. If you want to search for `have one's cake and eat it too`, then break it down into, e.g., 2-grams like so:
- `have one's`
- `one's cake`
- `cake and`
- `and eat`
- `eat it`
- `it too`
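The 2-gram breakdown above can be sketched in a few lines of Python. This is only a minimal sketch: it assumes plain whitespace tokenization (the analyzer on the `context` field may tokenize differently), and `word_ngrams` is a hypothetical helper name, not something from Elasticsearch.

```python
def word_ngrams(phrase: str, n: int = 2) -> list[str]:
    """Return the overlapping word n-grams of a phrase.

    Assumption: tokens are separated by whitespace, which may differ from
    how the Elasticsearch analyzer tokenizes the indexed field.
    """
    tokens = phrase.split()
    if len(tokens) < n:
        # Too short to form an n-gram: fall back to the phrase itself.
        return [" ".join(tokens)] if tokens else []
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


for gram in word_ngrams("have one's cake and eat it too"):
    print(gram)
```

A phrase of k words yields k - 1 2-grams, so the seven-word idiom above produces six.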
Okay, let's try doing this. Can I do this with an `intervals` query?
`intervals` query
I tried the following query:
GET general_idx/_search
{
  "query": {
    "intervals": {
      "context": {
        "any_of": {
          "intervals": [
            {
              "match": {
                "query": "have one's",
                "max_gaps": 3,
                "ordered": true
              }
            },
            {
              "match": {
                "query": "one's cake",
                "max_gaps": 3,
                "ordered": true
              }
            },
            {
              "match": {
                "query": "cake and",
                "max_gaps": 3,
                "ordered": true
              }
            },
            {
              "match": {
                "query": "and eat",
                "max_gaps": 3,
                "ordered": true
              }
            },
            {
              "match": {
                "query": "eat it",
                "max_gaps": 3,
                "ordered": true
              }
            },
            {
              "match": {
                "query": "it too",
                "max_gaps": 3,
                "ordered": true
              }
            }
          ]
        }
      }
    }
  }
}
and this is the result:
{
  "took" : 46,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10000,
      "relation" : "gte"
    },
    "max_score" : 0.84615386,
    "hits" : [
      {
        "_index" : "general_idx",
        "_type" : "_doc",
        "_id" : "VmKN7MWrLJs|auto|en|6847893640293742646",
        "_score" : 0.84615386,
        "_source" : {
          "start" : 9.0,
          "duration" : 4.04,
          "content" : "eat it later",
          "prev_id" : "VmKN7MWrLJs|auto|en|-5714611046218093991",
          "next_id" : "VmKN7MWrLJs|auto|en|-7648022219454532643",
          "context" : "and then eat it and then poop it out and eat it later like I'm take the plastic off and eat it",
          "caption" : {
            "id" : "VmKN7MWrLJs|auto|en",
            "is_auto" : true,
            "lang_code" : "en",
            "video" : {
              "id" : "VmKN7MWrLJs",
              "views" : 40801,
              "title" : "Best of the Week - March 29, 2015 - Joe Rogan Experience",
              "publish_date_int" : "20150405",
              "category" : "People & Blogs",
              "channel" : {
                "id" : "UCzQUP1qoWDoEbmsQxvdjxgQ",
                "subs" : 9880000,
                "lang_code" : "en"
              }
            }
          }
        }
      },
      {
        "_index" : "general_idx",
        "_type" : "_doc",
        "_id" : "8ylL8YIs7C0|auto|en|-6040570853495306226",
        "_score" : 0.8,
        "_source" : {
          "start" : 6515.28,
          "duration" : 2.58,
          "content" : "but you kind of get to have your cake",
          "prev_id" : "8ylL8YIs7C0|auto|en|5036019337676617779",
          "next_id" : "8ylL8YIs7C0|auto|en|-6573769737227039838",
          "context" : "cellular cleanup cellular auto feature but you kind of get to have your cake and eat it too because you have a bunch",
          "caption" : {
            "id" : "8ylL8YIs7C0|auto|en",
            "is_auto" : true,
            "lang_code" : "en",
            "video" : {
              "id" : "8ylL8YIs7C0",
              "views" : 1200341,
              "title" : "Joe Rogan Experience #1235 - Ben Greenfield",
              "publish_date_int" : "20190130",
              "category" : "People & Blogs",
              "channel" : {
                "id" : "UCzQUP1qoWDoEbmsQxvdjxgQ",
                "subs" : 9880000,
                "lang_code" : "en"
              }
            }
          }
        }
      },
...
where,
- A: 1st place: and then **eat it** and then poop it out and **eat it** later like I'm take the plastic off and **eat it** (0.84)
- B: 2nd place: cellular cleanup cellular auto feature but you kind of get to have your cake and eat it too because you have a bunch (0.8)
- C: 3rd place: but you kind of get to have your cake and eat it too because you have a bunch of calories at the end of that are you (0.8)

Well, I've got close to the solution, but A should be ranked significantly lower than B and C, as it is clearly not a use of the idiom we're looking for (no reference to `cake`).
So, how could we fix this then?
If you increase N to 3, you wouldn't benefit that much from N-gram search, since most idioms are no longer than 3-4 words. ~N should be fixed at 2.~
Well, we could also try varying N depending on the length of the phrase.
Or, I think we could limit the number of matches to 1 (the one with the best score)? But applying strict rules to algorithms is generally bad (it does not scale well to other contexts).
Can I make use of the `filter` parameter? Well... again, rigid rules are usually bad.
Well, this works like a charm:
GET general_idx/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "intervals": {
            "context": {
              "match": {
                "query": "have one's",
                "max_gaps": 2,
                "ordered": true
              }
            }
          }
        },
        {
          "intervals": {
            "context": {
              "match": {
                "query": "one's cake",
                "max_gaps": 2,
                "ordered": true
              }
            }
          }
        },
        {
          "intervals": {
            "context": {
              "match": {
                "query": "cake and",
                "max_gaps": 2,
                "ordered": true
              }
            }
          }
        },
        {
          "intervals": {
            "context": {
              "match": {
                "query": "and eat",
                "max_gaps": 2,
                "ordered": true
              }
            }
          }
        },
        {
          "intervals": {
            "context": {
              "match": {
                "query": "eat it",
                "max_gaps": 2,
                "ordered": true
              }
            }
          }
        },
        {
          "intervals": {
            "context": {
              "match": {
                "query": "it too",
                "max_gaps": 2,
                "ordered": true
              }
            }
          }
        }
      ],
      "minimum_should_match": "75%"
    }
  }
}
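Writing six near-identical clauses by hand gets tedious, so here is a rough Python sketch that builds the same request body from a phrase. The helper names (`bigrams`, `build_idiom_query`, `required_clauses`) are hypothetical, whitespace tokenization is an assumption, and the sketch only constructs the body; sending it to Elasticsearch is left out.

```python
import json


def bigrams(phrase: str) -> list[str]:
    """Overlapping word 2-grams (assumes whitespace tokenization)."""
    tokens = phrase.split()
    return [" ".join(tokens[i:i + 2]) for i in range(len(tokens) - 1)]


def build_idiom_query(phrase: str, field: str = "context",
                      max_gaps: int = 2, min_match: str = "75%") -> dict:
    """Build a bool/should body with one ordered intervals clause per 2-gram."""
    should = [
        {
            "intervals": {
                field: {
                    "match": {
                        "query": gram,
                        "max_gaps": max_gaps,
                        "ordered": True,
                    }
                }
            }
        }
        for gram in bigrams(phrase)
    ]
    return {"query": {"bool": {"should": should,
                               "minimum_should_match": min_match}}}


def required_clauses(n_clauses: int, percent: int) -> int:
    """Clauses that must match for a positive percentage.

    Assumption (from my reading of the minimum_should_match docs): the
    number computed from the percentage is rounded down.
    """
    return n_clauses * percent // 100


body = build_idiom_query("have one's cake and eat it too")
print(json.dumps(body, indent=2))
print(required_clauses(len(body["query"]["bool"]["should"]), 75))
```

If the rounding assumption holds, `"75%"` of six clauses means at least four 2-grams must match, which would explain why a document that only repeats `eat it` no longer qualifies.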
E.g., if you want to search for "stand by my point", "stand by your point" won't match.
I could make it match by substituting pronouns with alternatives (my -> one's, her, his, their, etc.). Think about doing this later.
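One possible way to do the pronoun substitution is to expand the phrase into its variants before building the query. The sketch below is only an illustration: `POSSESSIVES` and `pronoun_variants` are hypothetical names, the word list is my own guess at a reasonable set, and it naively treats every `her` as a possessive.

```python
from itertools import product

# Assumed list of interchangeable possessives; extend or trim as needed.
POSSESSIVES = ("my", "your", "his", "her", "their", "our", "one's")


def pronoun_variants(phrase: str) -> list[str]:
    """Return every variant of the phrase with possessives swapped.

    Each token that is a possessive is replaced by every entry in
    POSSESSIVES; all other tokens are kept as they are.
    """
    options = [
        POSSESSIVES if token.lower() in POSSESSIVES else (token,)
        for token in phrase.split()
    ]
    return [" ".join(choice) for choice in product(*options)]


for variant in pronoun_variants("stand by my point"):
    print(variant)
```

The number of variants multiplies with every possessive in the phrase, so this is only practical for short idioms.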
Why?
Proximity and order matter when searching for idioms.
How?
Are there any functions that reward order and proximity?