Study: Full-text search queries in Elasticsearch

The goal

What query do I want to construct?

1. takes proximity of the terms into account
2. takes order of the terms into account (because I'm searching for use cases of an expression)
3. allows partial match (with configurable threshold) -> the first priority
4. auto-generates synonyms for query terms
5. does lemmatisation or stemming to match terms whose lemmas are the same

Priority: a query that is capable of doing 1, 2, and 3. As for the rest, I'll think about how to do them later.

very close to what I want - `Intervals` query (but with one problem)

intervals query: A full text query that allows fine-grained control of the ordering and proximity of matching terms.

So yeah, with intervals query, I can do 1 and 2.

Here is an example request:

GET general_idx/_search
{
  "query": {
    "intervals": {
      "context": {
        "match": {
          "query": "have one's cake and eat it too",
          "max_gaps": 3,
          "ordered": true
        }
      }
    }
  }
}

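The request is quite straightforward: the max_gaps parameter controls proximity (the maximum number of positions allowed between the matching terms), and the ordered parameter requires the terms to appear in the given order.

But one problem with this is that it does not allow partial matches, and partial match is one of the three must-haves.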

Here is an example that illustrates this problem:

Left: no partial match for `intervals` query. Right: partial match supported by `match` query.

With the intervals query, it's great that I have fine control over the proximity and order of terms, but I need a query that can do something like this:

`have one's cake and eat it too` would also match, for example, `have (whatever terms) cake (whatever terms) too`. That is, a document matches as long as some percentage of the query terms exists in the document.
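To make the "some percentage of the query terms" idea concrete, here is a toy sketch in Python. It is not how Elasticsearch analyses or scores text: it just splits on whitespace and lowercases, and the function name `matches_partially` and the `threshold` parameter are made up for illustration.

```python
def matches_partially(query: str, document: str, threshold: float = 0.5) -> bool:
    """Return True if at least `threshold` of the query terms occur in the document.

    Toy model only: whitespace tokenisation, lowercasing, no order or proximity.
    """
    query_terms = query.lower().split()
    if not query_terms:
        return False
    doc_terms = set(document.lower().split())
    present = sum(1 for term in query_terms if term in doc_terms)
    return present / len(query_terms) >= threshold


query = "have one's cake and eat it too"
print(matches_partially(query, "you can have your cake and eat it too"))  # 6 of 7 terms
print(matches_partially(query, "eat it later"))                           # 2 of 7 terms
```

Note that this check ignores order and proximity entirely, which is exactly the gap the rest of this study tries to close.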

What about joining `intervals` and `match`?

intervals can do 1 and 2. match can do 3. So could I simply join them with must to get the best of both worlds?

No, that wouldn't work, because the results of match would be ranked regardless of the proximity and order of the terms.

Is there a way I can do something like `minimum_should_match` in `intervals` query?

Okay, since the intervals query already does 1 and 2, it is more plausible to come up with a way to do 3 with the intervals query than to figure out how to do 1 and 2 with the match query.

What rules do we have for the intervals query? We have match, prefix, wildcard, fuzzy, all_of, and any_of.

Let's see if any of these rules (or a combination of these rules) could do something similar to minimum_should_match.

constructing N-grams should do the trick?

That's what Prof. Nenadic suggested to me. If you want to search for `have one's cake and eat it too`, break it down into, e.g., 2-grams like so:

have one's
- followed by: one's cake
- followed by: cake and
- followed by: and eat
- followed by: eat it
- followed by: it too.
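The breakdown above can be generated mechanically. Below is a minimal sketch in Python, assuming the phrase is split on whitespace only (Elasticsearch's analyzer may tokenise differently); the helper name `word_ngrams` is hypothetical.

```python
from typing import List


def word_ngrams(phrase: str, n: int = 2) -> List[str]:
    """Return the contiguous word n-grams of a phrase, in order."""
    words = phrase.split()
    if len(words) < n:
        # Too short to form one n-gram: fall back to the whole phrase (if any).
        return [" ".join(words)] if words else []
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


for gram in word_ngrams("have one's cake and eat it too", n=2):
    print(gram)
```

A phrase of 7 words yields 6 overlapping 2-grams, one per line of the list above.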

Okay, let's try doing this. Can I do this with an intervals query?

N-grams in `intervals` query

I tried the following query:

GET general_idx/_search
{
  "query": {
    "intervals": {
      "context": {
        "any_of": {
          "intervals": [
            {
              "match": {
                "query": "have one's",
                "max_gaps": 3,
                "ordered": true
              }
            },
            {
              "match": {
                "query": "one's cake",
                "max_gaps": 3,
                "ordered": true
              }
            },
            {
              "match": {
                "query": "cake and",
                "max_gaps": 3,
                "ordered": true
              }
            },
            {
              "match": {
                "query": "and eat",
                "max_gaps": 3,
                "ordered": true
              }
            },
            {
              "match": {
                "query": "eat it",
                "max_gaps": 3,
                "ordered": true
              }
            },
            {
              "match": {
                "query": "it too",
                "max_gaps": 3,
                "ordered": true
              }
            }
          ]
        }
      }
    }
  }
}
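Writing six near-identical match rules by hand is error-prone, so the request body could be generated instead. This is only a sketch: `build_any_of_query` and `word_ngrams` are hypothetical helpers, the field name `context` and the `max_gaps`/`ordered` values are taken from the request above, and the script only builds and prints the JSON body rather than sending it to a cluster.

```python
import json
from typing import Any, Dict, List


def word_ngrams(phrase: str, n: int = 2) -> List[str]:
    """Contiguous word n-grams, splitting on whitespace only (an assumption)."""
    words = phrase.split()
    if len(words) < n:
        return [" ".join(words)] if words else []
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def build_any_of_query(phrase: str, field: str = "context", n: int = 2,
                       max_gaps: int = 3, ordered: bool = True) -> Dict[str, Any]:
    """Build an intervals query whose any_of rule holds one match rule per n-gram."""
    rules = [
        {"match": {"query": gram, "max_gaps": max_gaps, "ordered": ordered}}
        for gram in word_ngrams(phrase, n)
    ]
    return {"query": {"intervals": {field: {"any_of": {"intervals": rules}}}}}


body = build_any_of_query("have one's cake and eat it too")
print(json.dumps(body, indent=2))
```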

and this is the result:

{
  "took" : 46,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10000,
      "relation" : "gte"
    },
    "max_score" : 0.84615386,
    "hits" : [
      {
        "_index" : "general_idx",
        "_type" : "_doc",
        "_id" : "VmKN7MWrLJs|auto|en|6847893640293742646",
        "_score" : 0.84615386,
        "_source" : {
          "start" : 9.0,
          "duration" : 4.04,
          "content" : "eat it later",
          "prev_id" : "VmKN7MWrLJs|auto|en|-5714611046218093991",
          "next_id" : "VmKN7MWrLJs|auto|en|-7648022219454532643",
          "context" : "and then eat it and then poop it out and eat it later like I'm take the plastic off and eat it",
          "caption" : {
            "id" : "VmKN7MWrLJs|auto|en",
            "is_auto" : true,
            "lang_code" : "en",
            "video" : {
              "id" : "VmKN7MWrLJs",
              "views" : 40801,
              "title" : "Best of the Week - March 29, 2015 - Joe Rogan Experience",
              "publish_date_int" : "20150405",
              "category" : "People & Blogs",
              "channel" : {
                "id" : "UCzQUP1qoWDoEbmsQxvdjxgQ",
                "subs" : 9880000,
                "lang_code" : "en"
              }
            }
          }
        }
      },
      {
        "_index" : "general_idx",
        "_type" : "_doc",
        "_id" : "8ylL8YIs7C0|auto|en|-6040570853495306226",
        "_score" : 0.8,
        "_source" : {
          "start" : 6515.28,
          "duration" : 2.58,
          "content" : "but you kind of get to have your cake",
          "prev_id" : "8ylL8YIs7C0|auto|en|5036019337676617779",
          "next_id" : "8ylL8YIs7C0|auto|en|-6573769737227039838",
          "context" : "cellular cleanup cellular auto feature but you kind of get to have your cake and eat it too because you have a bunch",
          "caption" : {
            "id" : "8ylL8YIs7C0|auto|en",
            "is_auto" : true,
            "lang_code" : "en",
            "video" : {
              "id" : "8ylL8YIs7C0",
              "views" : 1200341,
              "title" : "Joe Rogan Experience #1235 - Ben Greenfield",
              "publish_date_int" : "20190130",
              "category" : "People & Blogs",
              "channel" : {
                "id" : "UCzQUP1qoWDoEbmsQxvdjxgQ",
                "subs" : 9880000,
                "lang_code" : "en"
              }
            }
          }
        }
      },
...

where,

- A: the most highly ranked document was: and then **eat it** and then poop it out and **eat it** later like I'm take the plastic off and **eat it** (0.84)
- B: 2nd place: cellular cleanup cellular auto feature but you kind of get to have your cake and eat it too because you have a bunch. (0.8)
- C: 3rd place: but you kind of get to have your cake and eat it too because you have a bunch of calories at the end of that are you (0.8)

Well, I've gotten close to the solution, but A should be ranked significantly lower than B and C, since it is clearly not a use of the idiom we're looking for (it has no reference to cake). The catch is that any_of is satisfied by any single 2-gram, so a document that merely contains eat it still matches.

So, how could we fix this then?

If you increase N to 3, you wouldn't benefit that much from N-gram search, since most idioms are shorter than 3-4 words. ~N should be fixed at 2.~

Well, we could also try varying N, depending on the length of the phrase.

len(phrase) < 5: N = 2
5 <= len(phrase): N = 3
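The rule above is small enough to write down directly. A minimal sketch, assuming len(phrase) means the number of whitespace-separated words rather than characters; the function name `choose_n` is hypothetical.

```python
def choose_n(phrase: str) -> int:
    """Pick the n-gram size from the phrase length, counted in words."""
    num_words = len(phrase.split())
    # Fewer than 5 words: 2-grams; 5 words or more: 3-grams.
    return 2 if num_words < 5 else 3


print(choose_n("kick the bucket"))                 # 3 words
print(choose_n("have one's cake and eat it too"))  # 7 words
```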

Or, I think we could limit the number of matches to 1 (the one with the best score)? But applying strict rules to algorithms is generally bad (it does not scale well to other contexts).

Can I make use of the filter parameter? Well, again, rigid rules are usually bad.

Well, this works like a charm: each 2-gram becomes its own intervals clause under bool should, and minimum_should_match sets the partial-match threshold:

GET general_idx/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "intervals": {
            "context": {
              "match": {
                "query": "have one's",
                "max_gaps": 2,
                "ordered": true
              }
            }
          }
        },
        {
          "intervals": {
            "context": {
              "match": {
                "query": "one's cake",
                "max_gaps": 2,
                "ordered": true
              }
            }
          }
        },
        {
          "intervals": {
            "context": {
              "match": {
                "query": "cake and",
                "max_gaps": 2,
                "ordered": true
              }
            }
          }
        },
        {
          "intervals": {
            "context": {
              "match": {
                "query": "and eat",
                "max_gaps": 2,
                "ordered": true
              }
            }
          }
        },
        {
          "intervals": {
            "context": {
              "match": {
                "query": "eat it",
                "max_gaps": 2,
                "ordered": true
              }
            }
          }
        },
        {
          "intervals": {
            "context": {
              "match": {
                "query": "it too",
                "max_gaps": 2,
                "ordered": true
              }
            }
          }
        }
      ],
      "minimum_should_match": "75%"
    }
  }
}
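The same request body can be generated for any phrase. The sketch below is an illustration under assumptions, not the project's actual code: `build_should_query`, `word_ngrams` and `required_clauses` are hypothetical names, the defaults mirror the request above, and `required_clauses` encodes my reading of the Elasticsearch documentation that a positive percentage is rounded down to a whole number of clauses.

```python
import json
from typing import Any, Dict, List


def word_ngrams(phrase: str, n: int = 2) -> List[str]:
    """Contiguous word n-grams, splitting on whitespace only (an assumption)."""
    words = phrase.split()
    if len(words) < n:
        return [" ".join(words)] if words else []
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def build_should_query(phrase: str, field: str = "context", n: int = 2,
                       max_gaps: int = 2, ordered: bool = True,
                       minimum_should_match: str = "75%") -> Dict[str, Any]:
    """One intervals clause per n-gram, combined under bool.should."""
    clauses = [
        {"intervals": {field: {"match": {
            "query": gram, "max_gaps": max_gaps, "ordered": ordered}}}}
        for gram in word_ngrams(phrase, n)
    ]
    return {"query": {"bool": {"should": clauses,
                               "minimum_should_match": minimum_should_match}}}


def required_clauses(num_clauses: int, percent: int) -> int:
    """Number of should clauses a positive percentage demands (rounded down)."""
    return (num_clauses * percent) // 100


body = build_should_query("have one's cake and eat it too")
print(json.dumps(body, indent=2))
print(required_clauses(len(body["query"]["bool"]["should"]), 75))
```

With six 2-gram clauses, 75% works out to 4 required clauses under that rounding, so a document that only contains eat it no longer qualifies, while a document may still miss up to two 2-grams.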

eubinecto / youtora

improve elasticsearch query with n-grams search #169
