eubinecto / youtora

Search YouTube videos like you search books

improve elasticsearch query with n-grams search #169

Open eubinecto opened 4 years ago

eubinecto commented 4 years ago

why?

Proximity and order matter when searching for idioms.

In the screenshot below, the result ranked first should rank below the one beneath it.
image

how?

Are there any functions which incentivize order & proximity?

eubinecto commented 4 years ago

Study: Full-text search queries in elasticsearch

The goal

What query do I want to construct?

  1. takes proximity of the terms into account
  2. takes order of the terms into account (because I'm searching for use cases of an expression)
  3. allows partial match (with configurable threshold) -> the first priority
  4. auto generate synonyms for query terms
  5. does lemmatisation or stemming to match terms whose lemmas are the same

Priority: a query that can do 1, 2 and 3. As for the rest, I'll think about how to do them later.

very close to what I want - Intervals query (but with one problem)

intervals query: A full text query that allows fine-grained control of the ordering and proximity of matching terms.

So yeah, with intervals query, I can do 1 and 2.

Here is an example request:

GET general_idx/_search
{
  "query": {
    "intervals": {
      "context": {
        "match": {
          "query": "have one's cake and eat it too",
          "max_gaps": 3,
          "ordered": true
        }
      }
    }
  }
}

The request is quite straightforward: the max_gaps parameter controls proximity, and the ordered parameter controls ordering.
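As a minimal sketch, the request body above could be built from Python like this. The helper name `intervals_match_query` is hypothetical (it is not part of any Elasticsearch client); only the query structure itself comes from the request above.

```python
import json


def intervals_match_query(field, phrase, max_gaps=3, ordered=True):
    """Build an intervals query body with a single match rule.

    Hypothetical helper: it only assembles the JSON shown in the
    request above and does not talk to Elasticsearch.
    """
    return {
        "query": {
            "intervals": {
                field: {
                    "match": {
                        "query": phrase,
                        "max_gaps": max_gaps,  # proximity
                        "ordered": ordered,    # term order
                    }
                }
            }
        }
    }


body = intervals_match_query("context", "have one's cake and eat it too")
print(json.dumps(body, indent=2))
```

The resulting dict can be sent as the body of `GET general_idx/_search` with whatever client is in use.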

But one problem with this is that it does not allow partial matches, and partial matching is one of the three must-haves.

Here is an example that illustrates this problem:

| no partial match for intervals query | partial match supported by match query |
| --- | --- |
| image | image |
| image | image |

With an intervals query, it's great that I can have fine control over the proximity and order of terms, but I need a query that can do something like this:

have one's cake and eat it too should also match, for example, have (whatever terms) cake (whatever terms) too. That is, a document matches as long as some percentage of the query terms exist in the document.
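For comparison, this kind of percentage-based partial matching is what a plain match query offers through its `minimum_should_match` parameter. A rough sketch follows; the helper name is hypothetical, and the "75%" threshold is only an illustrative value, not necessarily the setting used in the screenshots above.

```python
def partial_match_query(field, phrase, minimum_should_match="75%"):
    """Build a match query that tolerates missing terms.

    Hypothetical helper. Note that a match query like this ignores
    the order and proximity of the terms.
    """
    return {
        "query": {
            "match": {
                field: {
                    "query": phrase,
                    "minimum_should_match": minimum_should_match,
                }
            }
        }
    }


match_body = partial_match_query("context", "have one's cake and eat it too")
```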

What about joining intervals and match?

intervals can do 1 and 2. match can do 3. Then I could simply join them with must to get the best of both worlds?

No, that wouldn't work, because the results of match would be ranked regardless of the proximity and order of the terms.

Is there a way I can do something like minimum_should_match in intervals query?

Okay, since intervals query already does 1 and 2, it is more plausible to come up with a way to do 3 with intervals query than figuring out how to do 1 and 2 with match query.

What rules do we have for intervals query? We have match, prefix, wildcard, fuzzy, all_of, any_of.

Let's see if any of these rules (or a combination of these rules) could do something similar to minimum_should_match.

constructing N-grams should do the trick?

So that's what Prof. Nenadic suggested to me. If you want to search for have one's cake and eat it too, then break it down into, e.g., 2-grams like so:

have one's, one's cake, cake and, and eat, eat it, it too

Okay, let's try this. Can I do this with an intervals query?
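The 2-gram breakdown can be sketched in a few lines of Python. This assumes simple whitespace tokenization, which will not necessarily agree with the tokens the index analyzer produces; `word_ngrams` is a hypothetical helper name.

```python
def word_ngrams(phrase, n=2):
    """Split a phrase into overlapping word n-grams.

    Assumption: tokens are whitespace-separated. Phrases shorter
    than n tokens are returned as a single n-gram.
    """
    tokens = phrase.split()
    if not tokens:
        return []
    if len(tokens) < n:
        return [" ".join(tokens)]
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


bigrams = word_ngrams("have one's cake and eat it too")
print(bigrams)
# ["have one's", "one's cake", 'cake and', 'and eat', 'eat it', 'it too']
```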

N-grams in intervals query

I tried the following query:

GET general_idx/_search
{
  "query": {
    "intervals": {
      "context": {
        "any_of": {
          "intervals": [
            {
              "match": {
                "query": "have one's",
                "max_gaps": 3,
                "ordered": true
              }
            },
            {
              "match": {
                "query": "one's cake",
                "max_gaps": 3,
                "ordered": true
              }
            },
            {
              "match": {
                "query": "cake and",
                "max_gaps": 3,
                "ordered": true
              }
            },
            {
              "match": {
                "query": "and eat",
                "max_gaps": 3,
                "ordered": true
              }
            },
            {
              "match": {
                "query": "eat it",
                "max_gaps": 3,
                "ordered": true
              }
            },
            {
              "match": {
                "query": "it too",
                "max_gaps": 3,
                "ordered": true
              }
            }
          ]
        }
      }
    }
  }
}

and this is the result:

{
  "took" : 46,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10000,
      "relation" : "gte"
    },
    "max_score" : 0.84615386,
    "hits" : [
      {
        "_index" : "general_idx",
        "_type" : "_doc",
        "_id" : "VmKN7MWrLJs|auto|en|6847893640293742646",
        "_score" : 0.84615386,
        "_source" : {
          "start" : 9.0,
          "duration" : 4.04,
          "content" : "eat it later",
          "prev_id" : "VmKN7MWrLJs|auto|en|-5714611046218093991",
          "next_id" : "VmKN7MWrLJs|auto|en|-7648022219454532643",
          "context" : "and then eat it and then poop it out and eat it later like I'm take the plastic off and eat it",
          "caption" : {
            "id" : "VmKN7MWrLJs|auto|en",
            "is_auto" : true,
            "lang_code" : "en",
            "video" : {
              "id" : "VmKN7MWrLJs",
              "views" : 40801,
              "title" : "Best of the Week - March 29, 2015 - Joe Rogan Experience",
              "publish_date_int" : "20150405",
              "category" : "People & Blogs",
              "channel" : {
                "id" : "UCzQUP1qoWDoEbmsQxvdjxgQ",
                "subs" : 9880000,
                "lang_code" : "en"
              }
            }
          }
        }
      },
      {
        "_index" : "general_idx",
        "_type" : "_doc",
        "_id" : "8ylL8YIs7C0|auto|en|-6040570853495306226",
        "_score" : 0.8,
        "_source" : {
          "start" : 6515.28,
          "duration" : 2.58,
          "content" : "but you kind of get to have your cake",
          "prev_id" : "8ylL8YIs7C0|auto|en|5036019337676617779",
          "next_id" : "8ylL8YIs7C0|auto|en|-6573769737227039838",
          "context" : "cellular cleanup cellular auto feature but you kind of get to have your cake and eat it too because you have a bunch",
          "caption" : {
            "id" : "8ylL8YIs7C0|auto|en",
            "is_auto" : true,
            "lang_code" : "en",
            "video" : {
              "id" : "8ylL8YIs7C0",
              "views" : 1200341,
              "title" : "Joe Rogan Experience #1235 - Ben Greenfield",
              "publish_date_int" : "20190130",
              "category" : "People & Blogs",
              "channel" : {
                "id" : "UCzQUP1qoWDoEbmsQxvdjxgQ",
                "subs" : 9880000,
                "lang_code" : "en"
              }
            }
          }
        }
      },
...

where,

Well, I've got close to the solution, but A should be ranked significantly lower than B and C, as it is clearly not a use of the idiom we're looking for (it has no reference to cake).

So, how could we fix this then?

If you increase N to 3, you wouldn't benefit that much from N-gram search, since most idioms are shorter than 3-4 words. ~N should be fixed at 2.~

Well, we could also try varying N depending on the length of the phrase.
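One way to picture "varying N" is a small rule that picks N from the token count. The sketch below is purely illustrative: the function name `choose_n` and the cut-off of 4 tokens are assumptions made up for this example, not values taken from any experiment in this thread.

```python
def choose_n(phrase, short_limit=4):
    """Pick an n-gram size from the phrase length.

    Hypothetical rule: 2-grams for phrases of up to `short_limit`
    tokens, 3-grams for longer ones, never more than the number of
    tokens and never less than 1.
    """
    num_tokens = len(phrase.split())
    n = 2 if num_tokens <= short_limit else 3
    return max(1, min(n, num_tokens))


print(choose_n("stand by my point"))               # 4 tokens
print(choose_n("have one's cake and eat it too"))  # 7 tokens
```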

Or, I think we could limit the number of matches to 1 (the one with the best score)? But applying strict rules to algorithms is generally bad (it does not scale well to other contexts).

Can I make use of the filter parameter? Well, again, rigid rules are usually bad.

Well, this works like a charm:

GET general_idx/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "intervals": {
            "context": {
              "match": {
                "query": "have one's",
                "max_gaps": 2,
                "ordered": true
              } 
            }
          } 
        },
        {
          "intervals": {
            "context": {
              "match": {
                "query": "one's cake",
                "max_gaps": 2,
                "ordered": true
              } 
            }
          } 
        },{
          "intervals": {
            "context": {
              "match": {
                "query": "cake and",
                "max_gaps": 2,
                "ordered": true
              } 
            }
          } 
        },{
          "intervals": {
            "context": {
              "match": {
                "query": "and eat",
                "max_gaps": 2,
                "ordered": true
              } 
            }
          } 
        },{
          "intervals": {
            "context": {
              "match": {
                "query": "eat it",
                "max_gaps": 2,
                "ordered": true
              } 
            }
          } 
        },{
          "intervals": {
            "context": {
              "match": {
                "query": "it too",
                "max_gaps": 2,
                "ordered": true
              } 
            }
          } 
        }
      ],
      "minimum_should_match": "75%"
    }
  }
}
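
The request above could be generated for any phrase with a sketch like the following. The helper names are hypothetical and the bigrams come from plain whitespace splitting, which is an assumption. According to the Elasticsearch documentation for minimum_should_match, a positive percentage is rounded down, so "75%" of 6 clauses means at least 4 bigram clauses have to match; `required_clauses` only mirrors that arithmetic.

```python
def word_bigrams(phrase):
    """Overlapping word 2-grams, assuming whitespace tokenization."""
    tokens = phrase.split()
    return [" ".join(tokens[i:i + 2]) for i in range(len(tokens) - 1)]


def bigram_should_query(field, phrase, max_gaps=2, minimum_should_match="75%"):
    """Wrap one intervals query per bigram in bool/should (hypothetical helper)."""
    clauses = [
        {
            "intervals": {
                field: {
                    "match": {"query": gram, "max_gaps": max_gaps, "ordered": True}
                }
            }
        }
        for gram in word_bigrams(phrase)
    ]
    return {
        "query": {
            "bool": {
                "should": clauses,
                "minimum_should_match": minimum_should_match,
            }
        }
    }


def required_clauses(total, percent):
    """Number of should clauses a positive percentage demands (rounded down)."""
    return total * percent // 100


bool_body = bigram_should_query("context", "have one's cake and eat it too")
needed = required_clauses(len(bool_body["query"]["bool"]["should"]), 75)
```

With this threshold, a document that only satisfies a couple of the bigram clauses is no longer returned at all, instead of merely being ranked lower.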
eubinecto commented 4 years ago

Still, one more problem

e.g. if you want to search for "stand by my point", "stand by your point" won't match.

I could make it match by substituting pronouns with alternatives (my -> one's, her, his, their, etc.). Think about doing this later.
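
A rough sketch of the pronoun substitution idea: expand the phrase into one variant per possessive before building the queries. The word list and the function name are assumptions for illustration only, and "her" is treated as a possessive even where it may be an object pronoun.

```python
from itertools import product

# Assumed list of interchangeable possessives; extend or trim as needed.
POSSESSIVES = ["my", "your", "his", "her", "its", "our", "their", "one's"]


def possessive_variants(phrase):
    """Return every variant of the phrase with its possessives swapped.

    Hypothetical helper: each possessive token becomes a slot that can
    hold any entry of POSSESSIVES; other tokens are kept as they are.
    """
    slots = [
        POSSESSIVES if token.lower() in POSSESSIVES else [token]
        for token in phrase.split()
    ]
    return [" ".join(choice) for choice in product(*slots)]


variants = possessive_variants("stand by my point")
```

Each variant could then be turned into its own set of bigram clauses, at the cost of a larger query.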