elastic / elasticsearch

Free and Open, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

[Learning to rank] Add support for feature variables that do not interact with field data #108404

Open daixque opened 1 month ago

daixque commented 1 month ago

Overview

As of 8.13, the learning to rank functionality of Elasticsearch and Eland only supports feature variables that are associated with field data in an Elasticsearch index.

But sometimes a user may need to train the model with feature values that are provided directly rather than as field data. Elasticsearch and Eland should be able to accept feature values that do not interact with field data.

For example, our notebook shows how to implement a search app for movie data. In this example, all feature values are provided by Elasticsearch, such as a BM25 score and/or the result of a script score. But sometimes users want to train their model with data that comes from outside Elasticsearch. A typical example would be user profile attributes such as age and/or gender, because those are not related to each document (in this case, each movie).

Model training with Eland

At the moment, LTRModelConfig only accepts a list of QueryFeatureExtractor objects. A new version of Eland should also accept another kind of extractor that represents a directly supplied feature value, one that is not associated with any field data in the index.

Elasticsearch learning to rank query

When an application issues the query, feature values should be passed directly to Elasticsearch. It may look like rescore.learning_to_rank.params.user_age in the example below:

GET my-index/_search
{
  "query": { 
    "multi_match": {
      "fields": ["title", "content"],
      "query": "the quick brown fox"
    }
  },
  "rescore": {
    "learning_to_rank": {
      "model_id": "ltr-model", 
      "params": { 
        "query_text": "the quick brown fox",
        "user_age": 20
      }
    },
    "window_size": 100 
  }
}
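
For illustration, here is a minimal Python sketch of how an application might assemble the request body above before sending it to Elasticsearch. The helper name build_ltr_search_body and its arguments are hypothetical (they are not part of Elasticsearch or Eland); the sketch only builds the JSON and does not send it.

```python
import json
from typing import Any, Dict


def build_ltr_search_body(
    query_text: str,
    model_id: str,
    extra_params: Dict[str, Any],
    window_size: int = 100,
) -> Dict[str, Any]:
    """Hypothetical helper: build the search body shown above.

    `extra_params` carries the directly supplied feature values
    (for example the user's age) next to the query text.
    """
    params = {"query_text": query_text, **extra_params}
    return {
        "query": {
            "multi_match": {
                "fields": ["title", "content"],
                "query": query_text,
            }
        },
        "rescore": {
            "learning_to_rank": {"model_id": model_id, "params": params},
            "window_size": window_size,
        },
    }


# Same values as in the request above.
body = build_ltr_search_body("the quick brown fox", "ltr-model", {"user_age": 20})
print(json.dumps(body, indent=2))
```

The resulting dict could then be passed as the body of a search request against my-index with whatever client the application already uses.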
elasticsearchmachine commented 1 month ago

Pinging @elastic/ml-core (Team:ML)

afoucret commented 1 month ago

@daixque As long as the feature is numeric, there is a workaround: write a script_score-based query feature extractor that returns the param directly.

In your case, here is what it may look like in Eland:

        QueryFeatureExtractor(
            feature_name="user_age",
            query={
                "script_score": {
                    "query": {"match_all": {} },
                    "script": {"source": "return params.user_age"},
                }
            },
        ),
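
If several such pass-through features are needed (age, and so on), the query part of the extractor could be generated rather than written by hand. This is only a sketch: param_feature_query is a hypothetical helper, not an Eland function, and it simply reproduces the script_score shape shown above for a given param name.

```python
from typing import Any, Dict


def param_feature_query(param_name: str) -> Dict[str, Any]:
    """Hypothetical helper: script_score query that returns one param as-is.

    Mirrors the query shape from the snippet above; only numeric params
    make sense here, since the script result is used as a score.
    """
    # Guard against names that would not be a plain identifier in the script.
    if not param_name.isidentifier():
        raise ValueError(f"invalid param name: {param_name!r}")
    return {
        "script_score": {
            "query": {"match_all": {}},
            "script": {"source": f"return params.{param_name}"},
        }
    }


# Produces the same query dict as the user_age extractor above.
user_age_query = param_feature_query("user_age")
```

The returned dict would be passed as the query argument of QueryFeatureExtractor, with feature_name set to the same param name.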
daixque commented 4 weeks ago

Hi @afoucret, thank you for your comment. I'm aware that this kind of workaround can be used, but I feel it's not intuitive (and may not be performant when building the training dataset). So it would be great if Eland and Elasticsearch supported it natively.