elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

[NLP] Simplify configuration for NLP models at inference #97042

Closed davidkyle closed 5 months ago

davidkyle commented 1 year ago

Description

Early in the development process there was a design decision that NLP models and Boosted Tree models should use the same APIs, both for convenience and so that users don't have to learn new APIs. However, Boosted Tree models are different in that they take any number of inputs, which requires extra configuration. Configuring NLP models in ingest pipelines has therefore inherited complexity from Boosted Tree models that is redundant for NLP and makes it difficult to get started.

Any changes must be backwards compatible: the current configuration options would still be supported alongside any enhancements.

Ingest Pipelines

Take an ingest pipeline configured for ELSER

PUT _ingest/pipeline/elser
{
  "processors": [
    {
      "inference": {
        "model_id": ".elser_model_1",
        "field_map": {
          "body": "text_field" <1>
        },
        "target_field": "ml", <2>
        "inference_config": {
          "text_expansion": { 
            "results_field": "tokens" <3>
          }
        }
      }
    }
  ]
}

1. field_map

The NLP model is configured with an input field name when it is created. This cannot be changed at inference time, so the user must map the name of the field in the input document to the expected input field name. Conventionally this is named text_field if Eland is used to upload the model, but it may be something else, in which case the model configuration must be checked to find the expected field name.
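If in doubt, the expected input field name can be read from the stored model configuration. A sketch (the response is abbreviated and its exact shape may vary by version):

```console
GET _ml/trained_models/.elser_model_1
```

In the response, the expected input field name appears under the model's input settings, e.g. an abbreviated excerpt:

```console
{
  "trained_model_configs": [
    {
      "model_id": ".elser_model_1",
      "input": {
        "field_names": ["text_field"]
      }
    }
  ]
}
```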

field_map makes more sense for boosted tree models where there are multiple inputs. For NLP the input is a single field which could be set simply with:

input_field: body_text

Users often get the mapping the wrong way round, as it is not intuitive which direction the map goes (the key is the document field, the value is the model's expected input field).

2. target_field

It is not obvious that the results will be written to the concatenation of target_field and results_field, where target_field is the top-level object. target_field cannot be null or empty, so there is no way to put the results in the root of the document. The default value is ml.inference, so if not set the results will be written to ml.inference.<results_field>.

Add the ability to set a null or empty target_field so that results can go into the root of the document.
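To illustrate the concatenation: with target_field set to ml and results_field set to tokens, as in the example above, the results land under ml.tokens. A hypothetical document after ingest (the token weights are purely illustrative):

```console
{
  "body": "the text to analyse",
  "ml": {
    "tokens": {
      "text": 0.53,
      "analyse": 1.24
    }
  }
}
```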

3. results_field

The results_field is specified two levels down, inside two nested objects:

        "inference_config": {
          "text_expansion": {   <-- task type
            "results_field": "tokens" 
          }
        }

If the task type (in this case text_expansion) is wrong, an error is returned. This makes it hard to reuse configurations, as the correct task type must be known just to set results_field. This setting is common to both Boosted Tree and NLP models and could be lifted out of inference_config.

The default value of results_field is predicted_value so setting this field is not strictly required.

A Simpler Option

Applying these ideas, the config would be less verbose and less error prone. Reusing it would, in many cases, be a matter of changing only the model_id or input_field.

PUT _ingest/pipeline/elser
{
  "processors": [
    {
      "inference": {
        "model_id": ".elser_model_1",
        "input_field": "body_text",
        "target_field": "ml", 
        "results_field": "tokens"        
      }
    }
  ]
}

Or if the user is happy with the default target & result field names:

PUT _ingest/pipeline/elser
{
  "processors": [
    {
      "inference": {
        "model_id": ".elser_model_1",
        "input_field": "body_text"
      }
    }
  ]
}

This would write the results to ml.inference.predicted_value.
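Whichever form the configuration takes, where the results land can be verified with the existing pipeline simulate API. A sketch against the ELSER pipeline from the first example (the document content is illustrative); the results should appear under ml.tokens in the simulated output:

```console
POST _ingest/pipeline/elser/_simulate
{
  "docs": [
    { "_source": { "body": "the text to analyse" } }
  ]
}
```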

_infer API

The _infer API accepts an array of docs; each doc should contain a field with the same name as the model's expected input field, which, as above, is conventionally text_field.

The example for ELSER is:

POST _ml/trained_models/.elser_model_1/_infer
{
  "docs": [{"text_field": "the text to analyse"}]
}
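For reference, a successful _infer call returns the results under the default results_field, predicted_value. For ELSER's text_expansion task the response is roughly of this shape (token weights are illustrative):

```console
{
  "inference_results": [
    {
      "predicted_value": {
        "text": 0.53,
        "analyse": 1.24
      }
    }
  ]
}
```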

If the correct input field is missing a helpful error is returned:

POST _ml/trained_models/.elser_model_1/_infer
{
  "docs": [{"not_the_field_the_model_expects": "the text to analyse"}]
}

returns:

"Input field [text_field] does not exist in the source document"

NLP models expect a single text input, not a document. Instead of extracting the field from docs, a simpler option is to specify the input directly, without reference to the field name the model expects in a document.

POST _ml/trained_models/.elser_model_1/_infer
{
  "input_text": ["the text to analyse"]
}

Or for multiple requests:

POST _ml/trained_models/.elser_model_1/_infer
{
  "input_text": ["the text to analyse", "some more", "and another]
}

elasticsearchmachine commented 1 year ago

Pinging @elastic/ml-core (Team:ML)

alyokaz commented 1 year ago

I'd like to take this if possible.

droberts195 commented 1 year ago

@AllyKaz I think this is one we'll handle internally, as it could need a lot of discussion and we might decide to do some bits but not others.

alyokaz commented 1 year ago

@droberts195 Understood. I had a quick look over it regardless, and the first part, at least, seems a pretty straightforward matter of a few changes in InferenceProcessor.Factory. It seems like it should fit quite neatly into the current tests too.