Add optional no_match_value parameter to the enrich processor

dschneiter commented 2 years ago

Description

When enriching documents based on a multi-value match field (with the max_matches parameter set to a value > 0), it is possible that some values have a matching entry in the look-up index while others do not.

It would be convenient to be able to configure an optional no_match_value and enrich the documents with this value in case there was no entry found in the enrich index.

Use-case:

Having only email addresses available in our main index
Having a look-up policy that matches email addresses with proper names
Having an entry for knownemployee@elastic.co in our enrich index, but not for doesnotexist@elastic.co

Current behavior:

"_source" : {
  "email_addresses" : [
    "knownemployee@elastic.co",
    "doesnotexist@elastic.co"
  ],
  "enriched_user_info" : [
    {
      "name" : "Known Employee",
      "email" : "knownemployee@elastic.co"
    }
  ]
}

Ideal behaviour:

"_source" : {
  "email_addresses" : [
    "knownemployee@elastic.co",
    "doesnotexist@elastic.co"
  ],
  "enriched_user_info" : [
    {
      "name" : "Known Employee",
      "email" : "knownemployee@elastic.co"
    },
    {
      "name" : "Unknown Employee",
      "email" : "doesnotexist@elastic.co"
    }
  ]
}

It would be nice to be able to specify a default value to use for enrichment when the lookup in the enrich index is not successful. Without such an option, one needs to complement the enrich processor with a script processor that checks whether every single match value is present in the enrichment field and, where it is missing, adds an entry with a default value. That is quite a lot of effort, plus the complexity of dealing with Painless, for a scenario that is neither strange nor uncommon.
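To make the requested semantics concrete, here is a minimal Python sketch. It is not Elasticsearch code: a plain dict stands in for the enrich index, and `enrich_with_default`, `unknown_employee`, and the callable form of `no_match_value` are hypothetical names used only for illustration.

```python
from typing import Callable, Dict, List

# Stand-in for the enrich index: match-field value -> enrich document.
LOOKUP: Dict[str, dict] = {
    "knownemployee@elastic.co": {
        "name": "Known Employee",
        "email": "knownemployee@elastic.co",
    },
}


def enrich_with_default(
    values: List[str],
    lookup: Dict[str, dict],
    no_match_value: Callable[[str], dict],
) -> List[dict]:
    """Return one entry per input value, using a default when nothing matches."""
    enriched = []
    for value in values:
        match = lookup.get(value)
        if match is not None:
            enriched.append(dict(match))  # copy the matched enrich document
        else:
            enriched.append(no_match_value(value))  # hypothetical fallback
    return enriched


def unknown_employee(email: str) -> dict:
    # Default object used when the look-up finds no entry (an assumption).
    return {"name": "Unknown Employee", "email": email}


doc = {"email_addresses": ["knownemployee@elastic.co", "doesnotexist@elastic.co"]}
doc["enriched_user_info"] = enrich_with_default(
    doc["email_addresses"], LOOKUP, unknown_employee
)
```

With this fallback, every value in the match field produces exactly one entry in the enrichment field, in the same order as the input values.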

dschneiter commented 2 years ago

I found a workaround that achieves what I want with only a limited amount of Painless, but it is still cumbersome; an optional configuration parameter would be a much nicer way to get the same result.

My workaround solution:

Using a foreach processor in the main pipeline:

  {
    "foreach": {
      "field": "email",
      "processor": {
        "pipeline": {
          "name": "single_enrichment"
        }
      }
    }
  }

and in the single_enrichment pipeline doing the following:

PUT _ingest/pipeline/single_enrichment
{
  "processors": [
    {
      "set": {
        "field": "tmp.email",
        "value": "{{{_ingest._value}}}"
      }
    },
    {
      "set": {
        "field": "tmp.name",
        "value": "Unknown Employee"
      }
    },
    {
      "enrich": {
        "field": "_ingest._value",
        "target_field": "tmp",
        "policy_name": "names_policy",
        "max_matches": 1,
        "ignore_missing": false,
        "override": true
      }
    },
    {
      "script": {
        "lang": "painless",
        "source": "if (ctx.enrichment == null) ctx.enrichment = []; ctx.enrichment.add(ctx.tmp)"
      }
    },
    {
      "remove": {
        "field": "tmp"
      }
    }
  ]
}

Comment on workaround approach

The tmp object is a temporary object that only "lives" during a single lookup; it is removed at the end of every execution of this pipeline. It has the same shape as the object produced by a successful look-up, but its two fields are initialized with the default values I'd like to see when the lookup fails: the email address used for the lookup, and the default value for "name". (Ideally, such a value or object could be specified as a no_match_value in the enrich processor.)

The workaround then does the actual enrichment step which - if successful - overwrites the tmp object with the values returned from the enrichment step.

Then the whole tmp object gets added to the target field enrichment. Painless was needed for this step because the append processor would have added a comma-separated string representation of the tmp object rather than the actual JSON object.
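The control flow of the workaround can be traced with a small Python simulation. This is only a sketch of the logic described above, not a reproduction of how Elasticsearch executes ingest pipelines: the `NAMES` dict stands in for the enrich index behind names_policy, and `single_enrichment` and `run_foreach` are hypothetical helper names.

```python
from typing import Dict

# Stand-in for the enrich index behind names_policy (an assumption).
NAMES: Dict[str, dict] = {
    "knownemployee@elastic.co": {
        "name": "Known Employee",
        "email": "knownemployee@elastic.co",
    },
}


def single_enrichment(ctx: dict, value: str, lookup: Dict[str, dict]) -> None:
    """Mimic the single_enrichment pipeline for one element of the array."""
    # The two set processors: pre-fill tmp with the default values.
    ctx["tmp"] = {"email": value, "name": "Unknown Employee"}

    # The enrich processor with override: on a match, tmp is replaced;
    # without a match, the defaults stay in place.
    match = lookup.get(value)
    if match is not None:
        ctx["tmp"] = dict(match)

    # The script processor: create the target list if needed, then append.
    if ctx.get("enrichment") is None:
        ctx["enrichment"] = []
    ctx["enrichment"].append(ctx["tmp"])

    # The remove processor: drop the temporary object.
    del ctx["tmp"]


def run_foreach(ctx: dict, field: str, lookup: Dict[str, dict]) -> None:
    """Mimic the foreach processor: run the sub-pipeline once per value."""
    for value in ctx[field]:
        single_enrichment(ctx, value, lookup)


ctx = {"email": ["knownemployee@elastic.co", "doesnotexist@elastic.co"]}
run_foreach(ctx, "email", NAMES)
```

After the run, `ctx["enrichment"]` holds one object per email address and no tmp field is left behind, which is the outcome a built-in no_match_value option would produce in a single processor.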

elasticmachine commented 2 years ago

Pinging @elastic/es-data-management (Team:Data Management)

elastic / elasticsearch

Add optional no_match_value parameter to the enrich processor #86238
