MI-DPLA / combine

Combine /kämˌbīn/ - Metadata Aggregator Platform
MIT License

Validation Scenarios based on ES query / mapped fields #239

Closed ghukill closed 6 years ago

ghukill commented 6 years ago

Because ES indexing runs before validation scenarios are run, it is possible to have Validation Scenarios that run based on mapped fields from a Job.

One form this might take is an ES query whose result set is matched against DB records. The most intuitive route seems to be that any records returned by the ES query are considered valid, and all others invalid.

However, thinking through this more, Validations need some kind of test name, which might suggest a slightly more complex JSON structure for the Validation payload. If that's the case, users could include a flag that would determine if matched Records are valid or invalid. Something akin to:

[
    {
      "test_name":"foo exists",
      "matches":"valid",
      "es_query":{
        "query":{
          "exists":{
            "field":"foo"
          }
        }
      }
    },
    {
      "test_name":"bar does not equal 'baz'",
      "matches":"invalid",
      "es_query":{
        "query":{
          "match":{
            "bar.keyword":"baz"
          }
        }
      }
    }
]
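
To make the proposed structure concrete, here is a minimal sketch of a sanity check for such a payload. The function name `check_payload` and the exact rules (non-empty `test_name`, `matches` limited to "valid" / "invalid", an `es_query` object containing a `query` key) are assumptions drawn from the example above, not Combine's actual implementation.

```python
import json

# Assumed set of allowed values for the "matches" flag, taken from the example payload.
ALLOWED_MATCHES = {"valid", "invalid"}


def check_payload(payload_text):
    """Parse a validation payload and return a list of problems (empty if well-formed)."""
    tests = json.loads(payload_text)
    if not isinstance(tests, list):
        return ["payload must be a JSON array of tests"]

    problems = []
    for i, test in enumerate(tests):
        if not isinstance(test, dict):
            problems.append(f"test {i}: must be a JSON object")
            continue
        name = test.get("test_name")
        if not isinstance(name, str) or not name:
            problems.append(f"test {i}: missing or empty test_name")
        if test.get("matches") not in ALLOWED_MATCHES:
            problems.append(f"test {i}: matches must be 'valid' or 'invalid'")
        es_query = test.get("es_query")
        if not isinstance(es_query, dict) or "query" not in es_query:
            problems.append(f"test {i}: es_query must be an object with a 'query' key")
    return problems
```

Running this over the two-test example above would return an empty list; a test with `"matches":"maybe"` would be reported rather than silently treated as valid.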

ghukill commented 6 years ago

This would also benefit from a location to test / build ES queries:

If this location will support saving / loading ES queries, there are now two different kinds of ES queries that would need differentiation:

ghukill commented 6 years ago

Proposing to remove the "matches" key from the validation JSON, as it essentially doubles the logic in Spark.

Instead, assume that records matching a query are valid, and require users to write each validation query, whether positive or negated, so that it matches exactly the valid records.
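
Under this alternative, a test like "bar does not equal 'baz'" would have to carry its negation inside the query itself. A rough sketch of what that rewrite could look like, assuming the standard Elasticsearch `bool` / `must_not` clause; the helper name `negate_query` is hypothetical:

```python
def negate_query(es_query):
    """Wrap the inner query in bool/must_not so it matches the complement set."""
    return {"query": {"bool": {"must_not": [es_query["query"]]}}}


# The "invalid" test from the earlier example, rewritten so that matches are valid.
bar_is_baz = {"query": {"match": {"bar.keyword": "baz"}}}
bar_is_not_baz = negate_query(bar_is_baz)
```

This keeps the Spark side to a single code path, at the cost of pushing the inversion onto whoever writes the query.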

ghukill commented 6 years ago

Scratch that, proposing to leave it in. The difference is only a leftsemi vs leftanti join in Spark, and keeping the key gives users writing these Validations that option. It also has the semantic benefit of contextualizing the nature of the query and the test message.
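
For illustration, the semi / anti distinction can be sketched in plain Python with sets standing in for the Spark DataFrames. This assumes the job collects failing records, and that record ids are what gets joined on; `failing_records` is a hypothetical name, not the code in the branch.

```python
def failing_records(record_ids, matched_ids, matches):
    """Return the ids that fail a test, given the ids returned by its ES query."""
    matched = set(matched_ids)
    if matches == "valid":
        # Matched records pass, so failures are the records with no match.
        # In Spark terms: records.join(hits, "id", "leftanti")
        return [r for r in record_ids if r not in matched]
    if matches == "invalid":
        # Matched records fail.
        # In Spark terms: records.join(hits, "id", "leftsemi")
        return [r for r in record_ids if r in matched]
    raise ValueError("matches must be 'valid' or 'invalid'")
```

With records `["a", "b", "c"]` and a query that returns only `"b"`, a "valid" test fails `a` and `c`, while an "invalid" test fails `b`.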

ghukill commented 6 years ago

Working in esqueryval branch.

Todo:

ghukill commented 6 years ago

Finis!