MoonsetJS / Moonset

A data processing framework on top of AWS.
Apache License 2.0

Cross-cutting problems #68

Open wangzhihao opened 4 years ago

wangzhihao commented 4 years ago

We need to apply cross-cutting logic, like metrics logging and data validation, to multiple jobs or workflows. Adding it to each one is tedious, and integrating this logic tightly with each job makes the code messy and reduces flexibility.

One approach is to apply Aspect Oriented Programming, so that we can weave in orthogonal logic. Another approach is to define lifecycle hooks that allow custom logic to be plugged in at different phases.

wangzhihao commented 4 years ago

Approach one: [In progress] EDSL & Aspect Oriented Programming

The CLI's goal is to be comprehensive and unambiguous, while the EDSL's is to be neat and convention-driven. We would define some default actions globally, and jobs could override the default behavior when it does not fit. For example (a sketch follows the list):

  1. All jobs should support metrics logging by default.
  2. All jobs should support some basic validations, such as a not-null check and a uniqueness check, by default. These can be disabled where they do not apply.
  3. Users should provide only one template, which will fan out to the different regions.
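
A rough sketch of what such an EDSL could look like, written in TypeScript. All names here (JobDefaults, defineJob, the option keys) are hypothetical, invented only to illustrate the defaults-plus-override idea, not an existing Moonset API:

interface JobDefaults {
  metricsLogging: boolean;   // point 1: on for every job by default
  validations: string[];     // point 2: e.g. ['not-null', 'unique']
  regions: string[];         // point 3: one template fans out per region
}

const defaults: JobDefaults = {
  metricsLogging: true,
  validations: ['not-null', 'unique'],
  regions: ['us-east-1', 'eu-west-1'],
};

// Merge the global defaults with per-job overrides.
function defineJob(name: string, overrides: Partial<JobDefaults> & { sql: string }) {
  return { name, ...defaults, ...overrides };
}

// This job keeps metrics logging and region fan-out from the defaults,
// but disables the validations because they do not fit here.
const job = defineJob('daily-snapshot', {
  sql: 'insert overwrite table foo.pineapple select foo from foo.apple;',
  validations: [],
});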

Approach two: lifecycle hooks & plugins

We define some lifecycle hooks: [input.prev.hook, input.post.hook, output.prev.hook, output.post.hook, task.prev.hook, task.post.hook]. Each hook is an array of functions, and the target passes itself as the payload when invoking each function. For example, a function f inside input.prev.hook will be invoked as f(input, ...). The following steps illustrate how a plugin loads and executes (a sketch follows the list):

  1. The core package loads the plugin and discovers a hook function, e.g. f, at link time.
  2. The core package pushes the function into the corresponding hook array, e.g. input.prev.hook.
  3. For each target, e.g. input, and each function f inside its hook array, e.g. input.prev.hook, the core executes f(input, arguments). The arguments are what users pass in the input payload.
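
A minimal sketch of the registry these steps describe, in TypeScript. The names (HookFn, registerPlugin, runHooks) are assumptions for illustration, not the actual core package API:

type HookFn = (target: any, ...args: any[]) => void | Promise<void>;

// One array of functions per lifecycle hook.
const hooks: Record<string, HookFn[]> = {
  'input.prev.hook': [], 'input.post.hook': [],
  'task.prev.hook': [], 'task.post.hook': [],
  'output.prev.hook': [], 'output.post.hook': [],
};

// Steps 1 and 2: at load time the core discovers every hook function a
// plugin exports and pushes it into the matching hook array.
function registerPlugin(plugin: Partial<Record<string, HookFn>>) {
  for (const [name, fn] of Object.entries(plugin)) {
    if (fn && name in hooks) hooks[name].push(fn);
  }
}

// Step 3: for each target, run every registered function, passing the
// target itself plus the arguments from the user's job payload.
async function runHooks(name: string, target: any, args: any[] = []) {
  for (const f of hooks[name] ?? []) await f(target, ...args);
}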

One benefit of the plugin pattern is that we can separate the logic into different packages. The core package has no knowledge of the logic or the data; it just loads the logic from the plugin package and lets the plugin interpret the data.
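
For instance, a validation plugin package, in the spirit of @moonset/plugin-validation, might export nothing but hook functions (the export shape below is an assumption, not the package's real interface):

export const validationPlugin = {
  'input.prev.hook': (input: any, ...checks: string[]) => {
    // `checks` comes from the "arguments" field of the job payload,
    // e.g. ["not-null", "unique"].
    for (const check of checks) {
      console.log(`running ${check} on ${input.db}.${input.table}`);
      // ... run the actual validation query against the input here ...
    }
  },
};

The core would then only call something like registerPlugin(validationPlugin) from the sketch above; it never needs to know what not-null or unique mean.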

Here is a sample command with hook arguments:

npx moonset run \
    --plugin '@moonset/plugin-platform-emr'  \
    --plugin '@moonset/plugin-data-glue' \
    --plugin '@moonset/plugin-validation' \
    --plugin '@moonset/plugin-metrics-logging' \
    --job '{
    "input": [{
        "glue": { 
            "db": "foo", "table": "apple", "partition": {"region_id": "1", "snapshot_date": "2020-01-01"},
            "post-hook": [
                {"type": "logging"},
                {"type": "validation", "arguments": ["not-null", "unique"]}
            ]
        }
    }],
    "task": [{
        "hive": {
           "prev-hook": [
                {"type": "validation", "arguments": ["not-null", "unique"]}
            ],
            "sql": "insert overwrite table foo.pineapple partition (region_id=1, snapshot_date=\"2020-01-01\") select foo from foo.apple;",
            "post-hook": [
                {"type": "logging"}
            ]
        }
    }],
    "output": [{
        "glue": {
            "prev-hook": [
                {"type": "validation", "arguments": ["not-null", "unique"]}
            ],
            "db": "foo", "table": "pineapple", "partition": {"region_id": "1", "snapshot_date": "2020-01-01"},
            "post-hook": [
                {"type": "logging"}
            ]
        }
    }]
}'

Question: in the current approach two, a new cluster is started for each validation. How can we reuse the cluster?