MoonsetJS / Moonset

A data processing framework on top of AWS.
Apache License 2.0

Cross-cutting problems #68

Open wangzhihao opened 4 years ago

wangzhihao commented 4 years ago

We need to apply cross-cutting logic, like metrics logging and data validation, to multiple jobs or workflows. Adding it to each one is tedious, and integrating this logic tightly with each job makes the code messy and reduces flexibility.

One approach is to apply Aspect Oriented Programming, so that we can weave in orthogonal logic. Another approach is to define lifecycle hooks that allow custom logic to be plugged in at different phases.

wangzhihao commented 4 years ago

Approach one: [In progress] EDSL & Aspect Oriented Programming

The CLI's goal is to be comprehensive and unambiguous, while the EDSL's is to be neat and convention-driven. We would define some default actions globally, and jobs could override the default behavior when it does not fit. For example (a sketch follows the list):

  1. All jobs should support metrics logging by default.
  2. All jobs should support some basic validations, such as a not-null check and a uniqueness check, by default. These can be disabled where they do not apply.
  3. Users should provide only one template, which will fan out to the different regions.
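
A rough sketch of what such an EDSL could look like, written in TypeScript. All names here (JobDefaults, defineJob, the option keys) are hypothetical, invented only to illustrate the defaults-plus-override idea, not an existing Moonset API:

interface JobDefaults {
  metricsLogging: boolean;   // point 1: on for every job by default
  validations: string[];     // point 2: e.g. ['not-null', 'unique']
  regions: string[];         // point 3: one template fans out per region
}

const defaults: JobDefaults = {
  metricsLogging: true,
  validations: ['not-null', 'unique'],
  regions: ['us-east-1', 'eu-west-1'],
};

// Merge the global defaults with per-job overrides.
function defineJob(name: string, overrides: Partial<JobDefaults> & { sql: string }) {
  return { name, ...defaults, ...overrides };
}

// This job keeps metrics logging and region fan-out from the defaults,
// but disables the validations because they do not fit here.
const job = defineJob('daily-snapshot', {
  sql: 'insert overwrite table foo.pineapple select foo from foo.apple;',
  validations: [],
});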

Approach two: lifecycle hooks & plugins

We define some lifecycle hooks: [input.prev.hook, input.post.hook, output.prev.hook, output.post.hook, task.prev.hook, task.post.hook]. Each hook is an array of functions, and the target passes itself as the payload when invoking each function. For example, a function f inside input.prev.hook will be invoked as f(input, ...). The following steps illustrate how a plugin loads and executes (a sketch follows the list):

  1. The core package loads the plugin and discovers a hook function, e.g. f, at link time.
  2. The core package pushes the function into the corresponding hook array, e.g. input.prev.hook.
  3. For each target, e.g. input, and each function f inside its hook array, e.g. input.prev.hook, the core executes f(input, arguments). The arguments are what users pass in the input payload.
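
A minimal sketch of the registry these steps describe, in TypeScript. The names (HookFn, registerPlugin, runHooks) are assumptions for illustration, not the actual core package API:

type HookFn = (target: any, ...args: any[]) => void | Promise<void>;

// One array of functions per lifecycle hook.
const hooks: Record<string, HookFn[]> = {
  'input.prev.hook': [], 'input.post.hook': [],
  'task.prev.hook': [], 'task.post.hook': [],
  'output.prev.hook': [], 'output.post.hook': [],
};

// Steps 1 and 2: at load time the core discovers every hook function a
// plugin exports and pushes it into the matching hook array.
function registerPlugin(plugin: Partial<Record<string, HookFn>>) {
  for (const [name, fn] of Object.entries(plugin)) {
    if (fn && name in hooks) hooks[name].push(fn);
  }
}

// Step 3: for each target, run every registered function, passing the
// target itself plus the arguments from the user's job payload.
async function runHooks(name: string, target: any, args: any[] = []) {
  for (const f of hooks[name] ?? []) await f(target, ...args);
}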

One benefit of the plugin pattern is that we can separate the logic into different packages. The core package has no knowledge of the logic or the data; it just loads the logic from the plugin package and lets the plugin interpret the data.
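
For instance, a validation plugin package, in the spirit of @moonset/plugin-validation, might export nothing but hook functions (the export shape below is an assumption, not the package's real interface):

export const validationPlugin = {
  'input.prev.hook': (input: any, ...checks: string[]) => {
    // `checks` comes from the "arguments" field of the job payload,
    // e.g. ["not-null", "unique"].
    for (const check of checks) {
      console.log(`running ${check} on ${input.db}.${input.table}`);
      // ... run the actual validation query against the input here ...
    }
  },
};

The core would then only call something like registerPlugin(validationPlugin) from the sketch above; it never needs to know what not-null or unique mean.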

Here is a sample command with hook arguments:

npx moonset run \
    --plugin '@moonset/plugin-platform-emr'  \
    --plugin '@moonset/plugin-data-glue' \
    --plugin '@moonset/plugin-validation' \
    --plugin '@moonset/plugin-metrics-logging' \
    --job '{
    "input": [{
        "glue": { 
            "db": "foo", "table": "apple", "partition": {"region_id": "1", "snapshot_date": "2020-01-01"},
            "post-hook": [
                {"type": "logging"},
                {"type": "validation", "arguments": ["not-null", "unique"]}
            ]
        }
    }],
    "task": [{
        "hive": {
           "prev-hook": [
                {"type": "validation", "arguments": ["not-null", "unique"]}
            ],
            "sql": "insert overwrite table foo.pineapple partition (region_id=1, snapshot_date=\"2020-01-01\") select foo from foo.apple;",
            "post-hook": [
                {"type": "logging"}
            ]
        }
    }],
    "output": [{
        "glue": {
            "prev-hook": [
                {"type": "validation", "arguments": ["not-null", "unique"]}
            ],
            "db": "foo", "table": "pineapple", "partition": {"region_id": "1", "snapshot_date": "2020-01-01"},
            "post-hook": [
                {"type": "logging"}
            ]
        }
    }]
}'

Question: in the current approach two, a new cluster is started for each validation. How can we reuse the cluster?