MarquezProject / marquez

Collect, aggregate, and visualize a data ecosystem's metadata
https://marquezproject.ai
Apache License 2.0

Add extension point for lineage event validation #2046

Open julienledem opened 2 years ago

julienledem commented 2 years ago

Some users need to validate lineage events. The goal of this feature is to make it easy for them to add validation logic for incoming events and accept only valid ones. Proposal: add a mechanism for placing a Python HTTP proxy in front of the OpenLineage ingestion endpoint (HTTP POST).

julienledem commented 2 years ago

@mobuchowski Do you have a recommendation for adding a simple Python proxy in front of the OL endpoint?

mobuchowski commented 2 years ago

With these requirements, I'd just write something based on two very popular Python libraries, flask and requests. Something like this:

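import os

from flask import Flask, request
from requests import post

app = Flask(__name__)
# Marquez lineage endpoint that valid events are forwarded to
MARQUEZ_URI = os.getenv('MARQUEZ_URI', 'http://marquez:80/api/v1/lineage')

def validate(event: dict) -> bool:
    ...

@app.route('/api/v1/lineage', methods=['POST'])
def proxy():
    event = request.get_json(silent=True)
    # Reject malformed or invalid events before they reach Marquez
    if event is None or not validate(event):
        return b'', 400
    # Forward the valid event and relay the response from Marquez
    response = post(MARQUEZ_URI, json=event)
    return response.content, response.status_code

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)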
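
The validate stub above is left open on purpose. As a rough illustration only, a minimal structural check might look like the sketch below. The set of required keys and the eventTime check are assumptions chosen for the example, not a rule taken from Marquez or from this thread; a real deployment would substitute its own policy.

```python
from datetime import datetime

# Assumption: these are the top-level keys this hypothetical policy requires.
# Adjust the tuple to match the OpenLineage spec version and rules in use.
REQUIRED_KEYS = ("eventTime", "producer", "schemaURL", "run", "job")


def validate(event: dict) -> bool:
    """Return True only if the event passes basic structural checks."""
    if not isinstance(event, dict):
        return False
    if any(key not in event for key in REQUIRED_KEYS):
        return False
    run, job = event["run"], event["job"]
    # A run must carry an identifier; a job must carry a namespace and a name.
    if not isinstance(run, dict) or not run.get("runId"):
        return False
    if not isinstance(job, dict) or not job.get("namespace") or not job.get("name"):
        return False
    try:
        # Older Python versions reject a trailing "Z", so normalize it first.
        datetime.fromisoformat(str(event["eventTime"]).replace("Z", "+00:00"))
    except ValueError:
        return False
    return True
```

Because the function only inspects the parsed JSON body, it can be unit-tested without running the proxy or Marquez.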

wslulciuc commented 2 years ago

@mobuchowski has a sound solution for standing up a proxy in front of the Marquez HTTP API server (which accepts POST calls to /api/v1/lineage). I wanted to provide a diagram (below) outlining the deployment on k8s:

Marquez with Proxy

Note: I used port 5005 for the proxy in the diagram above as an example.

See: https://kubernetes.github.io/ingress-nginx/user-guide/ingress-path-matching/
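
One way to check a deployment like this is to post an event to the proxy instead of to Marquez directly and inspect the status code that comes back. The sketch below uses only the standard library. The host name localhost is an assumption; the port is the example port 5005 mentioned above, and the helper names build_request and send are hypothetical.

```python
import json
import urllib.error
import urllib.request

# Assumption: the proxy is reachable on localhost at the example port 5005.
PROXY_URL = "http://localhost:5005/api/v1/lineage"


def build_request(event: dict, url: str = PROXY_URL) -> urllib.request.Request:
    """Build a JSON POST request carrying a lineage event."""
    body = json.dumps(event).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(event: dict, url: str = PROXY_URL) -> int:
    """Send the event and return the HTTP status code from the proxy."""
    try:
        with urllib.request.urlopen(build_request(event, url)) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx responses, so a rejected event lands here.
        return err.code
```

With a proxy that rejects invalid events with a 400, send() returns that code for a rejected event, and otherwise returns whatever status the proxy relays.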

wslulciuc commented 2 years ago

@julienledem: should the result of this issue be a design doc outlining our recommendation to stand up a proxy (and a couple of alternatives) in front of Marquez?