Add autodocumentation capabilities to extract/transform flows

trymzet commented 3 years ago

Eg. the extract flows should track source and destination in a metadata field, eg. meta.

We can use flow_view = FlowView.from_flow_name() and then flow_view.flow to get other flows' graphs/metadata in eg Pipeline flows.

trymzet commented 3 years ago

So the above method is not possible as Prefect doesn't allow storing any custom flow metadata. We could use the existing fields, eg. description, but that would be quite hacky IMO. As a workaround, we could use their separate KV Store feature, storing flow meta under the flow_meta key:

from prefect.backend import get_key_value, set_key_value

def get_flow_meta(flow_name: str) -> dict:
    all_flows_meta = get_key_value("flow_meta")
    flow_meta = all_flows_meta[flow_name]
    return flow_meta

def set_flow_meta(flow_name: str, meta: dict = None, overwrite: bool = True) -> None:
    all_flows_meta = get_key_value("flow_meta")
    meta_exists = bool(all_flows_meta.get(flow_name))

    if meta_exists and not overwrite:
        print("Not overwriting")
        return

    flow_meta = {flow_name: meta}
    all_flows_meta.update(flow_meta)

    try:
        set_key_value("flow_meta", all_flows_meta)
    except ValueError as e:
        print("The 'flow_meta' key is too big. Perhaps remove some old, unused meta?")
        return

get_flow_meta("Hello, world!")

{
   "answer": 43,
   "question": "?"
}

trymzet commented 3 years ago

This is very workaroundish. Probably the best solution is to use a metadata layer on top of the lake ("lakehouse"), such as Dremio, which has a built-in data catalog and lineage. We need it anyway for the capability of querying the lake with SQL (while getting rid of Azure SQL db), so it makes most sense to have it there. Can also consider Amundsen, although it probably wouldn't be needed if we choose Dremio.

trymzet commented 3 years ago

On hold

dyvenia / viadot

Add autodocumentation capabilities to extract/transform flows #96