trymzet opened this issue 3 years ago
So the above method is not possible, as Prefect doesn't allow storing any custom flow metadata. We could use one of the existing fields, eg. `description`, but that would be quite hacky IMO. As a workaround, we could use their separate KV Store feature, storing flow meta under the `flow_meta` key:
```python
from prefect.backend import get_key_value, set_key_value


def get_flow_meta(flow_name: str) -> dict:
    all_flows_meta = get_key_value("flow_meta")
    return all_flows_meta[flow_name]


def set_flow_meta(flow_name: str, meta: dict = None, overwrite: bool = True) -> None:
    all_flows_meta = get_key_value("flow_meta")
    meta_exists = bool(all_flows_meta.get(flow_name))
    if meta_exists and not overwrite:
        print("Not overwriting")
        return
    all_flows_meta.update({flow_name: meta})
    try:
        set_key_value("flow_meta", all_flows_meta)
    except ValueError:
        # Prefect caps the size of a KV Store value, so the shared key can fill up.
        print("The 'flow_meta' key is too big. Perhaps remove some old, unused meta?")
```
```python
>>> get_flow_meta("Hello, world!")
{
    "answer": 43,
    "question": "?"
}
```
This is very workaroundish. Probably the best solution is to use a metadata layer on top of the lake (a "lakehouse"), such as Dremio, which has a built-in data catalog and lineage tracking. We need it anyway for the capability of querying the lake with SQL (while getting rid of the Azure SQL db), so it makes the most sense to keep flow metadata there. We could also consider Amundsen, although it probably wouldn't be needed if we choose Dremio.
On hold
Eg. the extract flows should track source and destination in a metadata field, eg. `meta`. We can use `flow_view = FlowView.from_flow_name()` and then `flow_view.flow` to get other flows' graphs/metadata in eg. `Pipeline` flows.
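A rough sketch of that lookup, assuming a reachable Prefect (1.x) backend and a registered flow; the flow name `"Hello, world!"` is just a placeholder:

```python
from prefect.backend import FlowView

# Look up a registered flow by name; requires an authenticated Prefect backend.
flow_view = FlowView.from_flow_name("Hello, world!")

# Per the note above, flow_view.flow gives access to the flow object,
# so downstream "Pipeline" flows could inspect its task graph, eg.:
flow = flow_view.flow
print([task.name for task in flow.tasks])
```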