PrefectHQ / prefect

Prefect is a workflow orchestration framework for building resilient data pipelines in Python.
https://prefect.io
Apache License 2.0
17.49k stars 1.64k forks

Design/discuss integration of Great Expectations with Prefect (potentially in Results feature) #2436

Closed lauralorenz closed 3 years ago

lauralorenz commented 4 years ago

Discussion issue to figure out a plan to make Great Expectations integrate seamlessly with Prefect.

In general we are assuming that people want to be able to configure Great Expectations data validators to track onto pieces of data passed through a flow, and want the pipeline to manage actually calling the GE API for them and surfacing errors/alerts/failures in some way that feels totally integrated into Prefect.

Some user experience questions to consider:

a) What is an example of the Python API a Core user would use to attach a single Great Expectations assertion to a task? Multiple? The same one to many tasks?

a1) Do they even attach them to tasks? Do they attach them to something else (i.e., Results)?

a2) What about a Core server/Cloud user: is there ever a world where GE validators are configured directly through the UI?

b) How are the validation results from great expectations surfaced in the Prefect logs? Are they visualized somehow in the UI?

c) Are there assumptions or conventions Prefect should make/support to autodetect GE assertions on disk? How does this relate to the "Expectations on rails" framework in beta in GE?

c1) How does Prefect (and --dun dun dun-- dask) play with the fact that Great Expectations is mainly configured using file-based configuration?

d) Where/how in the pipeline do we do validation checks for people? Where/how should users configure whether this check is on or off, besides removing the validator from the code (for example, with a global configuration toggle)?

e) If people call Great Expectations validators themselves in a task via a Prefect API such as Result.validate(), do we do anything special for them with the output?

f) Can we provide better, Prefect-based semantics for what to do when a GE validation fails, since the pipeline can control the execution flow in reaction to the failure (e.g., potentially allow flow configuration for retrying up a task tree when a downstream validator fails)?

g) Can we integrate the GE data docs metadata into our UI somehow (the simplest case is to link out, though this can get infinitely more fancy)?

Curveball question: Is there a need (either in addition to or in place of integrated pipeline checks) for an abstract GE task in the task library that can be easily used as a terminal/reference task?

lauralorenz commented 4 years ago

Based on conversations last week, tl;dr: IMHO we should move forward immediately with only a Task Library task that exposes ad-hoc validation, configurable with user-supplied data sources and validators. There is some good advice on how to integrate Great Expectations as a node in a pipeline framework in this way in their docs here. Along with that, we should provide a tutorial/docs here and in the GE docs that explain how to use the Prefect task to add GE validation to a pipeline. UPDATE: Issue for this is at https://github.com/PrefectHQ/prefect/issues/2489

This is motivated mostly by conversations in Slack, where a few users mentioned they do use Great Expectations in their Prefect pipelines, either now or in test, but mostly for ad-hoc/final testing of data at the end of an ETL, not necessarily regularly along the way (though there was recognition that it could be useful for complex data intermediates). This motivates the task library use case more than the work to embed validation throughout the pipeline, since validators can be integrated as needed at the end, or for the few complex data intermediates, as their own task nodes.
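To make the "validation as its own task node" idea concrete, here is a minimal, stdlib-only sketch. It deliberately does not use the Prefect or Great Expectations APIs: `Expectation`, `ValidationError`, `validate_rows`, `extract`, and `transform` are all hypothetical names. In a real flow the body of `validate_rows` would presumably call Great Expectations, and each function would be wrapped as a Prefect task.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, object]


@dataclass
class Expectation:
    """Hypothetical stand-in for a single data expectation (not a GE class)."""
    name: str
    check: Callable[[List[Row]], bool]


class ValidationError(Exception):
    """Raised so an orchestrator could mark the validation node as failed."""

    def __init__(self, failed: List[str]):
        super().__init__("failed expectations: " + ", ".join(failed))
        self.failed = failed


def validate_rows(rows: List[Row], expectations: List[Expectation]) -> List[Row]:
    """Terminal validation node: return the data unchanged or raise."""
    failed = [e.name for e in expectations if not e.check(rows)]
    if failed:
        raise ValidationError(failed)
    return rows


def extract() -> List[Row]:
    # Toy source data with one missing amount.
    return [{"id": 1, "amount": 10.0}, {"id": 2, "amount": None}]


def transform(rows: List[Row]) -> List[Row]:
    # Replace missing amounts with 0.0 so the final check can pass.
    return [
        {**r, "amount": r["amount"] if r["amount"] is not None else 0.0}
        for r in rows
    ]


EXPECTATIONS = [
    Expectation("amount_not_null",
                lambda rows: all(r["amount"] is not None for r in rows)),
    Expectation("id_unique",
                lambda rows: len({r["id"] for r in rows}) == len(rows)),
]

if __name__ == "__main__":
    # Validation runs once, at the end of the ETL, as its own step.
    validated = validate_rows(transform(extract()), EXPECTATIONS)
    print(f"{len(validated)} rows passed validation")
```

Because the check is an ordinary node, it can sit at the end of the pipeline or after any single complex intermediate without touching the other tasks.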

I personally don't think we have enough information or motivation yet to pursue the heavier integrations, but I have edited the OP with questions related to the design of a heavier integration based on conversations to date. I am leaving the issue open as I collect more information, and I welcome discussion!

lcorneliussen commented 4 years ago

It would be nice to see a deeper integration where the run results are shown in Prefect's UI. Right now it is quite a hassle to get the data docs up on some static site, configure access rights, ...

What are the current plans?

lcorneliussen commented 4 years ago

Hi @lauralorenz,

I got the simple integration running and could imagine contributing by showing more information about GE runs from within Prefect. A first step could be log output listing which validations failed, or a link to the data_docs (published to S3 in my case).

lauralorenz commented 4 years ago

@lcorneliussen awesome! FWIW I personally imagined the failed validations object that is returned from the GE prefect task might be parsed apart in a state handler (either custom by the user or one we could ship in the GE module). A link to the data docs (if configured) as a log, and/or somehow conditionally shown in the server nav as part of https://github.com/prefecthq/ui would be super helpful I think!
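A rough sketch of that state-handler idea follows. It is stdlib-only and rests on assumptions: `State` is a hypothetical stand-in rather than Prefect's class, the `(task, old_state, new_state)` signature mirrors Prefect's handler convention, the result dict is assumed to follow the layout of a GE JSON validation result (`results`, `success`, `expectation_config`), and the `data_docs_url` argument and example URL are made up for illustration.

```python
import logging
from dataclasses import dataclass
from typing import List, Optional

logger = logging.getLogger("ge_handler_sketch")


@dataclass
class State:
    """Hypothetical stand-in for an orchestrator state (not Prefect's class)."""
    failed: bool
    result: Optional[dict] = None


def failed_expectations(validation_result: dict) -> List[str]:
    """Describe each failed expectation, assuming a GE-style JSON layout."""
    lines = []
    for item in validation_result.get("results", []):
        if item.get("success"):
            continue
        config = item.get("expectation_config", {})
        column = config.get("kwargs", {}).get("column", "<table>")
        lines.append(f"{config.get('expectation_type', 'unknown')} on {column}")
    return lines


def log_failed_validations(task, old_state: State, new_state: State,
                           data_docs_url: Optional[str] = None) -> State:
    """Log failed expectations and an optional data docs link on failure."""
    if new_state.failed and isinstance(new_state.result, dict):
        for line in failed_expectations(new_state.result):
            logger.error("Validation failed: %s", line)
        if data_docs_url:
            logger.error("Data docs: %s", data_docs_url)
    # Handlers return the state the task should end up in; here it is unchanged.
    return new_state


# Made-up sample shaped like a validation result with one failing check.
SAMPLE_RESULT = {
    "success": False,
    "results": [
        {"success": True,
         "expectation_config": {"expectation_type": "expect_column_to_exist",
                                "kwargs": {"column": "id"}}},
        {"success": False,
         "expectation_config": {
             "expectation_type": "expect_column_values_to_not_be_null",
             "kwargs": {"column": "amount"}}},
    ],
}

if __name__ == "__main__":
    logging.basicConfig(level=logging.ERROR)
    log_failed_validations(None, State(failed=False),
                           State(failed=True, result=SAMPLE_RESULT),
                           data_docs_url="https://example.com/data_docs/index.html")
```

Since a handler is normally called with only three arguments, the data docs URL could be bound ahead of time with `functools.partial`.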

lauralorenz commented 4 years ago

Also, pulling over some updates/talk from Slack DMs for general posterity:

My opinion on the next major step toward a more "heavy integration" (which was abandoned for lack of sufficient motivation at the end of https://github.com/PrefectHQ/prefect/issues/2436#issuecomment-623498949) is represented in https://github.com/PrefectHQ/prefect/issues/3057, though I personally won't be working on it any time soon. If any other GE fans are hanging around this or that issue, I'm definitely open to talking about it!

lcorneliussen commented 4 years ago

Some inspiration: https://greatexpectations.io/blog/dagster-integration-announcement/