datahub-project / datahub

The Metadata Platform for your Data Stack
https://datahubproject.io
Apache License 2.0

metadata-ingestion: Update great-expectations dependency from 0.15 to 0.16 #8115

Open vrld opened 1 year ago

vrld commented 1 year ago

Currently, DataHub depends on great-expectations <= 0.15.50, which is no longer actively maintained. The latest version is 0.16.13, which adds Fluent Datasources that make GX much more user friendly.

However, the new releases remove deprecated code that is used by DataHub, e.g., SQLAlchemyDataset/Datasource in the data profiler and probably some data-asset related stuff in the GX action.

Please update the dependency to 0.16 so that our users can use the new GX version with the datahub action.

github-actions[bot] commented 1 year ago

This issue is stale because it has been open for 30 days with no activity. If you believe this is still an issue on the latest DataHub release please leave a comment with the version that you tested it with. If this is a question/discussion please head to https://slack.datahubproject.io. For feature requests please use https://feature-requests.datahubproject.io

github-actions[bot] commented 1 year ago

This issue was closed because it has been inactive for 30 days since being marked as stale.

jelledv commented 11 months ago

Any update on this?

hsheth2 commented 10 months ago

This issue is on our radar, but unfortunately isn't a simple fix because of the level of customization and patching we've done in our existing GX-based data profilers. We've had some conversations with the GX team around what it would take to get this done, and are working to scope it accordingly.

DSchmidtDev commented 7 months ago

Any updates on this issue? GX is at 0.18 in the meantime :)

mateocolina commented 6 months ago

Any updates on this? They are about to move to 1.x.x :)

KulykDmytro commented 5 months ago

To make datahub work with recent airflow, GE needs to be bumped to at least 0.16.8.

Currently it clashes on urllib3: older GE versions pin it to 1.26, while airflow pins it to 2.x.

The same applies to botocore on python 3.10+.

Thus, great-expectations (>=0.15.12,<0.15.50) requires urllib3 (>=1.25.4,<1.27)

VladShuvalov commented 5 months ago

Are there any loose timelines around when this can be resolved?

cburroughs commented 5 months ago

I'm sorry, I know these sorts of "me too" comments are rarely of much help. I wanted to highlight that great-expectations at the pinned version has a variety of upper bounds constraints: https://raw.githubusercontent.com/great-expectations/great_expectations/0.15.50/requirements.txt

altair>=4.0.0,<4.2.1
pydantic>=1.10.4,<2.0
urllib3>=1.25.4,<1.27

And at least for us the problem isn't so much that "great expectations is old" but that being on the lower side of these transitive dependencies -- like the pydantic v1-v2 transitions -- has ever increasing opportunity costs. (In our particular transitive set pydantic <2 is also keeping us on pandas<2, which adds further to the expense.)

I know this doesn't change anything about the difficulty of migration, but I hope it clarifies the "cost" somewhat when this issue is next triaged.
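
The conflict described in the comments above can be sketched with a small range check. This is only an illustration, not how pip or poetry resolve dependencies: the GX bounds are taken from the requirements listed above, while the Airflow-side bound (`>=2.0,<3.0`) is an assumption standing in for "airflow pinned to 2.x", and the helper names are hypothetical.

```python
from typing import Tuple

Version = Tuple[int, int, int]
Range = Tuple[str, str]  # (inclusive lower bound, exclusive upper bound)


def parse(version: str) -> Version:
    """Parse 'X.Y' or 'X.Y.Z' into a 3-tuple, padding missing parts with 0."""
    parts = [int(p) for p in version.split(".")]
    parts += [0] * (3 - len(parts))
    return (parts[0], parts[1], parts[2])


def ranges_overlap(a: Range, b: Range) -> bool:
    """Return True if some version satisfies both half-open ranges."""
    lower = max(parse(a[0]), parse(b[0]))
    upper = min(parse(a[1]), parse(b[1]))
    return lower < upper


# From the GX 0.15.50 requirements quoted above: urllib3>=1.25.4,<1.27
GX_URLLIB3: Range = ("1.25.4", "1.27")
# Assumption: stands in for "airflow pinned to 2.x"; the exact bound may differ.
AIRFLOW_URLLIB3: Range = ("2.0", "3.0")

if __name__ == "__main__":
    # No urllib3 version can satisfy both ranges, so a single shared
    # environment cannot be resolved.
    print(ranges_overlap(GX_URLLIB3, AIRFLOW_URLLIB3))
```

The same check applies to the pydantic bound (`<2.0`) against any package in the environment that requires pydantic 2.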

am2222 commented 4 months ago

Any updates on this? The latest version of datahub_action for GX also needs to be updated to reflect the latest changes. It is a one-line change, though.

shirshanka commented 3 months ago

Just want to clarify which of these issues people are trying to solve:

  1. Use datahub_action with latest GX
  2. Install datahub ingestion sources inside one big venv (e.g. airflow)

am2222 commented 3 months ago

@shirshanka For the datahub action to work with the latest version of GX, I managed to just modify a couple of lines of code to fix the class constructor. But the bigger issue is that if we have airflow installed with the datahub plugin, we cannot use the latest version of GX in our DAGs due to a version conflict.

cburroughs commented 3 months ago

> Install datahub ingestion sources inside one big venv (e.g. airflow)

This one. We use a monorepo and minimizing the number of transitive dependency sets we are juggling maximizes the usefulness of said monorepo.

jskrzypek commented 2 months ago

@shirshanka It looks like the changes that introduced pydantic v2 support in great-expectations will be easy to backport to 0.15.50. If I do that, would datahub consider using them as a springboard to support pydantic v2 for plugins?

jskrzypek commented 2 months ago

If anyone wants it, I pushed it up to my fork, and here's the diff from 0.15.50. I am going to try patching datahub on a fork to consume this version of great expectations, and see if that works for us.

hsheth2 commented 1 month ago

We've done some work on our end in #11096. The main outcome of that is the GX validation action now lives in the acryl-datahub-gx-plugin package (published here https://pypi.org/project/acryl-datahub-gx-plugin/) instead of acryl-datahub[great-expectations], and supports newer versions of GX in addition to 0.15.x. It's currently an rc, pending a bit of manual testing we want to do.
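
For readers updating a checkpoint, the package move mostly shows up in the action's module path. The following is a hedged sketch of one `action_list` entry as a Python dict. The module paths `datahub_gx_plugin.action` and `datahub.integrations.great_expectations.action`, the class name `DataHubValidationAction`, and the `server_url` key are assumptions based on the plugin's documentation and should be checked against the installed version. The helper `datahub_action_entry` is hypothetical.

```python
from typing import Any, Dict

# Assumed module paths; verify against the package you have installed.
PLUGIN_MODULE = "datahub_gx_plugin.action"  # acryl-datahub-gx-plugin
LEGACY_MODULE = "datahub.integrations.great_expectations.action"  # acryl-datahub[great-expectations]


def datahub_action_entry(server_url: str, use_plugin: bool = True) -> Dict[str, Any]:
    """Build one entry for a GX checkpoint's action_list (hypothetical helper)."""
    return {
        "name": "datahub_action",
        "action": {
            "module_name": PLUGIN_MODULE if use_plugin else LEGACY_MODULE,
            "class_name": "DataHubValidationAction",
            "server_url": server_url,
        },
    }


if __name__ == "__main__":
    # http://localhost:8080 is a placeholder for your DataHub server URL.
    print(datahub_action_entry("http://localhost:8080"))
```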

For ingestion (e.g. snowflake/bigquery/redshift/other sql sources), we still depend on GX 0.15.50 for profiling, and that remains a particularly tricky dependency to loosen given the extent of the monkey-patching we've done to improve query efficiency.

If you're using only the Python SDKs, you can usually install acryl-datahub on its own, or add a limited set of plugins, e.g. acryl-datahub[sql-parser], and hence avoid our pin on GX.

We recommend not installing full ingestion sources into your main environment (e.g. avoid having a dependency on acryl-datahub[snowflake]), and recommend either using UI-based ingestion or isolating the programmatic ingestion pipelines using venvs. For Airflow, we have an example using the PythonVirtualenvOperator in our docs.
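
Outside Airflow, the same isolation idea can be sketched with the standard library alone: create a dedicated venv, install the ingestion extra there, and run the recipe with that venv's interpreter. This is a minimal sketch rather than the documented approach. The function names, the `snowflake` extra, and the paths are illustrative assumptions, and it assumes `python -m datahub ingest -c <recipe>` is available once the package is installed.

```python
import subprocess
import sys
import venv
from pathlib import Path
from typing import List


def venv_python(env_dir: Path) -> Path:
    """Return the interpreter path inside a venv (layout differs on Windows)."""
    if sys.platform == "win32":
        return env_dir / "Scripts" / "python.exe"
    return env_dir / "bin" / "python"


def ingestion_commands(env_dir: Path, recipe: Path, extra: str = "snowflake") -> List[List[str]]:
    """Build the install and ingest commands without running them."""
    python = str(venv_python(env_dir))
    return [
        # The GX 0.15.x pin only lands inside this venv, not the main environment.
        [python, "-m", "pip", "install", f"acryl-datahub[{extra}]"],
        [python, "-m", "datahub", "ingest", "-c", str(recipe)],
    ]


def run_isolated_ingestion(env_dir: Path, recipe: Path, extra: str = "snowflake") -> None:
    """Create the venv if needed, then run each command, failing on errors."""
    if not venv_python(env_dir).exists():
        venv.create(env_dir, with_pip=True)
    for command in ingestion_commands(env_dir, recipe, extra):
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # Placeholder paths; point these at your own venv directory and recipe.
    run_isolated_ingestion(Path(".venv-datahub-ingest"), Path("recipe.yml"))
```

Keeping command construction separate from execution makes the sketch easy to inspect or adapt to a scheduler without creating a venv first.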

However, I recognize that this isn't a full fix yet, and so I'll be leaving this issue open for now.

@jskrzypek those improvements sound great - we'd definitely be open to using the forked GX version that supports pydantic v2. The core acryl-datahub SDK already supports both pydantic 1 and 2, but many of our sources still require v1 because of the GX dependency.

jskrzypek commented 1 month ago

@hsheth2 cool! Please feel free to just take over my fork of GX if you want. It shouldn't require much ongoing maintenance, but I don't really have the time or bandwidth to keep up with it.

I am not sure if GX would consider adopting it themselves, but I imagine a request to do so will be better received if it comes from a project like datahub, since our company doesn't use GX directly.