awslabs / python-deequ

Python API for Deequ
Apache License 2.0

feat(Spark3Support): Adding support to pyspark 3.0 #41

Closed chethanuk closed 3 years ago

chethanuk commented 3 years ago

Issue #, if available: Possibly https://github.com/awslabs/deequ/issues/310 #4

Description of changes: Adds support for PySpark 3.0. I ran into a few problems with the current setup.py files, so this also migrates from setup.py to Poetry, which is one of the best Python dependency management and packaging tools.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

chethanuk commented 3 years ago

cc: @gucciwang @jaoanan1126 Once this is merged, I will work on the pytest migration (to simplify the tests).

rubenssoto commented 3 years ago

Any updates on this? I want to use pydeequ in my project.

Thank you

gucciwang commented 3 years ago

Working on this! Doesn't seem to be passing tests at the moment, will reach out!

chethanuk commented 3 years ago

If required, let's merge #40 first, since it adds all the workflows.

Or install Poetry locally, install the dependencies, and then see https://github.com/awslabs/python-deequ/pull/40/files#diff-f279a155e67df88ab450450a567b747e2db4e3db42eeb2b241a86693eb2a214cR133
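
For anyone reproducing this locally, here is a minimal sketch of the steps as a small Python wrapper. The two `poetry run` commands are copied from the output in this comment; `poetry install` is assumed to be the dependency-install step, and `STEPS` and `run_steps` are hypothetical names used only for illustration. The `--reruns` flags need the pytest-rerunfailures plugin, which appears in the plugin list of the test run.

```python
import shlex
import subprocess

# The two `poetry run` commands come from the local output in this thread.
# `poetry install` is an assumption about how the dependencies get installed.
STEPS = [
    "poetry install",
    "poetry run coverage run -m pytest --reruns 5 --reruns-delay 30",
    "poetry run coverage report",
]


def run_steps(steps=STEPS, dry_run=True):
    """Return the parsed commands; execute them in order unless dry_run is True."""
    commands = [shlex.split(step) for step in steps]
    if not dry_run:
        for argv in commands:
            # check=True raises CalledProcessError at the first failing step.
            subprocess.run(argv, check=True)
    return commands


# Dry run: only print what would be executed.
for argv in run_steps():
    print(" ".join(argv))
```

Calling `run_steps(dry_run=False)` from the repository root would actually execute the commands, assuming Poetry is already on the PATH.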

My local output:

> poetry version
pydeequ 0.1.8
> poetry run coverage run -m pytest --reruns 5 --reruns-delay 30
====================================================== test session starts ======================================================
platform darwin -- Python 3.8.7, pytest-6.2.4, py-1.10.0, pluggy-0.13.1
rootdir: /Users/chethanuk/Work/Github/deequ/python-deequ
plugins: flake8-1.0.7, rerunfailures-9.1.1, cov-2.12.0
collected 156 items

tests/test_analyzers.py .........................sssssssssssssssssssss                                                               [ 29%]
tests/test_anomaly_detection.py .s...ss                                                                                              [ 33%]
tests/test_checks.py .......ssssssssssssssssssssssssssssssssss................................                                       [ 80%]
tests/test_pandas_utils.py ......                                                                                                    [ 84%]
tests/test_profiles.py ...                                                                                                           [ 86%]
tests/test_pydeequ.py .                                                                                                              [ 87%]
tests/test_repository.py ....ss....                                                                                                  [ 93%]
tests/test_scala_utils.py ..                                                                                                         [ 94%]
tests/test_suggestions.py ........                                                                                                   [100%]

============================================================= warnings summary =============================================================
tests/test_pandas_utils.py::TestPandasUtils::test_p2s_analyzer
tests/test_pandas_utils.py::TestPandasUtils::test_p2s_profiles
tests/test_pandas_utils.py::TestPandasUtils::test_p2s_suggestion
tests/test_pandas_utils.py::TestPandasUtils::test_p2s_verification
tests/test_pandas_utils.py::TestPandasUtils::test_s2p_analyzers
tests/test_pandas_utils.py::TestPandasUtils::test_s2p_verification
  /Users/chethanuk/Work/Github/deequ/python-deequ/pydeequ/pandas_utils.py:26: UserWarning: WARNING: 
You passed in a Pandas DF, so we will be using our experimental utility to convert it to a PySpark DF.
    warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/warnings.html
========================================== 96 passed, 60 skipped, 6 warnings in 115.31s (0:01:55) ==========================================
> poetry run coverage report
Name                                                                                Stmts   Miss  Cover
-------------------------------------------------------------------------------------------------------
.venv/lib/python3.8/site-packages/_pytest/__init__.py                                   5      2    60%
.venv/lib/python3.8/site-packages/_pytest/_argcomplete.py                              37     23    38%
.venv/lib/python3.8/site-packages/_pytest/_code/__init__.py                            10      0   100%
.venv/lib/python3.8/site-packages/_pytest/_code/code.py                               699    380    46%

> spark-submit --version
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.2
      /_/

Using Scala version 2.12.10, OpenJDK 64-Bit Server VM, 1.8.0_282
Branch HEAD
Compiled by user centos on 2021-02-16T04:53:13Z
Revision 648457905c4ea7d00e3d88048c63f360045f0714
Url https://gitbox.apache.org/repos/asf/spark.git

cenh commented 3 years ago

Would love support for PySpark 3.0 and 3.1. Any idea when this will be available?

chethanuk commented 3 years ago

Ping! @gucciwang