FRosner / drunken-data-quality

Spark package for checking data quality
Apache License 2.0

PySpark API #91

Closed FRosner closed 8 years ago

FRosner commented 8 years ago

ToDos

FRosner commented 8 years ago

@Gerrrr it doesn't seem too complex, looking at http://stackoverflow.com/questions/36023860/how-to-use-a-scala-class-inside-pyspark

Gerrrr commented 8 years ago

Small prototype:

$ pyspark --driver-class-path ./drunken-data-quality-assembly-3.2.0-SNAPSHOT.jar

from uuid import uuid4

class Check(object):
    def __init__(self, df):
        self.df = df
        self.jvm = df._sc._jvm
        displayName = self.jvm.scala.Option.empty()
        cacheMethod = self.jvm.scala.Option.empty()
        constraints = self.jvm.scala.collection.immutable.List.empty()
        id = str(uuid4())  # call uuid4() to get a fresh UUID instead of the function's repr
        self.jvmCheck = self.jvm.de.frosner.ddq.core.Check(df._jdf,
                                                      displayName,
                                                      cacheMethod,
                                                      constraints,
                                                      id)
    def isNeverNull(self, columnName):
        self.jvmCheck = self.jvmCheck.isNeverNull(columnName)
        return self
    def run(self, reporters):
        jvmReporters = self.jvm.scala.collection.JavaConversions.asScalaBuffer(reporters).toList()
        self.jvmCheck.run(jvmReporters)

rdd = sc.parallelize([(1, "a"), (2, "b"), (3, "c")])
df = sqlContext.createDataFrame(rdd)
jvm = sc._jvm  # Py4J view of the driver JVM, the same object Check reaches via df._sc._jvm
markdownReporter = jvm.de.frosner.ddq.reporters.MarkdownReporter(jvm.System.out)

check = Check(df)
check.isNeverNull("_1").run([markdownReporter])

Output:

**Checking [_1: bigint, _2: string]**

It has a total number of 2 columns and 3 rows.

- *SUCCESS*: Column _1 is never null.
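
The prototype reassigns self.jvmCheck on every call, which suggests that the Scala method returns a new Check instead of mutating the existing one. A minimal sketch of that chaining pattern without Spark, assuming this reading is right: FakeJvmCheck is a hypothetical stand-in for the JVM object and is not part of DDQ.

```python
class FakeJvmCheck(object):
    """Hypothetical stand-in for the JVM-side Check; not part of DDQ."""

    def __init__(self, constraints=()):
        self.constraints = tuple(constraints)

    def isNeverNull(self, columnName):
        # Return a new object instead of mutating this one.
        return FakeJvmCheck(self.constraints + (("isNeverNull", columnName),))


class Check(object):
    """Python-side wrapper that keeps a reference to the latest JVM object."""

    def __init__(self, jvmCheck):
        self.jvmCheck = jvmCheck

    def isNeverNull(self, columnName):
        # Swap in the new underlying object, then return self so calls can be chained.
        self.jvmCheck = self.jvmCheck.isNeverNull(columnName)
        return self


check = Check(FakeJvmCheck()).isNeverNull("_1").isNeverNull("_2")
print(check.jvmCheck.constraints)  # (('isNeverNull', '_1'), ('isNeverNull', '_2'))
```

Returning self keeps the Python API fluent while the wrapper always points at the most recent underlying object.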
FRosner commented 8 years ago

I cannot commit to your branch so here's the patch that adds the pythonItAssembly task to SBT, @Gerrrr.

When you run it, it will run assembly and then put the fat jar in python/drunken-data-quality.jar.

From 01288b9af7d3ea25d1941f1d47de02815d8f3a4b Mon Sep 17 00:00:00 2001
From: Frank Rosner <frank@fam-rosner.de>
Date: Wed, 6 Jul 2016 19:43:23 +0200
Subject: [PATCH] #91 python assembly task to provide fat jar for python tests

---
 build.sbt | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/build.sbt b/build.sbt
index 47bec15..3824233 100644
--- a/build.sbt
+++ b/build.sbt
@@ -35,3 +35,7 @@ fork := true
 javaOptions += "-Xmx2G"

 javaOptions += "-XX:MaxPermSize=512m"
+
+lazy val pythonItAssembly = taskKey[Unit]("python-it-assembly")
+
+pythonItAssembly <<= assembly map { (asm) => s"cp ${asm.getAbsolutePath()} python/drunken-data-quality.jar" ! }
--
2.6.3

FRosner commented 8 years ago

Do you need any help @Gerrrr ?

Gerrrr commented 8 years ago

Hi @FRosner, no, I just did not have time last week. I will work on the issue this week.

FRosner commented 8 years ago

@Gerrrr Please let me know when the merge request is ready to be merged, as I would also like to change a few minor things in the readme.

FRosner commented 8 years ago

Following your instructions from http://peterdowns.com/posts/first-time-with-pypi.html @Gerrrr, I tried

frosner:python/ (issue/91-2*) $ python setup.py register -r pypitest
running register
running egg_info
writing requirements to pyddq.egg-info/requires.txt
writing pyddq.egg-info/PKG-INFO
writing top-level names to pyddq.egg-info/top_level.txt
writing dependency_links to pyddq.egg-info/dependency_links.txt
[pbr] Reusing existing SOURCES.txt
running check
Registering pyddq to https://testpypi.python.org/pypi
Server response (400): Invalid version, cannot use PEP 440 local versions on PyPI.

Can we try to debug together?

FRosner commented 8 years ago

Actually, this guide explicitly says that you should not use register to do it: https://packaging.python.org/distributing/#uploading-your-project-to-pypi

Gerrrr commented 8 years ago

@FRosner, sure! Let's debug it tomorrow.

Gerrrr commented 8 years ago

The version used by setup.py is generated by egg_info (the third line in your output above) and derived from the status of your repository:

✗ cat pyddq.egg-info/PKG-INFO
Metadata-Version: 1.1
Name: pyddq
Version: 3.1.0.post0.dev12+ng367b68b.dirty
Summary: Python binding to Drunken Data Quality

In order to publish a package on PyPI, its version must comply with the PEP 440 scheme for public version identifiers.

Currently the version starts with 3.1.0 because that is the latest tag in the git repo; post0.dev12+ng367b68b identifies the commit, and dirty means that there are uncommitted changes.

Please commit your changes, make a tag (e.g. 3.2.0) and try again.
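
To see why the server rejected that version: in PEP 440, everything after a "+" is the local version label, and that label is what PyPI refuses. A simplified sketch that only looks for the label (the helper names are made up for illustration, and this is not a full PEP 440 validator):

```python
def split_local_version(version):
    """Return (public_part, local_label); local_label is None when there is no '+'."""
    public, sep, local = version.partition("+")
    return public, (local if sep else None)


def has_no_local_label(version):
    """True if the version carries no local label (it may still be invalid for other reasons)."""
    return split_local_version(version)[1] is None


dirty = "3.1.0.post0.dev12+ng367b68b.dirty"
print(split_local_version(dirty))   # ('3.1.0.post0.dev12', 'ng367b68b.dirty')
print(has_no_local_label(dirty))    # False
print(has_no_local_label("3.2.0"))  # True
```

A build from a clean checkout that sits exactly on a tag should produce a version like the last one, with no "+" part.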

FRosner commented 8 years ago

Did you call me dirty?

FRosner commented 8 years ago

frosner:python/ (issue/91-2) $ python setup.py register -r pypitest
running register
running egg_info
writing requirements to pyddq.egg-info/requires.txt
writing pyddq.egg-info/PKG-INFO
writing top-level names to pyddq.egg-info/top_level.txt
writing dependency_links to pyddq.egg-info/dependency_links.txt
[pbr] Reusing existing SOURCES.txt
running check
Registering pyddq to https://testpypi.python.org/pypi
Server response (200): OK

👍

FRosner commented 8 years ago

@Gerrrr but when I try to pull it I get:

pip install -i https://testpypi.python.org/pypi pyddq
Collecting pyddq
  Could not find a version that satisfies the requirement pyddq (from versions: )
No matching distribution found for pyddq


FRosner commented 8 years ago

Sorry I should RTFM...

frosner:python/ (issue/91-2) $ python setup.py sdist upload -r pypitest
running sdist
running egg_info
writing requirements to pyddq.egg-info/requires.txt
writing pyddq.egg-info/PKG-INFO
writing top-level names to pyddq.egg-info/top_level.txt
writing dependency_links to pyddq.egg-info/dependency_links.txt
[pbr] Processing SOURCES.txt
[pbr] In git context, generating filelist from git
warning: no files found matching 'AUTHORS'
warning: no files found matching 'ChangeLog'
warning: no previously-included files found matching '.gitreview'
warning: no previously-included files matching '*.pyc' found anywhere in distribution
writing manifest file 'pyddq.egg-info/SOURCES.txt'
running check
creating pyddq-3.2.0
creating pyddq-3.2.0/pyddq
creating pyddq-3.2.0/pyddq.egg-info
making hard links in pyddq-3.2.0...
hard linking README.rst -> pyddq-3.2.0
hard linking setup.cfg -> pyddq-3.2.0
hard linking setup.py -> pyddq-3.2.0
hard linking pyddq/__init__.py -> pyddq-3.2.0/pyddq
hard linking pyddq/core.py -> pyddq-3.2.0/pyddq
hard linking pyddq/jvm_conversions.py -> pyddq-3.2.0/pyddq
hard linking pyddq/reporters.py -> pyddq-3.2.0/pyddq
hard linking pyddq/streams.py -> pyddq-3.2.0/pyddq
hard linking pyddq.egg-info/PKG-INFO -> pyddq-3.2.0/pyddq.egg-info
hard linking pyddq.egg-info/SOURCES.txt -> pyddq-3.2.0/pyddq.egg-info
hard linking pyddq.egg-info/dependency_links.txt -> pyddq-3.2.0/pyddq.egg-info
hard linking pyddq.egg-info/not-zip-safe -> pyddq-3.2.0/pyddq.egg-info
hard linking pyddq.egg-info/requires.txt -> pyddq-3.2.0/pyddq.egg-info
hard linking pyddq.egg-info/top_level.txt -> pyddq-3.2.0/pyddq.egg-info
copying setup.cfg -> pyddq-3.2.0
Writing pyddq-3.2.0/setup.cfg
Creating tar archive
removing 'pyddq-3.2.0' (and everything under it)
running upload
Submitting dist/pyddq-3.2.0.tar.gz to https://testpypi.python.org/pypi
Server response (200): OK

FRosner commented 8 years ago

Now the problem is:

sudo docker run -it -p 8000:8000 python:alpine /bin/sh
pip install -i https://testpypi.python.org/pypi pyddq
Collecting pyddq
  Downloading https://testpypi.python.org/packages/67/aa/941614e22736a240c5cc0c878e9e16b6cf30090018971254c1c61641434a/pyddq-3.2.0.tar.gz
    Complete output from command python setup.py egg_info:
    Download error on https://pypi.python.org/simple/pyscaffold/: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:645) -- Some packages may not be found!
    Couldn't find index page for 'pyscaffold' (maybe misspelled?)
    Download error on https://pypi.python.org/simple/: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:645) -- Some packages may not be found!
    No local packages or download links found for pyscaffold<2.6a0,>=2.5a0
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-build-9wom9_tk/pyddq/setup.py", line 64, in <module>
        setup_package()
      File "/tmp/pip-build-9wom9_tk/pyddq/setup.py", line 59, in setup_package
        "integration_test": IntegrationTestCommand
      File "/usr/local/lib/python3.5/distutils/core.py", line 108, in setup
        _setup_distribution = dist = klass(attrs)
      File "/usr/local/lib/python3.5/site-packages/setuptools/dist.py", line 268, in __init__
        self.fetch_build_eggs(attrs['setup_requires'])
      File "/usr/local/lib/python3.5/site-packages/setuptools/dist.py", line 313, in fetch_build_eggs
        replace_conflicting=True,
      File "/usr/local/lib/python3.5/site-packages/pkg_resources/__init__.py", line 836, in resolve
        dist = best[req.key] = env.best_match(req, ws, installer)
      File "/usr/local/lib/python3.5/site-packages/pkg_resources/__init__.py", line 1081, in best_match
        return self.obtain(req, installer)
      File "/usr/local/lib/python3.5/site-packages/pkg_resources/__init__.py", line 1093, in obtain
        return installer(requirement)
      File "/usr/local/lib/python3.5/site-packages/setuptools/dist.py", line 380, in fetch_build_egg
        return cmd.easy_install(req)
      File "/usr/local/lib/python3.5/site-packages/setuptools/command/easy_install.py", line 623, in easy_install
        raise DistutilsError(msg)
    distutils.errors.DistutilsError: Could not find suitable distribution for Requirement.parse('pyscaffold<2.6a0,>=2.5a0')

    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-9wom9_tk/pyddq/
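
The traceback fails while setuptools fetches the setup_requires dependency (pyscaffold) over HTTPS. One possible cause, which is an assumption and not something confirmed in this thread, is that the interpreter in the image cannot find CA certificates. A small diagnostic sketch using only the standard library (describe_ca_paths is a made-up helper name):

```python
import os
import ssl


def describe_ca_paths():
    """Report where this interpreter looks for CA certificates and whether those paths exist."""
    paths = ssl.get_default_verify_paths()
    report = {}
    for name in ("cafile", "capath", "openssl_cafile", "openssl_capath"):
        location = getattr(paths, name)
        # location may be None when no usable path is configured.
        report[name] = (location, bool(location) and os.path.exists(location))
    return report


for name, (location, exists) in sorted(describe_ca_paths().items()):
    print("%-16s %s (exists: %s)" % (name, location, exists))
```

If none of the reported paths exist inside the container, that would be consistent with the CERTIFICATE_VERIFY_FAILED message, although it would not prove it is the cause.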
Gerrrr commented 8 years ago

Do you run into the same error with Python 2?

FRosner commented 8 years ago

No @Gerrrr. It works with python2.

sudo docker run -it -p 8000:8000 python:2 /bin/sh

# pip install -i https://testpypi.python.org/pypi pyddq
Collecting pyddq
  Downloading https://testpypi.python.org/packages/67/aa/941614e22736a240c5cc0c878e9e16b6cf30090018971254c1c61641434a/pyddq-3.2.0.tar.gz
Building wheels for collected packages: pyddq
  Running setup.py bdist_wheel for pyddq ... done
  Stored in directory: /root/.cache/pip/wheels/4b/c6/1d/4af8e6e3ed0d727bff85bbb8114c0896e8cc78bb2050ded0e5
Successfully built pyddq
Installing collected packages: pyddq
Successfully installed pyddq-3.2.0