Closed FRosner closed 8 years ago
@Gerrrr it doesn't seem too complex, looking at http://stackoverflow.com/questions/36023860/how-to-use-a-scala-class-inside-pyspark
Small prototype:
$ pyspark --driver-class-path ./drunken-data-quality-assembly-3.2.0-SNAPSHOT.jar
from uuid import uuid4

class Check(object):
    def __init__(self, df):
        self.df = df
        # py4j gateway to the JVM that backs this DataFrame
        self.jvm = df._sc._jvm
        displayName = self.jvm.scala.Option.empty()
        cacheMethod = self.jvm.scala.Option.empty()
        constraints = self.jvm.scala.collection.immutable.List.empty()
        id = str(uuid4())  # call uuid4 so the id is an actual UUID
        self.jvmCheck = self.jvm.de.frosner.ddq.core.Check(df._jdf,
                                                            displayName,
                                                            cacheMethod,
                                                            constraints,
                                                            id)

    def isNeverNull(self, columnName):
        self.jvmCheck = self.jvmCheck.isNeverNull(columnName)
        return self

    def run(self, reporters):
        # convert the Python list to a Scala List before passing it on
        jvmReporters = self.jvm.scala.collection.JavaConversions.asScalaBuffer(reporters).toList()
        self.jvmCheck.run(jvmReporters)

rdd = sc.parallelize([(1, "a"), (2, "b"), (3, "c")])
df = sqlContext.createDataFrame(rdd)
jvm = sc._jvm
markdownReporter = jvm.de.frosner.ddq.reporters.MarkdownReporter(jvm.System.out)
check = Check(df)
check.isNeverNull("_1").run([markdownReporter])
Output:
**Checking [_1: bigint, _2: string]**
It has a total number of 2 columns and 3 rows.
- *SUCCESS*: Column _1 is never null.
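The prototype's `isNeverNull` rebinds `self.jvmCheck` to whatever the JVM call returns and then returns `self`, which suggests the Scala `Check` hands back a new object per constraint. Here is a minimal sketch of that chaining pattern, runnable without Spark. `FakeJvmCheck` is a hypothetical stand-in for the py4j proxy and is not part of DDQ:

```python
class FakeJvmCheck(object):
    """Hypothetical stand-in for the py4j proxy of the Scala Check."""

    def __init__(self, constraints=()):
        self.constraints = tuple(constraints)

    def isNeverNull(self, columnName):
        # Behaves like an immutable builder: returns a new object.
        return FakeJvmCheck(self.constraints + ("isNeverNull(%s)" % columnName,))


class Check(object):
    def __init__(self, jvmCheck):
        self.jvmCheck = jvmCheck

    def isNeverNull(self, columnName):
        # Rebind to the new JVM-side object, then return self for chaining.
        self.jvmCheck = self.jvmCheck.isNeverNull(columnName)
        return self


check = Check(FakeJvmCheck()).isNeverNull("_1").isNeverNull("_2")
print(check.jvmCheck.constraints)
```

Returning `self` keeps the Python API fluent even though the wrapped object is replaced on every call.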
I cannot commit to your branch, so here's the patch that adds the pythonItAssembly task to SBT, @Gerrrr.
When you run it, it will run assembly and then put the fat jar in python/drunken-data-quality.jar.
sbt pythonItAssembly
sbt 'set test in assembly := {}' pythonItAssembly
From 01288b9af7d3ea25d1941f1d47de02815d8f3a4b Mon Sep 17 00:00:00 2001
From: Frank Rosner <frank@fam-rosner.de>
Date: Wed, 6 Jul 2016 19:43:23 +0200
Subject: [PATCH] #91 python assembly task to provide fat jar for python tests
---
build.sbt | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/build.sbt b/build.sbt
index 47bec15..3824233 100644
--- a/build.sbt
+++ b/build.sbt
@@ -35,3 +35,7 @@ fork := true
javaOptions += "-Xmx2G"
javaOptions += "-XX:MaxPermSize=512m"
+
+lazy val pythonItAssembly = taskKey[Unit]("python-it-assembly")
+
+pythonItAssembly <<= assembly map { (asm) => s"cp ${asm.getAbsolutePath()} python/drunken-data-quality.jar" ! }
--
2.6.3
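For comparison, the copy step that the task shells out to (`cp <fat jar> python/drunken-data-quality.jar`) could also be done from the Python side without a shell. This is a rough sketch only; `copy_fat_jar` and its arguments are hypothetical names for illustration, not something in the repo:

```python
import os
import shutil


def copy_fat_jar(assembly_jar, target="python/drunken-data-quality.jar"):
    """Copy the assembled jar to the fixed path the Python tests expect."""
    target_dir = os.path.dirname(target)
    # Create the target directory if it does not exist yet.
    if target_dir and not os.path.isdir(target_dir):
        os.makedirs(target_dir)
    shutil.copyfile(assembly_jar, target)
    return target
```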
Do you need any help @Gerrrr ?
Hi @FRosner, No, I just did not have time last week. I will work on the issue this week.
@Gerrrr Please let me know when the merge request is ready to be merged, as I would also like to change a few minor things in the readme.
Following your instructions from http://peterdowns.com/posts/first-time-with-pypi.html @Gerrrr, I tried
frosner:python/ (issue/91-2*) $ python setup.py register -r pypitest [20:52:41]
running register
running egg_info
writing requirements to pyddq.egg-info/requires.txt
writing pyddq.egg-info/PKG-INFO
writing top-level names to pyddq.egg-info/top_level.txt
writing dependency_links to pyddq.egg-info/dependency_links.txt
[pbr] Reusing existing SOURCES.txt
running check
Registering pyddq to https://testpypi.python.org/pypi
Server response (400): Invalid version, cannot use PEP 440 local versions on PyPI.
Can we try to debug together?
Actually, this guide explicitly says that you should not use register to do it: https://packaging.python.org/distributing/#uploading-your-project-to-pypi
@FRosner, sure! Let's debug it tomorrow.
The version used by setup.py is generated by egg_info (the third line in your output above) and derived from the status of your repository:
✗ cat pyddq.egg-info/PKG-INFO
Metadata-Version: 1.1
Name: pyddq
Version: 3.1.0.post0.dev12+ng367b68b.dirty
Summary: Python binding to Drunken Data Quality
In order to publish a package on PyPI, its version must comply with the public version identifiers scheme of PEP 440.
Currently, the version contains 3.1.0 because this is the latest tag in the git repo, post0.dev12+ng367b68b identifies the commit, and dirty means that there are uncommitted changes.
Please commit your changes, make a tag (e.g. 3.2.0) and try again.
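To see why the server rejected the version, split it at the `+`: everything after it is a PEP 440 local version label, which is what the "cannot use PEP 440 local versions" response refers to. A minimal sketch using plain string handling (the helper names are made up for illustration, and it does not validate the public part):

```python
def split_local_version(version):
    """Split a version string into (public, local) at the first '+'."""
    public, sep, local = version.partition("+")
    return public, (local if sep else None)


def is_public_version(version):
    """True if the version carries no local version label."""
    return split_local_version(version)[1] is None


print(split_local_version("3.1.0.post0.dev12+ng367b68b.dirty"))
print(is_public_version("3.1.0.post0.dev12+ng367b68b.dirty"))
print(is_public_version("3.2.0"))
```

With a clean checkout on a tag, the generated version has no `+...` part, so it passes this check.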
Did you call me dirty?
frosner:python/ (issue/91-2) $ python setup.py register -r pypitest
running register
running egg_info
writing requirements to pyddq.egg-info/requires.txt
writing pyddq.egg-info/PKG-INFO
writing top-level names to pyddq.egg-info/top_level.txt
writing dependency_links to pyddq.egg-info/dependency_links.txt
[pbr] Reusing existing SOURCES.txt
running check
Registering pyddq to https://testpypi.python.org/pypi
Server response (200): OK
👍
@Gerrrr but when I try to pull it I get:
pip install -i https://testpypi.python.org/pypi pyddq
Collecting pyddq
Could not find a version that satisfies the requirement pyddq (from versions: )
No matching distribution found for pyddq
Sorry I should RTFM...
frosner:python/ (issue/91-2) $ python setup.py sdist upload -r pypitest
running sdist
running egg_info
writing requirements to pyddq.egg-info/requires.txt
writing pyddq.egg-info/PKG-INFO
writing top-level names to pyddq.egg-info/top_level.txt
writing dependency_links to pyddq.egg-info/dependency_links.txt
[pbr] Processing SOURCES.txt
[pbr] In git context, generating filelist from git
warning: no files found matching 'AUTHORS'
warning: no files found matching 'ChangeLog'
warning: no previously-included files found matching '.gitreview'
warning: no previously-included files matching '*.pyc' found anywhere in distribution
writing manifest file 'pyddq.egg-info/SOURCES.txt'
running check
creating pyddq-3.2.0
creating pyddq-3.2.0/pyddq
creating pyddq-3.2.0/pyddq.egg-info
making hard links in pyddq-3.2.0...
hard linking README.rst -> pyddq-3.2.0
hard linking setup.cfg -> pyddq-3.2.0
hard linking setup.py -> pyddq-3.2.0
hard linking pyddq/__init__.py -> pyddq-3.2.0/pyddq
hard linking pyddq/core.py -> pyddq-3.2.0/pyddq
hard linking pyddq/jvm_conversions.py -> pyddq-3.2.0/pyddq
hard linking pyddq/reporters.py -> pyddq-3.2.0/pyddq
hard linking pyddq/streams.py -> pyddq-3.2.0/pyddq
hard linking pyddq.egg-info/PKG-INFO -> pyddq-3.2.0/pyddq.egg-info
hard linking pyddq.egg-info/SOURCES.txt -> pyddq-3.2.0/pyddq.egg-info
hard linking pyddq.egg-info/dependency_links.txt -> pyddq-3.2.0/pyddq.egg-info
hard linking pyddq.egg-info/not-zip-safe -> pyddq-3.2.0/pyddq.egg-info
hard linking pyddq.egg-info/requires.txt -> pyddq-3.2.0/pyddq.egg-info
hard linking pyddq.egg-info/top_level.txt -> pyddq-3.2.0/pyddq.egg-info
copying setup.cfg -> pyddq-3.2.0
Writing pyddq-3.2.0/setup.cfg
Creating tar archive
removing 'pyddq-3.2.0' (and everything under it)
running upload
Submitting dist/pyddq-3.2.0.tar.gz to https://testpypi.python.org/pypi
Server response (200): OK
Now the problem is:
sudo docker run -it -p 8000:8000 python:alpine /bin/sh
pip install -i https://testpypi.python.org/pypi pyddq
Collecting pyddq
Downloading https://testpypi.python.org/packages/67/aa/941614e22736a240c5cc0c878e9e16b6cf30090018971254c1c61641434a/pyddq-3.2.0.tar.gz
Complete output from command python setup.py egg_info:
Download error on https://pypi.python.org/simple/pyscaffold/: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:645) -- Some packages may not be found!
Couldn't find index page for 'pyscaffold' (maybe misspelled?)
Download error on https://pypi.python.org/simple/: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:645) -- Some packages may not be found!
No local packages or download links found for pyscaffold<2.6a0,>=2.5a0
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/tmp/pip-build-9wom9_tk/pyddq/setup.py", line 64, in <module>
setup_package()
File "/tmp/pip-build-9wom9_tk/pyddq/setup.py", line 59, in setup_package
"integration_test": IntegrationTestCommand
File "/usr/local/lib/python3.5/distutils/core.py", line 108, in setup
_setup_distribution = dist = klass(attrs)
File "/usr/local/lib/python3.5/site-packages/setuptools/dist.py", line 268, in __init__
self.fetch_build_eggs(attrs['setup_requires'])
File "/usr/local/lib/python3.5/site-packages/setuptools/dist.py", line 313, in fetch_build_eggs
replace_conflicting=True,
File "/usr/local/lib/python3.5/site-packages/pkg_resources/__init__.py", line 836, in resolve
dist = best[req.key] = env.best_match(req, ws, installer)
File "/usr/local/lib/python3.5/site-packages/pkg_resources/__init__.py", line 1081, in best_match
return self.obtain(req, installer)
File "/usr/local/lib/python3.5/site-packages/pkg_resources/__init__.py", line 1093, in obtain
return installer(requirement)
File "/usr/local/lib/python3.5/site-packages/setuptools/dist.py", line 380, in fetch_build_egg
return cmd.easy_install(req)
File "/usr/local/lib/python3.5/site-packages/setuptools/command/easy_install.py", line 623, in easy_install
raise DistutilsError(msg)
distutils.errors.DistutilsError: Could not find suitable distribution for Requirement.parse('pyscaffold<2.6a0,>=2.5a0')
----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-9wom9_tk/pyddq/
Do you run into the same error with Python 2?
No @Gerrrr. It works with python2.
sudo docker run -it -p 8000:8000 python:2 /bin/sh
# pip install -i https://testpypi.python.org/pypi pyddq
Collecting pyddq
Downloading https://testpypi.python.org/packages/67/aa/941614e22736a240c5cc0c878e9e16b6cf30090018971254c1c61641434a/pyddq-3.2.0.tar.gz
Building wheels for collected packages: pyddq
Running setup.py bdist_wheel for pyddq ... done
Stored in directory: /root/.cache/pip/wheels/4b/c6/1d/4af8e6e3ed0d727bff85bbb8114c0896e8cc78bb2050ded0e5
Successfully built pyddq
Installing collected packages: pyddq
Successfully installed pyddq-3.2.0
ToDos
- README.md: consider renaming "Python binding" to "Python API"
- README.md: consider moving the Python API section to the end of the usage section, and putting the pip install command in the "Getting DDQ" section already. This way, people are aware at the beginning of the readme file that there will be a Python API coming.
- README.md: Capitalize "Scala" 📦
- python/README.rst: Fill long description and review the rest
References