awslabs / python-deequ

Python API for Deequ
Apache License 2.0

Unable to compute Uniqueness/UniqueValueRatio #8

Closed ol-eg closed 3 years ago

ol-eg commented 3 years ago

Describe the bug: I am trying the library in a Jupyter notebook, following the quickstart example. Computing Uniqueness or UniqueValueRatio raises a Java exception: 'Method iterableAsScalaIterable([class java.lang.String]) does not exist'.

To Reproduce Steps to reproduce the behaviour:

  1. Start Jupyter Notebook in Docker, using one of the latest images from the Jupyter team: $ docker run -p 8888:8888 jupyter/all-spark-notebook:a0a544e6dc6e
  2. Install pydeequ from inside the notebook: !pip install pydeequ
  3. Follow the quickstart instructions to set up the Spark session and run the deequ analyzers. At this point everything works; however, adding the analyzer .addAnalyzer(Uniqueness("b")) raises the error (screenshots attached).
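
The steps above can be sketched as a small script. This is a hedged reconstruction based on the pydeequ quickstart, not the reporter's exact notebook: `SAMPLE_ROWS`, `expected_uniqueness`, and `run_analysis` are hypothetical names, and the Spark part is imported lazily so the file loads without Spark installed. The pure-Python helper shows the value Uniqueness should produce (the fraction of rows whose value occurs exactly once), which is useful for checking the result once the analyzer runs.

```python
from collections import Counter

# Sample rows in the style of the pydeequ quickstart (columns a, b, c).
SAMPLE_ROWS = [
    {"a": "foo", "b": 1, "c": 5},
    {"a": "bar", "b": 2, "c": 6},
    {"a": "baz", "b": 3, "c": 6},
]


def expected_uniqueness(values):
    """Fraction of rows whose value occurs exactly once in the column."""
    counts = Counter(values)
    return sum(1 for n in counts.values() if n == 1) / len(values)


def run_analysis(columns="b"):
    """Run Size and Uniqueness on the sample data; needs pyspark and pydeequ.

    columns="b" mirrors the call in the report. Assumption (not confirmed in
    this thread): the analyzer may expect a list such as ["b"], which would
    fit the error message about a String argument.
    """
    # Lazy imports: only needed when the function is actually called.
    from pyspark.sql import Row, SparkSession
    import pydeequ
    from pydeequ.analyzers import AnalysisRunner, AnalyzerContext, Size, Uniqueness

    spark = (
        SparkSession.builder
        .config("spark.jars.packages", pydeequ.deequ_maven_coord)
        .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
        .getOrCreate()
    )
    df = spark.createDataFrame([Row(**row) for row in SAMPLE_ROWS])
    result = (
        AnalysisRunner(spark)
        .onData(df)
        .addAnalyzer(Size())
        .addAnalyzer(Uniqueness(columns))  # the step that fails in the report
        .run()
    )
    return AnalyzerContext.successMetricsAsDataFrame(spark, result)
```

In a notebook, `run_analysis().show()` would display the metrics; for column b of the sample data the expected Uniqueness is `expected_uniqueness([1, 2, 3])`.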

Screenshots: [four images]

Additional context: I have tried a few different combinations of Spark, Scala, and Amazon Deequ library versions but did not manage to make this work.

gucciwang commented 3 years ago

Hi! Apologies for the late reply amidst the holidays. My guess is that your Spark session is unable to access deequ-1.0.3.jar. We use Ivy to download the jar from Maven, so there may be a mismatch between where those jars are stored in Docker and on your main machine. Judging from the screenshot, you listed the jars in /usr/local/spark/jars/, whereas Ivy downloaded deequ-1.0.3.jar into /home/jovyan/.ivy2/jars/.
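
One way to check where the deequ jar actually landed is to list matching files in both directories mentioned above. The helper below is a minimal sketch using only the standard library; `find_jars` is a hypothetical name, and the `*deequ*.jar` pattern assumes that Ivy-downloaded jars carry the group id as a filename prefix.

```python
from pathlib import Path


def find_jars(directories, pattern="*deequ*.jar"):
    """Map each directory to the sorted jar file names matching the pattern."""
    found = {}
    for directory in directories:
        path = Path(directory)
        # A missing directory is reported as an empty list, not an error.
        found[str(directory)] = (
            sorted(f.name for f in path.glob(pattern)) if path.is_dir() else []
        )
    return found
```

Run inside the container, `find_jars(["/usr/local/spark/jars", "/home/jovyan/.ivy2/jars"])` would show which of the two locations holds a deequ jar.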

Does your notebook work when running just addAnalyzer(Size())? That would support the idea that the Spark session is unable to access the deequ jar.

Also, we have only developed and supported up to deequ-1.0.3, so please stick to that version!

ol-eg commented 3 years ago

Hi, thanks for getting back to me. I think this is the issue: [image]

I think the fix for me will be to find an image with compatible Spark/Scala versions and to configure PySpark with the extra ivy2 path.
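
A rough sketch of that plan follows. It assumes the Spark settings `spark.jars.packages` and `spark.jars.ivy` are used to point the session at the Ivy directory, and that the Maven coordinate for the jar discussed above is `com.amazon.deequ:deequ:1.0.3`; the function names are hypothetical. Reading the Scala version goes through the private `_jvm` handle of the Spark context, which is a convenience rather than a supported API.

```python
def session_options(ivy_dir, deequ_coord="com.amazon.deequ:deequ:1.0.3"):
    """Spark settings that fetch deequ and use ivy_dir as the Ivy directory."""
    return {
        "spark.jars.packages": deequ_coord,  # assumed Maven coordinate
        "spark.jars.ivy": ivy_dir,           # Ivy user directory for packages
    }


def build_session(options):
    """Create a Spark session with the given settings; needs pyspark."""
    from pyspark.sql import SparkSession  # lazy import

    builder = SparkSession.builder
    for key, value in options.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


def runtime_versions(spark):
    """Report the Spark and Scala versions of a running session."""
    jvm = spark.sparkContext._jvm  # private attribute, used here for inspection
    return {
        "spark": spark.version,
        "scala": jvm.scala.util.Properties.versionNumberString(),
    }
```

For example, `runtime_versions(build_session(session_options("/home/jovyan/.ivy2")))` would show whether the image's Scala version matches the one the deequ jar was built for. Package settings only take effect when the JVM starts, so in a notebook the kernel likely needs a restart before a changed configuration is picked up.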

thx vm.