awslabs / python-deequ

Python API for Deequ
Apache License 2.0

check with hasSize/hasMin etc. fails with the error below. PyDeequ version 1.0.0 #64

Open Sankeernalk opened 3 years ago

Sankeernalk commented 3 years ago

Describe the bug
A VerificationSuite check using hasSize/hasMin etc. fails with the following error:

Traceback (most recent call last):
  File "/Downloads/spark-2.4.4-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 2146, in start
OSError: [Errno 48] Address already in use

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/DeequPy/adhoc_tasks/test_checks.py", line 19, in <module>
    check.hasSize(lambda x: x >= 100)
  File "/opt/anaconda3/envs/DeequPy/lib/python3.7/site-packages/pydeequ/checks.py", line 134, in hasSize
    assertion_func = ScalaFunction1(self._spark_session.sparkContext._gateway, assertion)
  File "/opt/anaconda3/envs/DeequPy/lib/python3.7/site-packages/pydeequ/scala_utils.py", line 32, in __init__
    super().__init__(gateway)
  File "/opt/anaconda3/envs/DeequPy/lib/python3.7/site-packages/pydeequ/scala_utils.py", line 16, in __init__
    self.gateway.start_callback_server()
  File "/Downloads/spark-2.4.4-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1836, in start_callback_server
  File "/Downloads/spark-2.4.4-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 2155, in start
py4j.protocol.Py4JNetworkError: An error occurred while trying to start the callback server (127.0.0.1:25334)

To Reproduce
Steps to reproduce the behavior:

check.hasSize(lambda x: x >= 100)

result = VerificationSuite(spark).onData(df).addCheck(check).run()

result_df = VerificationResult.checkResultsAsDataFrame(spark, result)

result_df.show()
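For context, a self-contained version of these steps might look like the sketch below. The DataFrame contents, the app name, and the Maven-coordinate configuration (pydeequ.deequ_maven_coord / pydeequ.f2j_maven_coord, which in recent PyDeequ releases expect the SPARK_VERSION environment variable) are illustrative assumptions, not taken from the report.

import pydeequ
from pyspark.sql import SparkSession
from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationSuite, VerificationResult

# Assumes the SPARK_VERSION env var is set so pydeequ can resolve the matching Deequ jar.
spark = (SparkSession.builder
         .appName("pydeequ-hasSize-repro")
         .config("spark.jars.packages", pydeequ.deequ_maven_coord)
         .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
         .getOrCreate())

# Dummy data; any DataFrame with at least 100 rows would do.
df = spark.createDataFrame([(i,) for i in range(150)], ["id"])

check = Check(spark, CheckLevel.Error, "size check")
check.hasSize(lambda x: x >= 100)  # this call starts the Py4J callback server for the lambda

result = VerificationSuite(spark).onData(df).addCheck(check).run()
result_df = VerificationResult.checkResultsAsDataFrame(spark, result)
result_df.show()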

oscarcampos-c commented 1 year ago

I was having the same issue. Make sure you shut down the Spark app before spawning another one; this solved it for me:

spark.sparkContext._gateway.shutdown_callback_server()

Then

spark.stop()
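Put together with the reproduction sketch above, the teardown order would look like this (a sketch; spark, df and check are the objects from that earlier snippet):

try:
    result = VerificationSuite(spark).onData(df).addCheck(check).run()
    VerificationResult.checkResultsAsDataFrame(spark, result).show()
finally:
    # Stop the Py4J callback server that PyDeequ started for the lambda assertions
    # before stopping Spark, so the port is released even if the checks raise.
    spark.sparkContext._gateway.shutdown_callback_server()
    spark.stop()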

Ashokgoa commented 9 months ago

Hi @chenliu0831 @ammar-nizami @oscarcampos-c, how did you handle this issue when running many Spark jobs with PyDeequ checks at the same time? Any solutions or suggestions?

In my case only one job runs; the rest of the jobs fail with the same error:

Traceback (most recent call last):
  File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 2207, in start
OSError: [Errno 98] Address already in use

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/ammar/pydeequ_poc_pyspark.py", line 26, in <module>
    _check = _check_func(*_args)
  File "/usr/local/lib/python3.7/site-packages/pydeequ/checks.py", line 134, in hasSize
    assertion_func = ScalaFunction1(self._spark_session.sparkContext._gateway, assertion)
  File "/usr/local/lib/python3.7/site-packages/pydeequ/scala_utils.py", line 32, in __init__
    super().__init__(gateway)
  File "/usr/local/lib/python3.7/site-packages/pydeequ/scala_utils.py", line 16, in __init__
    self.gateway.start_callback_server()
  File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1894, in start_callback_server
  File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 2216, in start
py4j.protocol.Py4JNetworkError: An error occurred while trying to start the callback server (127.0.0.1:25334)

21/11/19 13:44:06 INFO SparkContext: Invoking stop() from shutdown hook
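One workaround sometimes suggested for concurrent jobs on the same host (not confirmed in this thread, and it leans on Py4J/PySpark internals such as _callback_server and resetCallbackClient) is to start the callback server yourself on an ephemeral port before building any Check, so that PyDeequ's own start_callback_server() call becomes a no-op instead of contending for the default port 25334. The helper name below is hypothetical and the whole snippet is a rough, untested sketch:

from py4j.java_gateway import JavaObject

def start_callback_server_on_free_port(spark):
    # Sketch only: relies on unsupported Py4J internals.
    gw = spark.sparkContext._gateway
    if getattr(gw, "_callback_server", None):
        return  # callback server already running in this process
    params = gw.callback_server_parameters
    params.daemonize = True
    params.daemonize_connections = True
    params.port = 0  # 0 = let the OS pick a free port instead of the default 25334
    gw.start_callback_server(params)
    # Tell the JVM-side callback client which port Python actually bound to.
    cb_port = gw._callback_server.server_socket.getsockname()[1]
    gw._callback_server.port = cb_port
    jgws = JavaObject("GATEWAY_SERVER", gw._gateway_client)
    jgws.resetCallbackClient(jgws.getCallbackClient().getAddress(), cb_port)

start_callback_server_on_free_port(spark)  # call once per job, before the first Check

Each job would then bind its own callback port, but whether Deequ's JVM-side callbacks behave correctly with a non-default port should be verified before relying on this.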