peter-mcclonski closed this issue 1 year ago
Specifically tagging @lecardozo for visibility. Issue seems to be here:
My suspicion is that the subprocess doesn't fully close out the Spark context, or at least it isn't garbage collected quickly enough, and the SparkSession builder somehow isn't able to find it across process boundaries? I'm not entirely sure, but replacing this with the following snippet works for me locally:
```python
from pyspark.sql import SparkSession

def _get_spark_version() -> str:
    # Reuse (or create) the current session instead of spawning a subprocess,
    # so we don't collide with an existing SparkContext.
    session = SparkSession.builder.getOrCreate()
    version = session.version
    spark_version = version[:-2]  # e.g. "3.2.2" -> "3.2"
    session.stop()
    return spark_version
```
Thanks for the heads up, @pajablons! 😃
> My suspicion is that the subprocess doesn't fully close out the spark context, or at least it isn't garbage collected quickly enough, and the SparkSession builder somehow isn't able to find it across process boundaries?
That might be what's happening. I opted for this "weird" subprocess-based solution because tests were failing with a solution similar to yours. The tests failed because, when they tried setting up a new context, they actually just recycled the context that had already been created inside `_get_spark_version`. As that context was created without the `deequ_maven_coord` configuration, the dependencies were not downloaded, breaking the tests.
I have a feeling that your solution works in this case because you are just recycling the already-created session (i.e. just getting, instead of creating). 🤔 Can you confirm whether you can properly run the tests with your suggested approach?
We might need to go back to the previous, explicit approach of using the `SPARK_VERSION` environment variable. That way we can make sure that we do not interfere with users' `SparkContext` creation.
On an additional note: I feel this `_get_spark_version` is a convoluted way to make the user's life easier by telling them exactly which Maven coordinate should be installed at runtime. I personally don't use the `deequ_maven_config` in production, though I agree it's useful for local testing.
Would you mind trying out the revert from this branch?
```shell
pip install https://github.com/lecardozo/python-deequ/archive/fix/get-spark-version.zip
```
Let me know if this is working for you, or if you have any suggestions. 😃
That revert works perfectly for me, with one small caveat: it's not completely future-proof (which isn't a new thing, or a big deal). This line:
will obviously break if by some miracle Apache decides to drop Spark 3.10.0 on us or something. It's one of those super-benign things that will probably never be an issue, but I think it's worth noting.
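If anyone ever wants to guard against that, splitting on the dots is a small, digit-count-agnostic alternative to slicing a fixed number of characters off the end. This is a hypothetical helper, not part of the patch:

```python
def major_minor(version: str) -> str:
    # Keep only the first two dotted components, e.g. "3.2.2" -> "3.2".
    # Fixed-length slicing like version[:-2] breaks as soon as a component
    # grows past one digit: "3.2.10"[:-2] yields the mangled string "3.2.".
    return ".".join(version.split(".")[:2])
```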
Closing after the revert. Thanks everyone for contributing and reporting.
**Describe the bug** With the newest commit adding support for Spark 3.1/3.2, pydeequ uses a new method for detecting the (py)spark version. It spawns a new subprocess within the Spark JVM and does not appropriately clean up the created SparkContext, so subsequent calls to `SparkSession.builder.getOrCreate()` fail when they attempt to create a second SparkContext. This bug was not present historically, when the Spark version was detected through an environment variable.
Error output:
**To Reproduce** On a Spark 3.2.2 (likely any 3.x) installation:

1. `pip install git+https://github.com/awslabs/python-deequ.git@7ec9f6f72839779f370a4753c0db26d6cf052203`
2. Create `$SPARK_HOME/jobs/test.py` with the following contents:
3. Run `$SPARK_HOME/bin/spark-submit jobs/test.py`
**Expected behavior** PySpark does not crash when I try to create a Spark session.