hardiktalati closed this issue 2 years ago
I see you are using sparkmeasure from PySpark. Can you please try the full example in the README.md and let me know if you still see an error?
# Python CLI, Spark 3.x
pip install sparkmeasure
bin/pyspark --packages ch.cern.sparkmeasure:spark-measure_2.12:0.17
from sparkmeasure import StageMetrics
stagemetrics = StageMetrics(spark)
stagemetrics.runandmeasure(locals(), 'spark.sql("select count(*) from range(1000) cross join range(1000) cross join range(100)").show()')
Hey Luca,
It worked on my work laptop - I am trying on my personal laptop and it's behaving the same way again.
from pyspark.sql import SparkSession
import os
os.environ['SPARK_HOME'] = "C:\spark-3.2.1-bin-hadoop3.2"
spark = SparkSession.builder \
    .config("spark.jars.packages", "ch.cern.sparkmeasure:spark-measure_2.12:0.17").getOrCreate()
from sparkmeasure import StageMetrics
stagemetrics = StageMetrics(spark)
ERROR:
raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling None.ch.cern.sparkmeasure.StageMetrics.
: java.lang.NoClassDefFoundError: scala/Product$class
I would like to check which version of Spark you are using. I suspect you may be using Spark 2.x with Scala 2.11. Can you please run spark.version?
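For reference, a minimal sketch of how the Spark and Scala versions can be checked from a PySpark session (the Scala check goes through the py4j gateway; it is an illustration, not part of the sparkMeasure API):
# Print the Spark version (spark-measure_2.12 jars expect a Spark build on Scala 2.12)
print(spark.version)
# Print the Scala version the Spark JVM was built with, via the py4j gateway;
# a scala/Product$class NoClassDefFoundError typically points to a Scala 2.11 build
print(spark.sparkContext._jvm.scala.util.Properties.versionString())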
Instead of setting SPARK_HOME I'd suggest the use of findspark:
import findspark
findspark.init("SPARK_HOME_HERE")
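For example, a minimal sketch putting findspark together with the sparkMeasure configuration (the local Spark path and app name are example values; adjust them to your install):
import findspark
findspark.init("C:/spark-3.2.1-bin-hadoop3.2")  # example path to the local Spark install

from pyspark.sql import SparkSession
from sparkmeasure import StageMetrics

# Pull in the sparkMeasure jar via spark.jars.packages rather than setting SPARK_HOME by hand
spark = (SparkSession.builder
         .appName("sparkmeasure-test")
         .config("spark.jars.packages", "ch.cern.sparkmeasure:spark-measure_2.12:0.18")
         .getOrCreate())

stagemetrics = StageMetrics(spark)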
BTW I have since released sparkmeasure version 0.18.
Hey Luca, it's working now. Thanks for the help. It was an issue with my Java version, or it was fixed when I upgraded to the latest sparkmeasure.
Thanks for the feedback.
Hey @hardiktalati, what versions of sparkMeasure, Java and PySpark worked for you?
I'm running sparkmeasure==0.23.0, pyspark==3.3 and java==11, and I'm getting the following error:
py4j.protocol.Py4JError: ch.cern.sparkmeasure.TaskMetrics does not exist in the JVM
@LucaCanali your input here would be greatly appreciated too.
Many thanks both, in advance.
Hi @mccartni-aws does something like this work for you?
# Example of how to use sparkMeasure from the Spark Python CLI, Spark 3.x
# install the Python wrapper
pip install sparkmeasure
# run PySpark with sparkMeasure library jars
# Note: if running on a notebook or Python directly, instead of PySpark, make sure to have the config spark.jars.packages set to ch.cern.sparkmeasure:spark-measure_2.12:0.23 (see the sketch after this example)
bin/pyspark --packages ch.cern.sparkmeasure:spark-measure_2.12:0.23
# StageMetrics example
from sparkmeasure import StageMetrics
stagemetrics = StageMetrics(spark)
stagemetrics.runandmeasure(locals(), 'spark.sql("select count(*) from range(1000) cross join range(1000) cross join range(100)").show()')
# TaskMetrics example
from sparkmeasure import TaskMetrics
taskmetrics = TaskMetrics(spark)
taskmetrics.runandmeasure(locals(), 'spark.sql("select count(*) from range(1000) cross join range(1000) cross join range(100)").show()')
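As mentioned in the note above, here is a minimal sketch of the equivalent setup when running from a notebook or plain Python rather than the PySpark CLI (the app name is just an example value):
from pyspark.sql import SparkSession
from sparkmeasure import StageMetrics

# When not launching via bin/pyspark --packages, set spark.jars.packages on the builder
spark = (SparkSession.builder
         .appName("sparkmeasure-example")
         .config("spark.jars.packages", "ch.cern.sparkmeasure:spark-measure_2.12:0.23")
         .getOrCreate())

stagemetrics = StageMetrics(spark)
stagemetrics.runandmeasure(locals(), 'spark.sql("select count(*) from range(1000) cross join range(1000)").show()')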
Thanks a lot @LucaCanali.
I'm running it from a SageMaker processing job, not from a notebook. I will try it out now in a notebook, though this will not be my end use-case.
I'm wondering whether it's anything to do with me not setting the correct spark path and Py4j path.
Do these versions look good to you? sparkmeasure==0.23.0, pyspark==3.3 and java==11
I confirm that sparkMeasure works OK for me with sparkmeasure==0.23.0, pyspark/spark==3.3.x and java==11. BTW, no need to run on a notebook; running from the PySpark CLI is even easier for testing.
File "C:\Users\talathar\Miniconda3\envs\XXXXX\lib\site-packages\sparkmeasure\stagemetrics.py", line 15, in init self.stagemetrics = self.sc._jvm.ch.cern.sparkmeasure.StageMetrics(self.sparksession._jsparkSession) File "C:\spark-3.2.1-bin-hadoop3.2\python\lib\py4j-0.10.9.3-src.zip\py4j\java_gateway.py", line 1586, in call File "C:\spark-3.2.1-bin-hadoop3.2\python\pyspark\sql\utils.py", line 111, in deco return f(*a, **kw) File "C:\spark-3.2.1-bin-hadoop3.2\python\lib\py4j-0.10.9.3-src.zip\py4j\protocol.py", line 328, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling None.ch.cern.sparkmeasure.StageMetrics. : java.lang.NoClassDefFoundError: scala/Product$class