LucaCanali / sparkMeasure

This is the development repository for sparkMeasure, a tool and library designed for efficient analysis and troubleshooting of Apache Spark jobs. It focuses on easing the collection and examination of Spark metrics, making it a practical choice for both developers and data engineers.
Apache License 2.0

throwing error when trying to make work locally #38

Closed: hardiktalati closed this issue 2 years ago

hardiktalati commented 2 years ago

File "C:\Users\talathar\Miniconda3\envs\XXXXX\lib\site-packages\sparkmeasure\stagemetrics.py", line 15, in init self.stagemetrics = self.sc._jvm.ch.cern.sparkmeasure.StageMetrics(self.sparksession._jsparkSession) File "C:\spark-3.2.1-bin-hadoop3.2\python\lib\py4j-0.10.9.3-src.zip\py4j\java_gateway.py", line 1586, in call File "C:\spark-3.2.1-bin-hadoop3.2\python\pyspark\sql\utils.py", line 111, in deco return f(*a, **kw) File "C:\spark-3.2.1-bin-hadoop3.2\python\lib\py4j-0.10.9.3-src.zip\py4j\protocol.py", line 328, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling None.ch.cern.sparkmeasure.StageMetrics. : java.lang.NoClassDefFoundError: scala/Product$class

LucaCanali commented 2 years ago

I see you are using sparkmeasure from PySpark. Can you please try the full example in the README.md and let me know if you still see an error?

# Python CLI, Spark 3.x
pip install sparkmeasure
bin/pyspark --packages ch.cern.sparkmeasure:spark-measure_2.12:0.17

from sparkmeasure import StageMetrics
stagemetrics = StageMetrics(spark)
stagemetrics.runandmeasure(locals(), 'spark.sql("select count(*) from range(1000) cross join range(1000) cross join range(100)").show()')
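
For reference, a minimal alternative sketch (an editorial addition, assuming the begin()/end() API described in the sparkMeasure documentation) that instruments the code block directly instead of passing it as a string:

# alternative: wrap the measured code explicitly
stagemetrics.begin()
spark.sql("select count(*) from range(1000) cross join range(1000) cross join range(100)").show()
stagemetrics.end()
# print the collected stage metrics report
stagemetrics.print_report()
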
hardiktalati commented 2 years ago

Hey Luca,

It worked on my work laptop. I am trying on my personal one and it's behaving the same way again.

from pyspark.sql import SparkSession
import os

os.environ['SPARK_HOME'] = "C:\spark-3.2.1-bin-hadoop3.2"
spark = SparkSession.builder\
    .config("spark.jars.packages", "ch.cern.sparkmeasure:spark-measure_2.12:0.17").getOrCreate()

from sparkmeasure import StageMetrics
stagemetrics = StageMetrics(spark)

ERROR:
    raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling None.ch.cern.sparkmeasure.StageMetrics.
: java.lang.NoClassDefFoundError: scala/Product$class

LucaCanali commented 2 years ago

I would like to check which version of Spark you are using. I suspect you may be using Spark 2.x with Scala 2.11, which would explain the scala/Product$class error, since the spark-measure_2.12 jar is built for Scala 2.12. Can you please run spark.version? Instead of setting SPARK_HOME, I'd suggest using findspark:

import findspark
findspark.init("SPARK_HOME_HERE")
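
A quick way to verify (an editorial sketch, not part of the original reply) is to print both the Spark version and the Scala version of the driver JVM; the spark-measure artifact suffix (_2.11 vs _2.12) must match the Scala version:

# print the Spark version and the Scala version of the running JVM
print(spark.version)
print(spark.sparkContext._jvm.scala.util.Properties.versionString())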

BTW, I have since released sparkMeasure version 0.18.

hardiktalati commented 2 years ago

Hey Luca, it's working now. Thanks for the help. It was an issue with my Java version, or it got fixed when I upgraded to the latest sparkmeasure.

LucaCanali commented 2 years ago

Thanks for the feedback.

mccartni-aws commented 7 months ago

Hey @hardiktalati, what versions of sparkMeasure, Java, and PySpark worked for you?

I'm running sparkmeasure==0.23.0, pyspark==3.3, and java==11, and I'm getting the following error: py4j.protocol.Py4JError: ch.cern.sparkmeasure.TaskMetrics does not exist in the JVM.

@LucaCanali Your input here would be greatly appreciated too.

Many thanks to you both, in advance.

LucaCanali commented 7 months ago

Hi @mccartni-aws does something like this work for you?

# Example of how to use sparkMeasure from the Spark Python CLI, Spark 3.x
# install the Python wrapper
pip install sparkmeasure 
# run PySpark with sparkMeasure library jars
# Note: if running on a notebook or Python directly, instead of PySpark, make sure to have the config spark.jars.packages set to ch.cern.sparkmeasure:spark-measure_2.12:0.23 
bin/pyspark --packages ch.cern.sparkmeasure:spark-measure_2.12:0.23 

# StageMetrics example
from sparkmeasure import StageMetrics
stagemetrics = StageMetrics(spark)
stagemetrics.runandmeasure(locals(), 'spark.sql("select count(*) from range(1000) cross join range(1000) cross join range(100)").show()')

# TaskMetrics example
from sparkmeasure import TaskMetrics
taskmetrics = TaskMetrics(spark)
taskmetrics.runandmeasure(locals(), 'spark.sql("select count(*) from range(1000) cross join range(1000) cross join range(100)").show()')
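
For completeness, a minimal sketch (editorial, based on the spark.jars.packages note in the comments above) for the case where the script is launched with plain Python rather than bin/pyspark, e.g. inside a processing job:

# plain-Python equivalent: request the sparkMeasure jar via spark.jars.packages
# before the SparkSession is created ("sparkmeasure-test" is a hypothetical app name)
from pyspark.sql import SparkSession
spark = (SparkSession.builder
         .appName("sparkmeasure-test")
         .config("spark.jars.packages", "ch.cern.sparkmeasure:spark-measure_2.12:0.23")
         .getOrCreate())

from sparkmeasure import StageMetrics
stagemetrics = StageMetrics(spark)
stagemetrics.runandmeasure(locals(), 'spark.sql("select count(*) from range(1000) cross join range(100)").show()')
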
mccartni-aws commented 7 months ago

Thanks a lot @LucaCanali.

I'm running it from a SageMaker processing job, not from a notebook. I will try it out now in a notebook, though this will not be my end use case.

I'm wondering whether it has anything to do with me not setting the correct Spark path and Py4J path.

Do these versions look good to you? sparkmeasure==0.23.0, pyspark==3.3 and java==11
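
One thing worth checking (an editorial suggestion, not from the thread): the "does not exist in the JVM" error usually means the sparkMeasure jar never reached the driver classpath, for instance because spark.jars.packages was set after the SparkSession already existed. A quick diagnostic from inside the job:

# confirm the sparkMeasure classes are visible to the driver JVM
print(spark.version)
print(spark.conf.get("spark.jars.packages", "<not set>"))
# py4j returns a JavaPackage object here (instead of a JavaClass) when the
# class cannot be found on the classpath
print(spark.sparkContext._jvm.ch.cern.sparkmeasure.TaskMetrics)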

LucaCanali commented 7 months ago

I confirm that sparkMeasure works OK for me with sparkmeasure==0.23.0, pyspark/spark==3.3.x, and java==11. BTW, there's no need to run it in a notebook; running from the PySpark CLI is even easier for testing.