LucaCanali / sparkMeasure

This is the development repository for sparkMeasure, a tool and library designed for efficient analysis and troubleshooting of Apache Spark jobs. It focuses on easing the collection and examination of Spark metrics, making it a practical choice for both developers and data engineers.
Apache License 2.0

bytesRead not populated in some cases #60

Closed — snowch closed this issue 2 months ago

snowch commented 2 months ago

I'm hoping to use sparkmeasure to compare the performance of reading data from S3 versus reading it from an optimised database.

Reading CSV from S3 works as expected:

from sparkmeasure import StageMetrics
stagemetrics = StageMetrics(spark)

stagemetrics.begin()
spark.read.csv(CUSTOMERS_S3_URL, header=False, inferSchema=True).filter("_c3 = 'FEMALE'").show()
stagemetrics.end()
# stagemetrics.print_report()

metrics = stagemetrics.aggregate_stagemetrics()
print(f"""
{metrics['recordsRead'] = }
{metrics['bytesRead'] = }
""")

This reports:

metrics['recordsRead'] = 19022
metrics['bytesRead'] = 1107894

The database query, however:

from sparkmeasure import StageMetrics
stagemetrics = StageMetrics(spark)

stagemetrics.begin()
spark.sql(f"""
    SELECT * 
    FROM ndb.`{DATABASE_NAME}`.`{DATABASE_SCHEMA}`.`{CUSTOMERS_TABLENAME}`
    WHERE cust_gender = 'FEMALE'
""").show()
stagemetrics.end()
# stagemetrics.print_report()

metrics = stagemetrics.aggregate_stagemetrics()
print(f"""
{metrics['recordsRead'] = }
{metrics['bytesRead'] = }
""")

For some reason this reports zero bytes read:

metrics['recordsRead'] = 9334
metrics['bytesRead'] = 0

Should I be looking into the DB plugin code to debug this? https://github.com/vast-data/vast-db-connectors
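One way to confirm whether the zero originates in Spark's own task metrics (rather than in sparkMeasure) is to query Spark's monitoring REST API, which exposes `inputBytes` and `inputRecords` per stage at `/api/v1/applications/<app-id>/stages`. Below is a minimal sketch of such a cross-check; the driver-UI URL and app id are placeholders you would need to adapt to your cluster:

```python
import json
from urllib.request import urlopen

def sum_input_metrics(stages):
    """Sum inputBytes/inputRecords over the list of stage dicts
    returned by Spark's monitoring REST API."""
    total_bytes = sum(s.get("inputBytes", 0) for s in stages)
    total_records = sum(s.get("inputRecords", 0) for s in stages)
    return total_bytes, total_records

# Hypothetical endpoint -- substitute your driver host/port and app id:
# stages = json.load(
#     urlopen("http://localhost:4040/api/v1/applications/<app-id>/stages"))
# print(sum_input_metrics(stages))
```

If the REST API also shows `inputBytes = 0` for the DB-scan stages, the metric is never populated at the task level, which would point at the connector rather than at sparkMeasure.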

LucaCanali commented 2 months ago

Hi, thank you for sharing this interesting use case. sparkMeasure only captures the metrics that Spark's listener interface provides: if the data source does not populate bytesRead in the task metrics, sparkMeasure will report zero. As you suggested, it is worth investigating whether this can be resolved at the DB connector level.
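Since sparkMeasure aggregates whatever the listener reports, a quick way to narrow the problem down is a per-stage check: any stage with a positive `recordsRead` but zero `bytesRead` is a candidate for a source that does not fill in the input-size metric. A small sketch of that filter, using made-up rows whose keys mirror the aggregate metric names used earlier in this thread:

```python
def stages_missing_bytes(stage_rows):
    """Return stages that read records but report zero bytesRead,
    i.e. likely cases of a source not populating input-size metrics."""
    return [s for s in stage_rows
            if s.get("recordsRead", 0) > 0 and s.get("bytesRead", 0) == 0]

# Illustrative rows only (not real output):
rows = [
    {"stageId": 1, "recordsRead": 19022, "bytesRead": 1107894},  # CSV scan
    {"stageId": 2, "recordsRead": 9334, "bytesRead": 0},         # DB scan
]
print(stages_missing_bytes(rows))  # flags the DB-scan stage
```

The same check could be run against sparkMeasure's per-stage metrics output to pinpoint exactly which stages of the DB query lose the byte counts.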

snowch commented 2 months ago

Thanks!