getsentry / sentry-python

The official Python SDK for Sentry.io
https://sentry.io/for/python/
MIT License

Failed to initialize SparkIntegration #3161

Closed seyoon-lim closed 2 months ago

seyoon-lim commented 3 months ago

How do you use Sentry?

Sentry Saas (sentry.io)

Version

2.5.1

Steps to Reproduce

Hello,

I've encountered an issue when using SparkIntegration with my PySpark application. I was following the guide in the Spark Driver Integration documentation and hit the following AttributeError:

sc._jsc.sc().addSparkListener(listener)
E   AttributeError: 'SparkContext' object has no attribute '_jsc'

Upon investigation, the issue appears to stem from the code at sentry-python/spark_driver.py#L50. The sc._jsc attribute is only set during SparkContext initialization, as seen in apache/spark/pyspark/context.py#L296, so it does not exist yet when the patched init runs.

Consequently, _start_sentry_listener and _set_app_properties (referenced at spark_driver.py#L62-L63) should be invoked after spark_context_init has executed.
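The ordering problem can be sketched without pyspark. This is a minimal, illustrative reproduction (the class and attribute names mirror the report but are not the actual sentry-sdk implementation): the patched `__init__` must call the original `__init__` first, so that attributes like `_jsc` exist before any listener registration runs.

```python
# Illustrative sketch of the proposed fix: run the real __init__ first,
# then do the integration's post-init work. FakeSparkContext stands in
# for pyspark's SparkContext, which only assigns _jsc inside _do_init.

class FakeSparkContext:
    def __init__(self, conf=None):
        # In real pyspark, _jsc is assigned during initialization;
        # before __init__ completes, the attribute does not exist.
        self._jsc = object()

original_init = FakeSparkContext.__init__

def patched_init(self, *args, **kwargs):
    original_init(self, *args, **kwargs)  # initialize first
    # Only now is it safe to touch self._jsc (e.g. to add a listener).
    assert self._jsc is not None

FakeSparkContext.__init__ = patched_init
sc = FakeSparkContext()
```

Registering the listener before calling `original_init` reproduces the reported AttributeError, since `self._jsc` has not been assigned yet.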

I have tested this modification using both local and yarn Spark masters, and the fixed version in my repo appears to function correctly.

This is my test code:

from pyspark import SparkContext

from sentry_sdk.integrations.spark import SparkIntegration


def test_initialize_spark_integration(sentry_init):
    # fails with the code: https://github.com/getsentry/sentry-python/blob/2.5.1/sentry_sdk/integrations/spark/spark_driver.py#L53
    # succeeds with the code: https://github.com/seyoon-lim/sentry-python/blob/fix-spark-driver-integration/sentry_sdk/integrations/spark/spark_driver.py#L53
    sentry_init(integrations=[SparkIntegration()])
    SparkContext.getOrCreate()

Looking forward to your feedback and suggestions for addressing this issue.

Thank you!

Expected Result

from pyspark.sql import SparkSession
import sentry_sdk
from sentry_sdk.integrations.spark import SparkIntegration

if __name__ == "__main__":
    sentry_sdk.init(
        dsn=matrix_dsn,
        integrations=[SparkIntegration()],
    )

    spark = SparkSession.builder.getOrCreate()
    ...

Actual Result

Traceback (most recent call last):
  File "/Users/kakao/Desktop/shaun/workplace/my-repos/du-batch/entrypoint.py", line 17, in <module>
    spark = SparkSession.builder.getOrCreate()
  File "/Users/kakao/Desktop/shaun/workplace/my-repos/du-batch/venv/lib/python3.9/site-packages/pyspark/sql/session.py", line 477, in getOrCreate
    sc = SparkContext.getOrCreate(sparkConf)
  File "/Users/kakao/Desktop/shaun/workplace/my-repos/du-batch/venv/lib/python3.9/site-packages/pyspark/context.py", line 514, in getOrCreate
    SparkContext(conf=conf or SparkConf())
  File "/Users/kakao/Desktop/shaun/workplace/my-repos/du-batch/venv/lib/python3.9/site-packages/pyspark/context.py", line 201, in __init__
    self._do_init(
  File "/Users/kakao/Desktop/shaun/workplace/my-repos/du-batch/venv/lib/python3.9/site-packages/sentry_sdk/utils.py", line 1710, in runner
    return sentry_patched_function(*args, **kwargs)
  File "/Users/kakao/Desktop/shaun/workplace/my-repos/du-batch/venv/lib/python3.9/site-packages/sentry_sdk/integrations/spark/spark_driver.py", line 69, in _sentry_patched_spark_context_init
    _start_sentry_listener(self)
  File "/Users/kakao/Desktop/shaun/workplace/my-repos/du-batch/venv/lib/python3.9/site-packages/sentry_sdk/integrations/spark/spark_driver.py", line 55, in _start_sentry_listener
    sc._jsc.sc().addSparkListener(listener)
AttributeError: 'SparkContext' object has no attribute '_jsc'
sentrivana commented 3 months ago

Thanks for all the research you put into this @seyoon-lim and for the PR! We will take a look.