awslabs / aws-glue-libs

AWS Glue Libraries are additions and enhancements to Spark for ETL operations.

'SparkSession' object has no attribute 'serializer' #124

Closed goldengrisha closed 2 years ago

goldengrisha commented 2 years ago

Please help, I use aws-glue-libs:glue_libs_3.0.0_image_01 from docker hub, and all the time I get errors like:

An error was encountered:
'SparkSession' object has no attribute 'serializer'
Traceback (most recent call last):
  File "/home/glue_user/aws-glue-libs/PyGlue.zip/awsglue/transforms/transform.py", line 24, in apply
    return transform(*args, **kwargs)
  File "/home/glue_user/aws-glue-libs/PyGlue.zip/awsglue/transforms/dynamicframe_filter.py", line 18, in __call__
    return frame.filter(f, transformation_ctx, info, stageThreshold, totalThreshold)
  File "/home/glue_user/aws-glue-libs/PyGlue.zip/awsglue/dynamicframe.py", line 94, in filter
    return self.mapPartitions(func, True, transformation_ctx, info, stageThreshold, totalThreshold)
  File "/home/glue_user/aws-glue-libs/PyGlue.zip/awsglue/dynamicframe.py", line 99, in mapPartitions
    return self.mapPartitionsWithIndex(func, preservesPartitioning, transformation_ctx, info, stageThreshold, totalThreshold)
  File "/home/glue_user/aws-glue-libs/PyGlue.zip/awsglue/dynamicframe.py", line 122, in mapPartitionsWithIndex
    PipelinedRDD(self._rdd, f, preservesPartitioning)._jrdd, self.glue_ctx._ssql_ctx, transformation_ctx, self.name,
  File "/home/glue_user/spark/python/pyspark/rdd.py", line 2929, in __init__
    self._jrdd_deserializer = self.ctx.serializer
AttributeError: 'SparkSession' object has no attribute 'serializer'
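The traceback points at the root cause: `PipelinedRDD.__init__` reads `self.ctx.serializer`, an attribute that `SparkContext` defines but `SparkSession` does not, so the error surfaces whenever a `SparkSession` is passed where a `SparkContext` is expected. A minimal sketch of that failure mode, with no Spark dependency (all class and function names here are hypothetical stand-ins, not real PySpark types):

```python
# Dependency-free sketch of the AttributeError in the traceback above.
# All names are hypothetical stand-ins for the real PySpark classes.

class FakeSparkContext:
    serializer = "BatchedSerializer"  # stand-in: SparkContext has `serializer`

class FakeSparkSession:
    def __init__(self):
        # A real SparkSession wraps a SparkContext but does not itself
        # expose a `serializer` attribute.
        self.sparkContext = FakeSparkContext()

def pipelined_rdd_init(ctx):
    # Mirrors pyspark/rdd.py: self._jrdd_deserializer = self.ctx.serializer
    return ctx.serializer

session = FakeSparkSession()
print(pipelined_rdd_init(session.sparkContext))  # OK: the context has `serializer`
try:
    pipelined_rdd_init(session)  # reproduces the error: session lacks `serializer`
except AttributeError as exc:
    print(exc)
```

If the wrong object is stored at construction time, the failure only appears later, when a transform first builds a `PipelinedRDD`, which is why the traceback starts inside `filter` rather than at `GlueContext(...)`.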
moomindani commented 2 years ago

We apologize for the delay. Could you please share the reproduction steps? I was not able to reproduce this issue.

goldengrisha commented 2 years ago

Hello, all is OK, it was resolved. Thank you.

GroovyDan commented 2 years ago

@goldengrisha I am running into this same issue, what did you do to resolve it?

GroovyDan commented 2 years ago

Never mind, I figured it out. It had to do with how I was creating the Glue Context:

from awsglue.context import GlueContext
from pyspark.sql.session import SparkSession
from pyspark.sql.types import (
    StructField,
    StructType,
    StringType,
)

def _add_column(rec):
    rec["pk"] = "1"
    return rec

def test_serializer_error():
    # WRONG WAY! Passing the SparkSession itself will throw the error
    # glue_context = GlueContext(SparkSession.builder.getOrCreate())

    # CORRECT WAY! Pass in the underlying SparkContext
    glue_context = GlueContext(SparkSession.builder.getOrCreate().sparkContext)

    dyf = glue_context.create_dynamic_frame.from_rdd(
        data=[("test",)],
        name="DynamicFrame",
        schema=StructType(
            [
                StructField("test", StringType(), True),
            ]
        ),
    )
    mapped_dyf = dyf.map(f=_add_column)

The documentation for using the Docker image found here shows examples of creating the GlueContext with the first method. This is confusing, since that approach causes exactly this error.
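For reference, a sketch of the conventional initialization order, passing a `SparkContext` into `GlueContext` and getting the `SparkSession` back from it. This requires the awsglue libraries, so it only runs inside the Glue Docker image or an actual Glue job, and is an illustrative sketch rather than the one official pattern:

```python
# Sketch of GlueContext initialization (runs only where awsglue is available,
# e.g. inside the glue_libs Docker image or a Glue job).
from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()      # GlueContext expects a SparkContext...
glue_context = GlueContext(sc)
spark = glue_context.spark_session   # ...and exposes the SparkSession itself
```

Initializing in this order avoids ever handing a `SparkSession` to `GlueContext`, which is the mistake that triggers the serializer error.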