awslabs / aws-glue-libs

AWS Glue Libraries are additions and enhancements to Spark for ETL operations.

DynamicFrame rename_field results in data loss when renaming the only child of a struct #125

Open mwoods-familiaris opened 2 years ago

mwoods-familiaris commented 2 years ago

Renaming a child field within a struct only works when the struct has at least two child fields. Renaming the sole child field of a struct drops the entire record, and the DynamicFrame's schema is lost along with it.

Steps to reproduce:

from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext
from pyspark.sql.types import StructType, StructField, StringType

glue_context = GlueContext(SparkContext())

schema = StructType([
    StructField(
        'struct_outer',
        StructType([
            StructField('struct_inner', StringType(), True),
        ]),
    ),
])
df = glue_context.spark_session.createDataFrame(data=[(('value',),)], schema=schema)
dynamic_frame = DynamicFrame.fromDF(df, glue_context, "test")
print('BEFORE RENAME:')
dynamic_frame.printSchema()
dynamic_frame.show()
print('count: {}'.format(dynamic_frame.count()))
dynamic_frame = dynamic_frame.rename_field('struct_outer.struct_inner', 'struct_outer.struct_inner_new')
print('AFTER RENAME:')
dynamic_frame.printSchema()
dynamic_frame.show()
print('count: {}'.format(dynamic_frame.count()))

...results in...

BEFORE RENAME:
root
|-- struct_outer: struct
|    |-- struct_inner: string

{"struct_outer": {"struct_inner": "value"}}
count: 1
AFTER RENAME:
root

count: 0
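
For reference, the result I would expect from the single-child rename can be sketched on a plain dict. This is only an illustration of the intended semantics, not how Glue implements rename_field; the helper name rename_nested_field is hypothetical and it works on ordinary Python dicts rather than DynamicFrames.

```python
import copy


def rename_nested_field(record, old_path, new_path):
    """Return a copy of `record` with the leaf at old_path moved to new_path.

    Hypothetical helper for illustration only. Paths are dot-separated,
    mirroring the arguments passed to rename_field above.
    """
    result = copy.deepcopy(record)
    old_keys = old_path.split('.')
    new_keys = new_path.split('.')

    # Walk to the parent struct of the old leaf and remove the leaf.
    parent = result
    for key in old_keys[:-1]:
        parent = parent[key]
    value = parent.pop(old_keys[-1])

    # Walk to the parent struct of the new leaf, creating structs if needed.
    parent = result
    for key in new_keys[:-1]:
        parent = parent.setdefault(key, {})
    parent[new_keys[-1]] = value
    return result


record = {"struct_outer": {"struct_inner": "value"}}
renamed = rename_nested_field(
    record, 'struct_outer.struct_inner', 'struct_outer.struct_inner_new')
print(renamed)
```

The record survives and only the child key changes, which is what the two-field case further down actually produces.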

Compare to successful rename when the struct contains more than 1 child field:

from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext
from pyspark.sql.types import StructType, StructField, StringType

glue_context = GlueContext(SparkContext())

schema = StructType([
    StructField(
        'struct_outer',
        StructType([
            StructField('struct_inner', StringType(), True),
            StructField('struct_inner2', StringType(), True),
        ]),
    ),
])
df = glue_context.spark_session.createDataFrame(data=[(('value', 'value'),)], schema=schema)
dynamic_frame = DynamicFrame.fromDF(df, glue_context, "test")
print('BEFORE RENAME:')
dynamic_frame.printSchema()
dynamic_frame.show()
print('count: {}'.format(dynamic_frame.count()))
dynamic_frame = dynamic_frame.rename_field('struct_outer.struct_inner', 'struct_outer.struct_inner_new')
print('AFTER RENAME:')
dynamic_frame.printSchema()
dynamic_frame.show()
print('count: {}'.format(dynamic_frame.count()))

...results in...

BEFORE RENAME:
root
|-- struct_outer: struct
|    |-- struct_inner: string
|    |-- struct_inner2: string

{"struct_outer": {"struct_inner": "value", "struct_inner2": "value"}}
count: 1
AFTER RENAME:
root
|-- struct_outer: struct
|    |-- struct_inner2: string
|    |-- struct_inner_new: string

{"struct_outer": {"struct_inner2": "value", "struct_inner_new": "value"}}
count: 1

whimzyLive commented 1 year ago

Any updates on this?

ballinas commented 5 months ago

In my case, I wrapped both the old and the new name in backticks (`):

newDyF = oldDyF.rename_field("`this.old.name`", "`this.new.name`")

More details are in the documentation:

https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-dynamic-frame.html#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-rename_field
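
If you want to try the backtick suggestion without hand-writing the quotes, a tiny helper like the one below builds the quoted names. The function name backtick_quote is made up for this sketch. As I read the linked documentation, backticks are meant for names that themselves contain dots, so it is an assumption, not something confirmed in this thread, that this avoids the data loss in the single-child struct case.

```python
def backtick_quote(field_name):
    """Wrap a field name in backticks unless it is already wrapped.

    Hypothetical helper for illustration; not part of awsglue.
    """
    already_quoted = (
        len(field_name) >= 2
        and field_name.startswith('`')
        and field_name.endswith('`')
    )
    if already_quoted:
        return field_name
    return '`{}`'.format(field_name)


old_name = backtick_quote('this.old.name')
new_name = backtick_quote('this.new.name')
print(old_name, new_name)

# With a real DynamicFrame (needs the AWS Glue runtime), the call would be:
# newDyF = oldDyF.rename_field(old_name, new_name)
```

The helper only does string formatting, so it can be checked locally before running a Glue job.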