aws-samples / aws-big-data-blog-dmscdc-walkthrough

Performance improvement #2

Open jcunhafonte opened 4 years ago

jcunhafonte commented 4 years ago

This is not a bug report, more a suggested adjustment to the script.

Would the script perform better if it read the Parquet files through the Glue API:

input = glueContext.create_dynamic_frame_from_options("s3", connection_options={"path": path}, format="parquet", transformation_ctx="input").toDF().withColumn("Op", lit("I"))

instead of reading them directly with Spark, as it does now?

input = spark.read.parquet(path).withColumn("Op", lit("I"))

Thank you.
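
One way to answer this empirically would be to time both reads on the same data. The following is only a minimal sketch: `time_reader` and `compare_readers` are hypothetical helper names, not part of the walkthrough script, and the commented usage assumes that `spark`, `glueContext`, and `path` already exist in the Glue job. Spark reads are lazy, so each reader has to trigger an action such as `.count()` for the timing to mean anything. The usage also passes `"paths": [path]`, on the assumption that the Glue S3 source expects a list under that key for reads.

```python
import time
from statistics import median


def time_reader(read_fn, runs=3):
    """Return the median wall-clock seconds over `runs` calls of read_fn."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        read_fn()  # read_fn must force evaluation itself, e.g. by calling .count()
        timings.append(time.perf_counter() - start)
    return median(timings)


def compare_readers(readers, runs=3):
    """Time each named reader and return {name: median_seconds}."""
    return {name: time_reader(fn, runs) for name, fn in readers.items()}


# Hypothetical usage inside a Glue job (spark, glueContext and path are assumed):
#
# results = compare_readers({
#     "spark": lambda: spark.read.parquet(path).count(),
#     "glue": lambda: glueContext.create_dynamic_frame_from_options(
#         "s3", connection_options={"paths": [path]}, format="parquet"
#     ).toDF().count(),
# })
# print(results)
```

Caching on the first run can skew the later ones, so results from a harness like this are indicative at best.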