databricks / spark-redshift

Redshift data source for Apache Spark
Apache License 2.0

Writing to Redshift when S3 bucket policy requires encryption #321

Open · data-compute-geek opened this issue 7 years ago

data-compute-geek commented 7 years ago

I get the following error when I try to write a DataFrame to a Redshift table in PySpark using df.write:

.... Caused by: com.amazon.support.exceptions.ErrorException: Amazon Invalid operation: Manifest file is not in correct json format ....

My EMR cluster is set up with a client-side encryption materials provider, and it looks like the manifest file is automatically encrypted. Is there any way around this issue?
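
For reference, here is a minimal sketch of the kind of write that hits this error. The format name and option names come from this connector; the JDBC URL, table name, tempdir, and IAM role below are placeholders:

    # Minimal sketch of the failing write path; all connection values are placeholders.
    (df.write
        .format("com.databricks.spark.redshift")
        .option("url", "jdbc:redshift://host:5439/db?user=user&password=pass")
        .option("dbtable", "my_table")
        .option("tempdir", "s3://my-bucket/temp")   # data files + JSON manifest are staged here
        .option("aws_iam_role", "arn:aws:iam::123456789012:role/my-redshift-role")
        .mode("append")
        .save())

The connector stages the data and a small JSON manifest under tempdir and then issues a COPY pointing at that manifest, which is presumably why a client-side-encrypted manifest surfaces as "Manifest file is not in correct json format".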

Thanks a lot.

svadakattu commented 7 years ago

I am also having the same issue when I try to unload a Redshift table using the Redshift connector.

error: Unable to upload manifest file - S3ServiceException:Access Denied,Status 403,Error AccessDenied,Rid BEC36007FAA4C22A,ExtRid XDvccdTuoBmu79AkG9IXc3QOjt3IV0LNsFlpdc6vyKWHD82fnGX95rAEhwIP8ape4Q58ENj4KHs=,CanRetry 1

I am using the below version of the connector:

        <dependency>
            <groupId>com.databricks</groupId>
            <artifactId>spark-redshift_2.11</artifactId>
            <version>2.0.1</version>
        </dependency>

Simple test code below:


    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("RS Connect").master(masterURL).getOrCreate()

    // S3 server-side encryption settings for the temp directory
    val hadoopConf = spark.sparkContext.hadoopConfiguration
    hadoopConf.set("fs.s3.enableServerSideEncryption", "true")
    hadoopConf.set("fs.s3a.server-side-encryption-algorithm", "AES256")

    // Read from Redshift; the connector first unloads the query results to tempdir
    val testRedshift = spark.sqlContext.read
      .format("com.databricks.spark.redshift")
      .option("url", jdbcURL)
      .option("tempdir", s3URL)
      .option("query", "select * from svv_table_info")
      .load()

    testRedshift.createOrReplaceTempView("testRs")
    spark.sql("select * from testRs").show()

NeilJonkers commented 7 years ago

@svadakattu, can you please check whether issue #309 applies to your use case?

Thank you

JoshRosen commented 7 years ago

There are two separate issues here: the write path (this issue, where the client-side-encrypted manifest can't be parsed by Redshift's COPY) and the read path (Redshift's UNLOAD writing to an encrypted bucket, tracked in #309).

I have some thoughts on fixing the read path which I'll post in #309 to keep the two discussions cleanly separated.
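
For background on the write path: the manifest the connector uploads is a small JSON object that Redshift's COPY fetches directly from S3 (server side), so an object encrypted with an EMRFS client-side materials provider is ciphertext as far as COPY is concerned. A sketch of the expected manifest shape, following AWS's documented COPY manifest format (bucket and keys are placeholders):

    import json

    # Shape of a COPY manifest; bucket and object keys are placeholders.
    manifest = {
        "entries": [
            {"url": "s3://my-bucket/temp/part-00000", "mandatory": True},
            {"url": "s3://my-bucket/temp/part-00001", "mandatory": True},
        ]
    }

    # COPY reads this object itself, so it has to be stored as plain JSON.
    print(json.dumps(manifest, indent=2))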

hasamro commented 6 years ago

Is there a workaround for this issue in the meantime? I am seeing the same error.

jackwellsxyz commented 4 years ago

Hi @JoshRosen - do you know if this bug has been fixed? I'm getting the following error in Spark:

-----------------------------------------------
  error:  S3ServiceException:Access Denied,Status 403,Error AccessDenied,Rid 14221573B79D6B82,ExtRid Lyd7DMBYKe/NyAhsoyeM+XmUo0mTvXX7YU4LofBZWewjt5YFE0zLNc2jmkt7fkMlJamndeaDj00=,CanRetry 1
  code:      8001
  context:   Failed to initialize S3 output stream. S3 path: s3://<bucket>/temp/4618d0ce-615b-4930-b6d8-3173501f0883/12812_part_00
  query:     4936961
  location:  s3_text_unloader.cpp:364
  process:   padbmaster [pid=29738]
  -----------------------------------------------;

Here's how I'm adding options and doing a dataframe read (using Databricks):

sc._jsc.hadoopConfiguration().set("fs.s3a.server-side-encryption-algorithm", "SSE-KMS")
sc._jsc.hadoopConfiguration().set("fs.s3a.enableServerSideEncryption", "true")
sc._jsc.hadoopConfiguration().set("fs.s3a.server-side-encryption.key", "<kms key hex value>")

df = (spark.read
      .format("com.databricks.spark.redshift")
      .option("url", f"jdbc:redshift://database:1234/schema?user=username&password=password&ssl=true")
      .option("query", "select <query here>'")
      .option("tempdir", "s3a://<bucket>/temp")
      .option("aws_iam_role", "<iam arn>")
      .load()
      .limit(20))

I loaded the RedshiftJDBC42_no_awssdk_1_2_45_1069.jar library into the Spark cluster. I'd note that I get different errors if I leave off ssl=true in the JDBC connection string, mistype the IAM role, or mistype the tempdir, so I'm pretty sure it's an issue with the SSE-KMS encryption. I verified that this query works through a database query console.

mithun12000 commented 4 years ago

Hi @JoshRosen ,

Is there any possibility of passing the KMS key as an option parameter? That key could then be passed through to the UNLOAD query, since it is only required when writing to the S3 location.

So the UNLOAD query could look like:

UNLOAD ('[query]') TO '[S3 path]' IAM_ROLE '[IAM]' KMS_KEY_ID '[Key]' ENCRYPTED

Also, I see that the Hadoop configuration is only consulted for AWS credentials, so the KMS key would not be taken into account when the UNLOAD query is prepared.
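
Until something like that is exposed by the connector, one possible stopgap (a sketch only, not a connector feature) is to issue the UNLOAD above yourself over a plain database connection and then read the unloaded files with Spark. psycopg2 is assumed to be available here, and the host, credentials, S3 path, IAM role, and KMS key id are placeholders:

    # Manual UNLOAD workaround sketch; all connection details and paths are placeholders.
    import psycopg2

    unload_sql = """
        UNLOAD ('select * from my_table')
        TO 's3://my-bucket/unload/my_table_'
        IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
        KMS_KEY_ID '<kms-key-id>'
        ENCRYPTED
        DELIMITER '|' ALLOWOVERWRITE
    """

    conn = psycopg2.connect(
        host="my-cluster.example.redshift.amazonaws.com",
        port=5439, dbname="db", user="user", password="pass", sslmode="require",
    )
    conn.autocommit = True
    with conn.cursor() as cur:
        cur.execute(unload_sql)  # Redshift writes KMS-encrypted files to S3
    conn.close()

    # Spark can then read the unloaded files directly (given kms:Decrypt on the key).
    df = spark.read.option("delimiter", "|").csv("s3a://my-bucket/unload/")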

krish-at-github commented 3 years ago

Hi, we are trying to use the Databricks Spark connector for Redshift on EMR. On our cluster we are required to encrypt files on S3 using a KMS key. Due to this policy, the read operation in Spark fails when the Redshift driver tries to unload the data to the S3 bucket. I could not find any way to pass a KMS key to the Databricks Spark connector. Please let me know if there is a plan to add an option to pass the KMS key and encryption settings for the load operation in Spark.