databricks / spark-redshift

Redshift data source for Apache Spark
Apache License 2.0

Getting error while writing data to Redshift. S3 bucket lifecycle configuration, java.lang.IllegalArgumentException: Cannot create enum from ap-south-1 value! #368

Open ramanathanramaiyah opened 7 years ago

ramanathanramaiyah commented 7 years ago

The Redshift instance and the S3 bucket are both in ap-south-1. I'm simply reading a file from S3 and writing it to Redshift. Here is the code:

    // Create Spark context sc
    sc.hadoopConfiguration.set("fs.s3a.access.key", "<<>>")
    sc.hadoopConfiguration.set("fs.s3a.secret.key", "<<>>")

    val df = <>
    df.write
      .format("com.databricks.spark.redshift")
      .option("url", "?user=<<>>&password=<<>>")
      .option("dbtable", "public.xxx")
      .option("tempdir", "s3a://bucketname/folder")
      .mode(SaveMode.Append)
      .save()

SBT dependency:

    scalaVersion := "2.10.5"
    libraryDependencies += "com.databricks" %% "spark-redshift" % "1.1.0"
    libraryDependencies += "com.amazonaws" % "aws-java-sdk-core" % "1.11.210"
    libraryDependencies += "com.amazonaws" % "aws-java-sdk-s3" % "1.11.210"

Adding Redshift JDBC jar as --jars option in spark-submit.

Error:

    WARN Utils$: An error occurred while trying to read the S3 bucket lifecycle configuration
    java.lang.IllegalArgumentException: Cannot create enum from ap-south-1 value!
        at com.amazonaws.regions.Regions.fromName(Regions.java:71)
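
For context, ap-south-1 was added to the AWS SDK's Regions enum only in relatively recent SDK releases, so an older aws-java-sdk on the classpath cannot parse that region name. A minimal sketch of the usual s3a-side configuration for newer regions (assuming hadoop-aws/s3a is in use; fs.s3a.endpoint is a standard hadoop-aws property and the value below is the public S3 endpoint for Mumbai):

    // Sketch only: this lets s3a talk to the ap-south-1 endpoint explicitly.
    // The enum error itself only goes away once the AWS SDK actually loaded at
    // runtime is recent enough to know the ap-south-1 region.
    sc.hadoopConfiguration.set("fs.s3a.endpoint", "s3.ap-south-1.amazonaws.com")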

vetional commented 7 years ago

Getting the same thing with

    libraryDependencies += "com.amazonaws" % "aws-java-sdk-core" % "1.11.224"

as well.

Any progress on this?

vnktsh commented 7 years ago

This mostly occurs because of a dependency mismatch: the hadoop-aws and aws-java-sdk versions on the classpath have to be compatible with each other.
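
As a rough sketch of what a consistent pairing looks like (the aws-java-sdk 1.7.4 mentioned below is what hadoop-aws 2.7.x itself declares as a compile dependency; verify the exact pairing for your Hadoop version on mvnrepository.com):

    // Let hadoop-aws pull in the aws-java-sdk it was built against (1.7.4 for
    // the 2.7.x line) rather than forcing a much newer SDK alongside it.
    libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "2.7.3"
    // Avoid also declaring aws-java-sdk-core / aws-java-sdk-s3 from a different
    // SDK generation; mixing generations is what typically causes these conflicts.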

vetional commented 7 years ago

@vnktsh Where can I find which version is compatible with which one? Shouldn't the latest builds of both be compatible?

I have the following build.sbt

version := "1.0"

scalaVersion := "2.11.8"

// https://mvnrepository.com/artifact/org.apache.spark/spark-core
libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "2.2.0"

// https://mvnrepository.com/artifact/org.apache.spark/spark-sql
libraryDependencies += "org.apache.spark" % "spark-sql_2.11" % "2.2.0"

// https://mvnrepository.com/artifact/org.apache.spark/spark-mllib
libraryDependencies += "org.apache.spark" % "spark-mllib_2.11" % "2.2.0" % "provided"

// https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws
libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "2.8.2"

// https://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk-redshift
libraryDependencies += "com.amazonaws" % "aws-java-sdk-redshift" % "1.11.225"

// https://mvnrepository.com/artifact/com.databricks/spark-avro_2.11
libraryDependencies += "com.databricks" % "spark-avro_2.11" % "4.0.0"

// https://mvnrepository.com/artifact/com.databricks/spark-redshift_2.11
libraryDependencies += "com.databricks" % "spark-redshift_2.11" % "3.0.0-preview1"

// https://mvnrepository.com/artifact/com.eclipsesource.minimal-json/minimal-json
libraryDependencies += "com.eclipsesource.minimal-json" % "minimal-json" % "0.9.4"

// https://mvnrepository.com/artifact/org.mongodb.spark/mongo-spark-connector
libraryDependencies += "org.mongodb.spark" % "mongo-spark-connector_2.11" % "2.2.0"

// https://mvnrepository.com/artifact/org.mongodb/mongo-java-driver
libraryDependencies += "org.mongodb" % "mongo-java-driver" % "3.4.3"

// https://mvnrepository.com/artifact/org.mongodb.mongo-hadoop/mongo-hadoop-spark
libraryDependencies += "org.mongodb.mongo-hadoop" % "mongo-hadoop-spark" % "2.0.2"
vnktsh commented 7 years ago

@vetional: Try with Hadoop 2.7.3. Don't include aws-sdk-core explicitly; hadoop-aws already pulls it in as a compile dependency. Use the following Maven template (including the exclusions) and adapt it for your sbt.

I would start by minimizing the build until the problem goes away: remove the Mongo-related dependencies, and include the Redshift JDBC driver either as a jar or as a dependency in sbt.

TIP: Always check the mvnrepository.com page for each artifact to see its compile dependencies, and exclude duplicate versions if they conflict with other dependencies in your sbt/pom.xml.

  <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-aws</artifactId>
      <version>2.7.3</version>
      <exclusions>
          <exclusion>
              <groupId>org.apache.hadoop</groupId>
              <artifactId>hadoop-common</artifactId>
          </exclusion>
          <exclusion>
              <groupId>com.fasterxml.jackson.core</groupId>
              <artifactId>jackson-databind</artifactId>
          </exclusion>
          <exclusion>
              <groupId>com.fasterxml.jackson.core</groupId>
              <artifactId>jackson-core</artifactId>
          </exclusion>
          <exclusion>
              <groupId>com.fasterxml.jackson.core</groupId>
              <artifactId>jackson-annotations</artifactId>
          </exclusion>
          <exclusion>
              <groupId>javax.servlet</groupId>
              <artifactId>servlet-api</artifactId>
          </exclusion>
          <exclusion>
              <groupId>javax.servlet.jsp</groupId>
              <artifactId>jsp-api</artifactId>
          </exclusion>
          <exclusion>
              <groupId>org.mortbay.jetty</groupId>
              <artifactId>servlet-api</artifactId>
          </exclusion>
      </exclusions>
  </dependency>

Your final sbt should look something like this:

    version := "1.0"
    scalaVersion := "2.11.8"

    libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "2.2.0"
    libraryDependencies += "org.apache.spark" % "spark-sql_2.11" % "2.2.0"
    libraryDependencies += "org.apache.spark" % "spark-mllib_2.11" % "2.2.0" % "provided"
    // hadoop-aws 2.7.3, with all the exclusions from the template above
    libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "2.7.3"
    libraryDependencies += "com.databricks" % "spark-avro_2.11" % "4.0.0"
    libraryDependencies += "com.databricks" % "spark-redshift_2.11" % "3.0.0-preview1"
    libraryDependencies += "com.eclipsesource.minimal-json" % "minimal-json" % "0.9.4"
    // libraryDependencies += possible dependency for redshift jdbc ...