am0eba-byte opened this issue 1 month ago
Can you please report the exact versions of the libraries in your environment?
I guess you are running on Scala 2.12? In that case you should use the ArangoDB Spark Connector for Scala 2.12: com.arangodb:arangodb-spark-datasource-3.3_2.12.
We have switched over to using the correct connector version com.arangodb:arangodb-spark-datasource-3.3_2.12 as you've suggested, and we are now seeing a slightly different error:
ExceptionErrorMessage failureReason: An error occurred while calling o121.load. org.apache.spark.sql.sources.DataSourceRegister: Provider com.arangodb.spark.DefaultSource not found
Note that the jar file of ArangoDB Spark Datasource does not bundle all its dependencies. Adding this single jar file is not enough. You should let Spark fetch it from Maven along with all its transitive dependencies.
E.g.:
spark = SparkSession.builder \
    .config('spark.jars.packages', 'com.arangodb:arangodb-spark-datasource-3.3_2.12:1.8.0') \
    ...
    .getOrCreate()
Or submit with:
spark-submit \
    ...
    --packages="com.arangodb:arangodb-spark-datasource-3.3_2.12:1.8.0"
We have tried creating the Spark session within Glue as you've described, but we are still seeing the same error.
For reference, this is how we're building the spark session within Glue:
glueContext = GlueContext(sc)
# spark = glueContext.spark_session
spark = SparkSession.builder \
    .appName("ArangoDBPySparkData") \
    .config("spark.jars.packages", f"com.arangodb:arangodb-spark-datasource-3.3_2.12-1.8.0") \
    .getOrCreate()
Perhaps we need to put the dependent jar files within S3, and provide those S3 paths to the Glue job? Where can we find those dependencies and their .jar files?
Thank you for your responsiveness, @rashtao. We've already tried getting support from folks at AWS, but it seems they are stumped on our issue as well.
Please note that there is an error in the configuration value of spark.jars.packages that you reported above: it should be com.arangodb:arangodb-spark-datasource-3.3_2.12:1.8.0 (GAV coordinates) and not com.arangodb:arangodb-spark-datasource-3.3_2.12-1.8.0.
As you can see, this usage scenario is tested in the demo. You might want to investigate further with AWS support why transitive dependencies are not resolved in your AWS Glue application.
If you prefer to work around it by manually adding all the jars of the transitive dependencies, then you should include these (see the sketch after the list for one way to wire them into the job):
com.arangodb:arangodb-java-driver-shaded:7.9.0
com.arangodb:arangodb-spark-commons-3.3_2.12:1.8.0
com.arangodb:arangodb-spark-datasource-3.3_2.12:1.8.0
com.arangodb:jackson-dataformat-velocypack:4.3.0
com.arangodb:velocypack:3.1.0
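All of these are published on Maven Central, so the individual .jar files can be downloaded from there. As a rough sketch (untested in Glue; the bucket name and prefix below are placeholders), you could upload them to S3 and reference them explicitly. Keep in mind that Glue assembles the cluster classpath before your script runs, so the reliable route there is the --extra-jars job parameter carrying the same comma-separated list, rather than setting it from inside the script:

# Hypothetical sketch: referencing the five jars listed above from S3.
# The bucket and prefix are placeholders; each jar is available on Maven
# Central under the com/arangodb/ group.
from pyspark.sql import SparkSession

jars = ",".join([
    "s3://my-bucket/jars/arangodb-java-driver-shaded-7.9.0.jar",
    "s3://my-bucket/jars/arangodb-spark-commons-3.3_2.12-1.8.0.jar",
    "s3://my-bucket/jars/arangodb-spark-datasource-3.3_2.12-1.8.0.jar",
    "s3://my-bucket/jars/jackson-dataformat-velocypack-4.3.0.jar",
    "s3://my-bucket/jars/velocypack-3.1.0.jar",
])

# Outside Glue this is enough; in a Glue job, pass the same comma-separated
# list through the --extra-jars job parameter instead.
spark = SparkSession.builder \
    .appName("ArangoDBPySparkData") \
    .config("spark.jars", jars) \
    .getOrCreate()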
Setup:
Description:
We're trying to set up an ETL pipeline to read a collection from our Arango Oasis instance, and we keep running into the same error. We're performing the ETL job in AWS Glue and connecting to the database through a NAT Gateway. We based our Glue job script off of the python demo arangodb-spark-datasource/demo/python-demo/demo.py, and we're providing the DB credentials via SecretsManager. Here's what our python code looks like:

We believe the error is occurring on .load(), and we're wondering if anyone else has run into the same error when trying to set up a connection within AWS Glue, or if anyone has any tips for us to try:
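Roughly, our read call follows the demo's pattern (a sketch with placeholder values, not our exact script):

# Sketch of the read pattern from demo.py; endpoint, credentials, database,
# and collection names are placeholders.
df = (
    spark.read
    .format("com.arangodb.spark")            # the provider reported missing above
    .option("endpoints", "<oasis-host>:8529")
    .option("ssl.enabled", "true")           # Oasis endpoints require TLS
    .option("database", "<database>")
    .option("user", "<user>")
    .option("password", "<password>")        # fetched from SecretsManager in our job
    .option("table", "<collection>")         # the collection to read
    .load()                                  # this is the call that fails
)
df.printSchema()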