arangodb / arangodb-spark-datasource

ArangoDB Connector for Apache Spark, using the Spark DataSource API
Apache License 2.0

AWS Glue - Arango Oasis Spark Connection unclassified error on read_collection load "An error occurred while calling o120.load. org/apache/spark/sql/arangodb/commons/ArangoDBConf$" #64

Open am0eba-byte opened 1 month ago

am0eba-byte commented 1 month ago

Setup:

ArangoGraph Oasis 3.11 (oneshard model, 3 x 4GB)
AWS Glue 4.0 - Spark 3.3, Scala 2, Python 3
ArangoDB Spark Connector [version 1.7.0](https://mvnrepository.com/artifact/com.arangodb/arangodb-spark-datasource-3.3_2.13-1.7.0.jar)

Description:

We're trying to set up an ETL pipeline that reads a collection from our Arango Oasis instance, and we keep running into the same error. The ETL job runs in AWS Glue, and we connect to the database through a NAT Gateway. We based our Glue job script on the Python demo arangodb-spark-datasource/demo/python-demo/demo.py, and we provide the DB credentials via Secrets Manager. Here's what our Python code looks like:

def read_collection(spark: SparkSession, collection_name: str, base_opts: dict[str, str], schema: StructType) -> pyspark.sql.DataFrame:
    arangodb_datasource_options = combine_dicts([base_opts, {"table": collection_name}])

    return spark.read \
        .format("com.arangodb.spark") \
        .options(**arangodb_datasource_options) \
        .schema(schema) \
        .load()  # fails here
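
(For context, a minimal sketch of what combine_dicts and base_opts might look like here, using the connector's documented options endpoints, database, user, password, and ssl.enabled; the secret name, its JSON layout, and the database name are placeholders, not taken from the original post:)

import json
import boto3

def combine_dicts(dicts: list[dict[str, str]]) -> dict[str, str]:
    # Merge left-to-right; later dicts override earlier keys.
    merged: dict[str, str] = {}
    for d in dicts:
        merged.update(d)
    return merged

# Assumed secret layout: {"endpoints": "...", "user": "...", "password": "..."}
secret = json.loads(
    boto3.client("secretsmanager").get_secret_value(
        SecretId="arango/oasis-credentials"  # placeholder secret name
    )["SecretString"]
)

base_opts = {
    "endpoints": secret["endpoints"],  # e.g. "<deployment>.arangodb.cloud:8529"
    "database": "mydb",                # placeholder database name
    "user": secret["user"],
    "password": secret["password"],
    "ssl.enabled": "true",             # Oasis endpoints are TLS-only
}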

We believe the error is occurring on .load(), and we're wondering if anyone else has run into the same error when trying to set up a connection within AWS Glue, or if anyone has any tips for us to try:

ExceptionErrorMessage failureReason: An error occurred while calling o121.load. org/apache/spark/sql/arangodb/commons/ArangoDBConf$
24/09/30 16:27:46 INFO AmazonHttpClient: Configuring Proxy. Proxy Host: 169.254.76.0 Proxy Port: 8888
24/09/30 16:27:46 INFO ProcessLauncher: Enhance failure reason and emit cloudwatch error metrics.
24/09/30 16:27:46 INFO ProcessLauncher: postprocessing
24/09/30 16:27:46 WARN OOMExceptionHandler: Failed to extract executor id from error message.
24/09/30 16:27:46 ERROR ProcessLauncher: Error from Python:Traceback (most recent call last):
  File "/tmp/ArangoDB_CompGraph_ETL.py", line 99, in <module>
    collection = read_collection(spark, "competency_transitive", arango_options, edges_schema)
  File "/tmp/ArangoDB_CompGraph_ETL.py", line 97, in read_collection
    .load()
  File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 184, in load
    return self._df(self._jreader.load())
  File "/opt/amazon/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in __call__
    return_value = get_return_value(
  File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 190, in deco
    return f(*a, **kw)
  File "/opt/amazon/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/protocol.py", line 326, in get_return_value
    raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o121.load.
: java.lang.NoClassDefFoundError: org/apache/spark/sql/arangodb/commons/ArangoDBConf$
    at com.arangodb.spark.DefaultSource.extractOptions(DefaultSource.scala:16)
    at com.arangodb.spark.DefaultSource.getTable(DefaultSource.scala:38)
    at com.arangodb.spark.DefaultSource.getTable(DefaultSource.scala:31)
    at org.apache.spark.sql.execution.datasources.v2.DataSourceV2Utils$.getTableFromProvider(DataSourceV2Utils.scala:83)
    at org.apache.spark.sql.execution.datasources.v2.DataSourceV2Utils$.loadV2Source(DataSourceV2Utils.scala:132)
    at org.apache.spark.sql.DataFrameReader.$anonfun$load$1(DataFrameReader.scala:209)
    at scala.Option.flatMap(Option.scala:271)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:207)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:171)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
    at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
    at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.arangodb.commons.ArangoDBConf$
    at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
    ... 21 more
rashtao commented 1 month ago

Can you please report the exact versions of the libraries in your environment?

am0eba-byte commented 1 month ago

(comment body was an image attachment and is not captured in this export)
rashtao commented 1 month ago

I guess you are running on Scala 2.12? In that case you should use the ArangoDB Spark connector built for Scala 2.12: com.arangodb:arangodb-spark-datasource-3.3_2.12.
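
(AWS Glue 4.0 runs Spark 3.3 built against Scala 2.12, so the _2.13 artifact linked above would be the wrong build. If in doubt, the runtime's Scala version can be read from PySpark; a minimal sketch that reaches into the JVM via py4j internals:)

# Ask the JVM which Scala version Spark was compiled against.
print(spark.sparkContext._jvm.scala.util.Properties.versionString())
# On Glue 4.0 this should print something like "version 2.12.x"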

am0eba-byte commented 1 month ago

We have switched over to using the correct connector version com.arangodb:arangodb-spark-datasource-3.3_2.12 as you've suggested, and we are now seeing a slightly different error:

ExceptionErrorMessage failureReason: An error occurred while calling o121.load. org.apache.spark.sql.sources.DataSourceRegister: Provider com.arangodb.spark.DefaultSource not found
rashtao commented 1 month ago

Note that the jar file of ArangoDB Spark Datasource does not bundle all its dependencies. Adding this single jar file is not enough. You should let Spark fetch it from Maven along with all its transitive dependencies.

E.g.:

spark = SparkSession.builder \
    .config('spark.jars.packages', 'com.arangodb:arangodb-spark-datasource-3.3_2.12:1.8.0') \
    ...
    .getOrCreate()

Or submit with:

spark-submit \
    ...
    --packages="com.arangodb:arangodb-spark-datasource-3.3_2.12:1.8.0"
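
(In either case, it may be worth verifying afterwards that the coordinates were registered and the jars actually reached the driver classpath. A minimal sketch; the py4j hop into the underlying Scala SparkContext is an implementation detail, not a public API:)

# Confirm the package coordinates were picked up by the session.
print(spark.sparkContext.getConf().get("spark.jars.packages", "<not set>"))

# List the jars distributed to this application (py4j call into the
# Scala SparkContext; prints the Scala Seq's toString).
print(spark.sparkContext._jsc.sc().listJars())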
am0eba-byte commented 1 month ago

We have tried creating the Spark session within Glue as you've described, but we are still seeing the same error.

For reference, this is how we're building the spark session within Glue:

glueContext = GlueContext(sc)
# spark = glueContext.spark_session

spark = SparkSession.builder \
    .appName("ArangoDBPySparkData") \
    .config("spark.jars.packages", f"com.arangodb:arangodb-spark-datasource-3.3_2.12-1.8.0") \
    .getOrCreate()

Perhaps we need to put the dependent jar files within S3, and provide those S3 paths to the Glue job? Where can we find those dependencies and their .jar files?

Thank you for your responsiveness @rashtao - We've already tried getting support from folks at AWS, but it seems they are stumped on our issue as well.
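
(If the S3 route turns out to be necessary: AWS Glue can pick up jars staged in S3 through the --extra-jars default argument, the same field the console labels "Dependent JARs path". A hedged boto3 sketch; the bucket, jar names, job name, and IAM role below are placeholders, and UpdateJob completely replaces the job definition, so Role and Command are restated:)

import boto3

glue = boto3.client("glue")

glue.update_job(
    JobName="ArangoDB_CompGraph_ETL",  # placeholder job name
    JobUpdate={
        "Role": "MyGlueJobRole",       # placeholder IAM role
        "Command": {
            "Name": "glueetl",
            "ScriptLocation": "s3://my-bucket/scripts/ArangoDB_CompGraph_ETL.py",
        },
        "DefaultArguments": {
            # Comma-separated S3 paths to the connector jar and every
            # transitive dependency (file names here are placeholders).
            "--extra-jars": ",".join([
                "s3://my-bucket/jars/arangodb-spark-datasource-3.3_2.12-1.8.0.jar",
                "s3://my-bucket/jars/transitive-dependency-1.jar",
                "s3://my-bucket/jars/transitive-dependency-2.jar",
            ]),
        },
    },
)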

rashtao commented 1 month ago

Please note that there is an error in the spark.jars.packages value you reported above: it should be com.arangodb:arangodb-spark-datasource-3.3_2.12:1.8.0 (GAV coordinates, with a colon before the version) and not com.arangodb:arangodb-spark-datasource-3.3_2.12-1.8.0.

As you can see, this usage scenario is tested in the demo. You might want to investigate further with AWS support why transitive dependencies are not resolved in your AWS Glue application. If you prefer to work around it by manually adding the jars of all the transitive dependencies, then you should include these: