awslabs / aws-glue-libs

AWS Glue Libraries are additions and enhancements to Spark for ETL operations.

Failed to find data source: com.azure.cosmos.spark #174

Open Y-H-Lai opened 1 year ago

Y-H-Lai commented 1 year ago

Hi, I am quite new to AWS Glue and Apache Spark connectors, so I apologize if I am directing this question to the wrong party.

My team is migrating from Azure Cosmos DB to Amazon DynamoDB. We are trying to extract data from Cosmos DB through AWS Glue before storing it in Amazon S3 and DynamoDB, following this article: https://aws.amazon.com/blogs/database/migrate-from-azure-cosmos-db-to-amazon-dynamodb-using-aws-glue/

I have the IAM role, Glue connector, connection, job, and other required resources set up as shown in the article above.

However, when I try to run the job, I get the error "An error occurred while calling o87.getSource. Failed to find data source: com.azure.cosmos.spark. Please find packages at http://spark.apache.org/third-party-projects.html".

May I know how to resolve this issue?
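For context, the read step that the article's generated job script performs looks roughly like the sketch below. This is an assumption based on the Glue custom/marketplace connector pattern and the Cosmos DB Spark connector option names; the endpoint, key, database, container, and connection name are all placeholders:

```python
# Sketch (not the exact generated script) of the Glue read step that fails.
# All endpoint/database/container values below are placeholders.

def build_cosmos_read_options(endpoint, key, database, container,
                              connection_name="cosmosdb-connection"):
    """Build the connection_options dict passed to
    GlueContext.create_dynamic_frame_from_options for a custom Spark
    connector -- here, the Azure Cosmos DB Spark connector whose data
    source name ("com.azure.cosmos.spark") appears in the error."""
    return {
        "className": "com.azure.cosmos.spark",      # data source Spark must resolve
        "connectionName": connection_name,          # Glue connection from the console
        "spark.cosmos.accountEndpoint": endpoint,
        "spark.cosmos.accountKey": key,
        "spark.cosmos.database": database,
        "spark.cosmos.container": container,
    }

# Inside the Glue job, the dict is used roughly like this:
#   dyf = glueContext.create_dynamic_frame_from_options(
#       connection_type="marketplace.spark",
#       connection_options=build_cosmos_read_options(...),
#       transformation_ctx="cosmos_source",
#   )

options = build_cosmos_read_options(
    "https://example-account.documents.azure.com:443/",
    "<account-key>", "exampledb", "examplecontainer",
)
print(options["className"])  # → com.azure.cosmos.spark
```

The `className` value is what Spark's `DataSource.lookupDataSource` tries to resolve, which is why it is the name that surfaces in the `ClassNotFoundException`.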

The log details are shown below.

```
23/02/06 10:06:18 ERROR GlueExceptionAnalysisListener: [Glue Exception Analysis]
Event: GlueETLJobExceptionEvent
Timestamp: 1675677978748
Failure Reason:
Traceback (most recent call last):
  File "/tmp/test-migrate-cosmosdb-to-s3-loko.py", line 60, in <module>
    transformation_ctx="testcosmosdbtos3loko_node1",
  File "/opt/amazon/lib/python3.6/site-packages/awsglue/dynamicframe.py", line 770, in from_options
    format_options, transformation_ctx, push_down_predicate, **kwargs)
  File "/opt/amazon/lib/python3.6/site-packages/awsglue/context.py", line 232, in create_dynamic_frame_from_options
    source = self.getSource(connection_type, format, transformation_ctx, push_down_predicate, **connection_options)
  File "/opt/amazon/lib/python3.6/site-packages/awsglue/context.py", line 105, in getSource
    makeOptions(self._sc, options), transformation_ctx, push_down_predicate)
  File "/opt/amazon/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 111, in deco
    return f(*a, **kw)
  File "/opt/amazon/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o87.getSource.
: java.lang.ClassNotFoundException: Failed to find data source: com.azure.cosmos.spark. Please find packages at http://spark.apache.org/third-party-projects.html
	at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:689)
	at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSourceV2(DataSource.scala:743)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:266)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:226)
	at com.amazonaws.services.glue.marketplace.connector.CustomDataSourceFactory$.loadSparkDataSource(CustomDataSourceFactory.scala:89)
	at com.amazonaws.services.glue.marketplace.connector.CustomDataSourceFactory$.loadDataSource(CustomDataSourceFactory.scala:33)
	at com.amazonaws.services.glue.GlueContext.getCustomSource(GlueContext.scala:163)
	at com.amazonaws.services.glue.GlueContext.getCustomSourceWithConnection(GlueContext.scala:476)
	at com.amazonaws.services.glue.GlueContext.getSourceInternal(GlueContext.scala:965)
	at com.amazonaws.services.glue.GlueContext.getSource(GlueContext.scala:776)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.ClassNotFoundException: com.azure.cosmos.spark.DefaultSource
	at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
	at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$5(DataSource.scala:663)
	at scala.util.Try$.apply(Try.scala:209)
	at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$4(DataSource.scala:663)
	at scala.util.Failure.orElse(Try.scala:220)
	at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:663)
	... 20 more
Last Executed Line number: 60
script: test-migrate-cosmosdb-to-s3-loko.py
```
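The root cause line, `Caused by: java.lang.ClassNotFoundException: com.azure.cosmos.spark.DefaultSource`, indicates that the Cosmos DB Spark connector JAR was not on the job's classpath when Spark tried to resolve the data source. One common way (an assumption, not a confirmed fix for this specific setup) to supply the JAR is through Glue's special job parameters; the S3 path and connector version below are placeholders:

```python
# Sketch of Glue special job parameters that put a connector JAR on the
# classpath. The S3 path and the Maven coordinate/version are placeholders
# and must be replaced with a real upload of the Cosmos DB Spark connector.
job_args = {
    # Option A: a JAR uploaded to S3, passed via Glue's --extra-jars parameter
    "--extra-jars": "s3://my-bucket/jars/azure-cosmos-spark_3-1_2-12-<version>.jar",
    # Option B: have Spark fetch the package from Maven via spark.jars.packages
    "--conf": "spark.jars.packages=com.azure.cosmos.spark:azure-cosmos-spark_3-1_2-12:<version>",
}

# Only one of the two options is needed; either would be set on the job
# (console "Job parameters" section, or the Arguments field of the
# create-job/update-job API call).
print(sorted(job_args))
```

If the connector was subscribed through AWS Marketplace, the JAR should instead come from the connector's Glue connection, so a missing class there may also point at a mismatch between the connection and the job's "Connections" setting.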