awslabs / aws-glue-libs

AWS Glue Libraries are additions and enhancements to Spark for ETL operations.

Issue with create_dynamic_frame.from_catalog #152

Closed DenisMurakhovskiy closed 1 year ago

DenisMurakhovskiy commented 1 year ago

I am using the Glue Docker image from https://aws.amazon.com/blogs/big-data/develop-and-test-aws-glue-version-3-0-jobs-locally-using-a-docker-container/. Here is the command I use to start JupyterLab (on Windows):

docker run -it -v //c/.....aws:/home/glue_user/.aws -v //c/Users/..../JUPYTER/:/home/glue_user/workspace/jupyter_workspace/ -e AWS_PROFILE="my_profile" -e DISABLE_SSL=true --rm -p 4040:4040 -p 18080:18080 -p 8998:8998 -p 8888:8888 --name glue_jupyter_lab amazon/aws-glue-libs:glue_libs_3.0.0_image_01 /home/glue_user/jupyter/jupyter_start.sh

The container starts with this output:

starting org.apache.spark.deploy.history.HistoryServer, logging to /home/glue_user/spark/logs/spark-glue_user-org.apache.spark.deploy.history.HistoryServer-x-xxxxxxxx.out
starting java  -cp /home/glue_user/livy/jars/*:/home/glue_user/livy/conf:/home/glue_user/spark/conf:/home/glue_user/spark/conf: org.apache.livy.server.LivyServer, logging to /home/glue_user/livy/logs/livy-glue_user-server.out
SSL Disabled
[I 2022-09-20 14:06:26.303 ServerApp] jupyterlab | extension was successfully linked.
[I 2022-09-20 14:06:26.314 ServerApp] nbclassic | extension was successfully linked.
[I 2022-09-20 14:06:26.315 ServerApp] Writing Jupyter server cookie secret to /home/glue_user/.local/share/jupyter/runtime/jupyter_cookie_secret
[I 2022-09-20 14:06:27.498 ServerApp] sparkmagic | extension was found and enabled by notebook_shim. Consider moving the extension to Jupyter Server's extension paths.
[I 2022-09-20 14:06:27.498 ServerApp] sparkmagic | extension was successfully linked.
[I 2022-09-20 14:06:27.498 ServerApp] notebook_shim | extension was successfully linked.
[W 2022-09-20 14:06:27.523 ServerApp] All authentication is disabled.  Anyone who can connect to this server will be able to run code.
[I 2022-09-20 14:06:27.525 ServerApp] notebook_shim | extension was successfully loaded.
[I 2022-09-20 14:06:27.526 LabApp] JupyterLab extension loaded from /home/glue_user/.local/lib/python3.7/site-packages/jupyterlab
[I 2022-09-20 14:06:27.526 LabApp] JupyterLab application directory is /home/glue_user/.local/share/jupyter/lab
[I 2022-09-20 14:06:27.530 ServerApp] jupyterlab | extension was successfully loaded.
[I 2022-09-20 14:06:27.536 ServerApp] nbclassic | extension was successfully loaded.
[I 2022-09-20 14:06:27.536 ServerApp] sparkmagic extension enabled!
[I 2022-09-20 14:06:27.536 ServerApp] sparkmagic | extension was successfully loaded.
[I 2022-09-20 14:06:27.537 ServerApp] Serving notebooks from local directory: /home/glue_user/workspace/jupyter_workspace
[I 2022-09-20 14:06:27.537 ServerApp] Jupyter Server 1.18.1 is running at:
[I 2022-09-20 14:06:27.537 ServerApp] http://xxxxxxxxxxxxx:8888/lab
[I 2022-09-20 14:06:27.537 ServerApp]  or http://127.0.0.1:8888/lab
[I 2022-09-20 14:06:27.537 ServerApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).

JupyterLab works fine. I can run the following code and get a result:

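import boto3

def retrieve_tables(database_name):
    # The default session picks up credentials from the configured AWS profile.
    session = boto3.session.Session()
    glue_client = session.client("glue")
    response_get_tables = glue_client.get_tables(DatabaseName=database_name)
    return response_get_tables

# List the names of the tables in the database called "Name".
[table_dict["Name"] for table_dict in retrieve_tables("Name")["TableList"]]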
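A single get_tables call returns only one page of results, so the check above can miss tables in a large database. A minimal sketch of the same listing that follows pagination is below. It assumes the client is a boto3 Glue client; the helper name list_table_names and its parameters are hypothetical and only for illustration.

```python
def list_table_names(glue_client, database_name):
    """Return the names of all tables in a Glue database, page by page.

    glue_client is assumed to be a boto3 Glue client (or any object that
    exposes the same get_paginator interface).
    """
    # boto3 clients expose paginators by operation name.
    paginator = glue_client.get_paginator("get_tables")
    names = []
    for page in paginator.paginate(DatabaseName=database_name):
        # Each page carries its tables under the "TableList" key.
        names.extend(table["Name"] for table in page["TableList"])
    return names
```

With the setup from this post, a call would look like list_table_names(boto3.session.Session().client("glue"), "Name"). Passing the client in as an argument keeps the helper independent of how the session is created.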

Unfortunately, when I run the following command, I get an error.

my_df = glueContext.create_dynamic_frame.from_catalog(
    database='database_name',
    table_name='table_name')

The error I am getting is:

An error was encountered:
An error occurred while calling o70.getCatalogSource. Trace:
py4j.Py4JException: Method getCatalogSource([class java.lang.String, class java.lang.String, class java.lang.String, class java.lang.String, class java.lang.String, class com.amazonaws.services.glue.util.JsonOptions, null]) does not exist
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
    at py4j.Gateway.invoke(Gateway.java:274)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:750)

Traceback (most recent call last):
  File "/home/glue_user/aws-glue-libs/PyGlue.zip/awsglue/dynamicframe.py", line 625, in from_catalog
    return self._glue_context.create_dynamic_frame_from_catalog(db, table_name, redshift_tmp_dir, transformation_ctx, push_down_predicate, additional_options, catalog_id, **kwargs)
  File "/home/glue_user/aws-glue-libs/PyGlue.zip/awsglue/context.py", line 177, in create_dynamic_frame_from_catalog
    makeOptions(self._sc, additional_options), catalog_id),
  File "/home/glue_user/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/home/glue_user/spark/python/pyspark/sql/utils.py", line 111, in deco
    return f(*a, **kw)
  File "/home/glue_user/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 332, in get_return_value
    format(target_id, ".", name, value))
py4j.protocol.Py4JError: An error occurred while calling o70.getCatalogSource. Trace:
py4j.Py4JException: Method getCatalogSource([class java.lang.String, class java.lang.String, class java.lang.String, class java.lang.String, class java.lang.String, class com.amazonaws.services.glue.util.JsonOptions, null]) does not exist
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
    at py4j.Gateway.invoke(Gateway.java:274)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:750)

One more thing: when I run spark.sql("show databases"), I also get an error:

An error was encountered:
An error occurred while calling o77.toString. Trace:
java.lang.IllegalArgumentException: object is not an instance of declaring class
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:750)

Traceback (most recent call last):
  File "/home/glue_user/spark/python/pyspark/sql/session.py", line 723, in sql
    return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
  File "/home/glue_user/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/home/glue_user/spark/python/pyspark/sql/utils.py", line 113, in deco
    converted = convert_exception(e.java_exception)
  File "/home/glue_user/spark/python/pyspark/sql/utils.py", line 86, in convert_exception
    return AnalysisException(s.split(': ', 1)[1], stacktrace, c)
  File "/home/glue_user/spark/python/pyspark/sql/utils.py", line 27, in __init__
    self.cause = convert_exception(cause) if cause is not None else None
  File "/home/glue_user/spark/python/pyspark/sql/utils.py", line 105, in convert_exception
    return UnknownException(s, stacktrace, c)
  File "/home/glue_user/spark/python/pyspark/sql/utils.py", line 27, in __init__
    self.cause = convert_exception(cause) if cause is not None else None
  File "/home/glue_user/spark/python/pyspark/sql/utils.py", line 98, in convert_exception
    c.toString().startswith('org.apache.spark.api.python.PythonException: ')
  File "/home/glue_user/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/home/glue_user/spark/python/pyspark/sql/utils.py", line 111, in deco
    return f(*a, **kw)
  File "/home/glue_user/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 332, in get_return_value
    format(target_id, ".", name, value))
py4j.protocol.Py4JError: An error occurred while calling o77.toString. Trace:
java.lang.IllegalArgumentException: object is not an instance of declaring class
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:750)

I have a few questions:

DenisMurakhovskiy commented 1 year ago

I found a solution.