awslabs / aws-glue-libs

AWS Glue Libraries are additions and enhancements to Spark for ETL operations.

Support for AWS Glue v4 #158

Open aseychell opened 1 year ago

aseychell commented 1 year ago

Following the release of AWS Glue v4, when is it planned to update the aws-glue-libs to support the new version as well?

shrukul commented 1 year ago

On a similar note - when is it planned to upload 4.0.0 docker image to https://hub.docker.com/r/amazon/aws-glue-libs/tags ?

singlewind commented 1 year ago

There has been no fundamental change in a year. Either the project is dead or it simply is not being updated. I assume the 4.0 improvements come from upgrading PySpark. I will try switching to Python 3.10 and PySpark 3.3 to see whether it is still compatible.

saviodsouza29 commented 1 year ago

Glue 4.0 libs are released here: https://github.com/awslabs/aws-glue-libs/releases/tag/v4.0

Docker image will be updated shortly.

krismanaya commented 1 year ago

@saviodsouza29, I was wondering when the Docker image will be up for 4.0.

aseychell commented 1 year ago

@saviodsouza29

After downloading the latest Spark archive, I'm getting the following error, which seems to be caused by incorrectly packaged jar file versions in the Spark distribution. I'm running my job using ./bin/gluesparksubmit

ANTLR Tool version 4.3 used for code generation does not match the current runtime version 4.8
ANTLR Runtime version 4.7.2 used for parser compilation does not match the current runtime version 4.8
Traceback (most recent call last):
  File "/Users/aldrinseychell/dev/trees/aws-glue-libs/aws-glue-libs/basic_fundtransfers.py", line 79, in <module>
    FundTransfersSource = glueContext.create_dynamic_frame.from_catalog(
  File "/Users/aldrinseychell/dev/trees/aws-glue-libs/aws-glue-libs/awsglue/dynamicframe.py", line 629, in from_catalog
    return self._glue_context.create_dynamic_frame_from_catalog(db, table_name, redshift_tmp_dir, transformation_ctx, push_down_predicate, additional_options, catalog_id, **kwargs)
  File "/Users/aldrinseychell/dev/trees/aws-glue-libs/aws-glue-libs/awsglue/context.py", line 184, in create_dynamic_frame_from_catalog
    source = DataSource(self._ssql_ctx.getCatalogSource(db, table_name, redshift_tmp_dir, transformation_ctx,
  File "/Users/aldrinseychell/dev/trees/aws-glue-libs/spark-3.3.0-amzn-1-bin-3.3.3-amzn-0/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in __call__
  File "/Users/aldrinseychell/dev/trees/aws-glue-libs/spark-3.3.0-amzn-1-bin-3.3.3-amzn-0/python/pyspark/sql/utils.py", line 190, in deco
    return f(*a, **kw)
  File "/Users/aldrinseychell/dev/trees/aws-glue-libs/spark-3.3.0-amzn-1-bin-3.3.3-amzn-0/python/lib/py4j-0.10.9.5-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o34.getCatalogSource.
: java.lang.NoSuchMethodError: 'void org.json4s.CustomSerializer.<init>(scala.Function1, scala.reflect.Manifest)'
    at com.amazonaws.services.glue.util.StringToBoolean$.<init>(JsonOptions.scala:77)
    at com.amazonaws.services.glue.util.StringToBoolean$.<clinit>(JsonOptions.scala)
    at com.amazonaws.services.glue.util.JsonOptions$.apply(JsonOptions.scala:71)
    at com.amazonaws.services.glue.GlueContext.getCatalogSource(GlueContext.scala:225)
    at com.amazonaws.services.glue.GlueContext.getCatalogSource(GlueContext.scala:185)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:568)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
    at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
    at java.base/java.lang.Thread.run(Thread.java:833)
jimmymaise commented 1 year ago

@aseychell : I got the same error

https://github.com/awslabs/aws-glue-libs/issues/166

alvarosantossyngenta commented 1 year ago

Me too. With the current instructions and scripts this repository does not work.

singlewind commented 1 year ago

I don't have this issue after upgrading. Would you mind sharing your list of jars so we can do a side-by-side comparison?
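
For anyone wanting to do that side-by-side comparison themselves, here is a minimal sketch. It is not part of aws-glue-libs: `parse_jars` and `diff_jars` are hypothetical helper names, the name/version split is a heuristic that assumes the common `name-version.jar` file pattern, and it only looks at file names, not jar contents.

```python
import re
from typing import Dict, List, Set, Tuple

# Heuristic: "<name>-<version>.jar", where the version starts with a digit.
_JAR_RE = re.compile(r"^(?P<name>.+?)-(?P<version>\d[\w.\-]*)\.jar$")


def parse_jars(listing: str) -> Dict[str, Set[str]]:
    """Map artifact name -> set of versions found in a newline-separated listing."""
    jars: Dict[str, Set[str]] = {}
    for line in listing.splitlines():
        # Keep only the file name, dropping any leading directory.
        fname = line.strip().rsplit("/", 1)[-1]
        if not fname.endswith(".jar"):
            continue
        match = _JAR_RE.match(fname)
        if match:
            name, version = match.group("name"), match.group("version")
        else:
            # No recognisable version: use the bare file name, empty version.
            name, version = fname[:-4], ""
        jars.setdefault(name, set()).add(version)
    return jars


def diff_jars(left: str, right: str) -> Dict[str, Tuple[List[str], List[str]]]:
    """Return only the artifacts whose version sets differ between two listings."""
    a, b = parse_jars(left), parse_jars(right)
    return {
        name: (sorted(a.get(name, ())), sorted(b.get(name, ())))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }
```

Passing in the text of two jar listings (one per machine) returns a dict keyed by artifact name, with the versions seen on each side. An artifact that appears with more than one version on the same side is also worth a look.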

jimmymaise commented 1 year ago

Here is the list: list_jar.txt

singlewind commented 1 year ago

I have the same list of jars. @jimmymaise I will try loading from the catalog later to see whether I can replicate the issue.

jordannb commented 1 year ago

I'm experiencing what looks similar when trying to create a dynamic frame from a catalog.

>>> frame = glueContext.create_dynamic_frame.from_catalog(database="some_db", table_name="some_table")

ANTLR Tool version 4.3 used for code generation does not match the current runtime version 4.8
ANTLR Runtime version 4.7.2 used for parser compilation does not match the current runtime version 4.8
ANTLR Tool version 4.3 used for code generation does not match the current runtime version 4.8
ANTLR Runtime version 4.7.2 used for parser compilation does not match the current runtime version 4.8
java.lang.NoSuchMethodError: 'void org.json4s.CustomSerializer.<init>(scala.Function1, scala.reflect.Manifest)'
    at com.amazonaws.services.glue.util.StringToBoolean$.<init>(JsonOptions.scala:124)
    at com.amazonaws.services.glue.util.StringToBoolean$.<clinit>(JsonOptions.scala)
    at com.amazonaws.services.glue.util.JsonOptions$.apply(JsonOptions.scala:108)
    at com.amazonaws.services.glue.GlueContext.getCatalogSource(GlueContext.scala:238)
    at com.amazonaws.services.glue.GlueContext.getCatalogSource(GlueContext.scala:198)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:568)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
    at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
    at java.base/java.lang.Thread.run(Thread.java:833)
java.lang.NoClassDefFoundError: Could not initialize class com.amazonaws.services.glue.util.StringToBoolean$
    at com.amazonaws.services.glue.util.JsonOptions.liftedTree1$1(JsonOptions.scala:30)
    at com.amazonaws.services.glue.util.JsonOptions.<init>(JsonOptions.scala:29)
    at com.amazonaws.services.glue.util.JDBCConf.toJsonOptions(DataCatalogWrapper.scala:47)
    at com.amazonaws.services.glue.GlueContext.getGlueNativeJDBCSource(GlueContext.scala:514)
    at com.amazonaws.services.glue.GlueContext.getCatalogSource(GlueContext.scala:326)
    at com.amazonaws.services.glue.GlueContext.getCatalogSource(GlueContext.scala:198)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:568)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
    at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
    at java.base/java.lang.Thread.run(Thread.java:833)

The java.lang.NoClassDefFoundError: Could not initialize class com.amazonaws.services.glue.util.StringToBoolean$ error stack trace is then repeated 5 times.
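
A `NoSuchMethodError` like this is raised when the class found at runtime lacks a method that the calling code was compiled against, so one thing worth checking is which json4s jars are actually on disk. A minimal sketch for that, assuming the jars live under directories you pass in yourself; `find_jars` is a hypothetical helper, not something shipped with aws-glue-libs, and it does not prove which jar the JVM actually loaded.

```python
from pathlib import Path
from typing import Iterable, List


def find_jars(roots: Iterable[str], needle: str) -> List[Path]:
    """Recursively collect *.jar files whose file name contains `needle`."""
    hits: List[Path] = []
    for root in roots:
        base = Path(root)
        if not base.is_dir():
            # Skip directories that do not exist on this machine.
            continue
        hits.extend(p for p in base.rglob("*.jar") if needle in p.name)
    return sorted(hits)
```

Calling it with the local Spark directory and the Glue jars directory (wherever those are on your machine) and the needle `"json4s"` lists every candidate. More than one version of the same json4s artifact in the result would be a likely suspect.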

mikemlg commented 1 year ago

@singlewind: Any updates? May I ask for details of your environment, such as JVM version, OS, Python version, etc.?
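
To make it easier for everyone to post comparable details, here is a minimal sketch that gathers them using only the Python standard library. `collect_env` is a hypothetical helper name, and it assumes the `java` found on `PATH` is the one the Glue scripts would use, which may not hold if `JAVA_HOME` points elsewhere.

```python
import platform
import shutil
import subprocess
import sys
from typing import Dict


def collect_env() -> Dict[str, str]:
    """Gather OS, Python and Java version strings for a bug report."""
    info = {
        "os": platform.platform(),
        "python": platform.python_version(),
        "python_executable": sys.executable,
        "java": "not found on PATH",
    }
    java = shutil.which("java")
    if java:
        try:
            # `java -version` writes its banner to stderr, not stdout.
            proc = subprocess.run(
                [java, "-version"], capture_output=True, text=True, timeout=30
            )
            banner = (proc.stderr or proc.stdout).strip()
            info["java"] = banner.splitlines()[0] if banner else "unknown"
        except (OSError, subprocess.TimeoutExpired):
            info["java"] = "could not run java -version"
    return info
```

Printing each key and value of `collect_env()` gives a short block that can be pasted straight into a comment.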

alvarosantossyngenta commented 1 year ago

There was a rush to present Glue 4.0 at AWS re:Invent, but then no support for the developers.

brahhmi-aws commented 1 year ago

@singlewind, can you please share the Java and Python versions you tried? I'm having a similar issue with Python 3.10.8 and Corretto 20 (Java). I used the aws-glue-libs repo below: https://github.com/awslabs/aws-glue-libs.git -b master. I appreciate your swift reply.

alvarosantossyngenta commented 1 year ago

It is not affecting me anymore after you updated Docker images. Thanks

singlewind commented 1 year ago

I hope this is not too late. Here is my local setup, upgraded from v3:

Java: amazon-corretto-8-aarch64-macos-jdk
Python: 3.10.2
Spark: https://aws-glue-etl-artifacts.s3.amazonaws.com/glue-4.0/spark-3.3.0-amzn-1-bin-3.3.3-amzn-0.tgz
Glue-lib: https://github.com/awslabs/aws-glue-libs.git -b master