awslabs / aws-glue-libs

AWS Glue Libraries are additions and enhancements to Spark for ETL operations.

An error occurred while calling pyWriteDynamicFrame. : java.lang.NullPointerException #64

Open orf opened 4 years ago

orf commented 4 years ago

Running the snippet from the "creating new tables" documentation throws a NullPointerException if your job role does not have Lake Formation permissions on the database:

sink = glueContext.getSink(connection_type="s3", path="s3://whatever/",
                           enableUpdateCatalog=True, updateBehavior="UPDATE_IN_DATABASE")
sink.setFormat("json")
sink.setCatalogInfo(catalogDatabase="test-database", catalogTableName="my-catalog")
sink.writeFrame(final_frame)

The following exception is thrown:

File "script_2020-08-05-19-36-30.py", line 107, in <module>
sink.writeFrame(final_frame)
File "/mnt/yarn/usercache/root/appcache/application_1596655083053_0006/container_1596655083053_0006_01_000001/PyGlue.zip/awsglue/data_sink.py", line 31, in writeFrame
return DynamicFrame(self._jsink.pyWriteDynamicFrame(dynamic_frame._jdf, callsite(), info), dynamic_frame.glue_ctx, dynamic_frame.name + "_errors")
File "/mnt/yarn/usercache/root/appcache/application_1596655083053_0006/container_1596655083053_0006_01_000001/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/mnt/yarn/usercache/root/appcache/application_1596655083053_0006/container_1596655083053_0006_01_000001/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/mnt/yarn/usercache/root/appcache/application_1596655083053_0006/container_1596655083053_0006_01_000001/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o114.pyWriteDynamicFrame.
: java.lang.NullPointerException
at com.amazonaws.services.glue.DataSink.getCatalogTableWithSinkElseCreateTable(DataSink.scala:364)
at com.amazonaws.services.glue.sinks.HadoopDataSink$$anonfun$writeDynamicFrame$1.apply(HadoopDataSink.scala:234)
at com.amazonaws.services.glue.sinks.HadoopDataSink$$anonfun$writeDynamicFrame$1.apply(HadoopDataSink.scala:141)
at com.amazonaws.services.glue.util.FileSchemeWrapper$$anonfun$executeWithQualifiedScheme$1.apply(FileSchemeWrapper.scala:63)
at com.amazonaws.services.glue.util.FileSchemeWrapper$$anonfun$executeWithQualifiedScheme$1.apply(FileSchemeWrapper.scala:63)
at com.amazonaws.services.glue.util.FileSchemeWrapper.executeWith(FileSchemeWrapper.scala:57)
...

This is both a bug report and, hopefully, something that helps anyone else facing this issue.

michelmob commented 3 years ago

I'm having the same issue. I guess it is something with Lake Formation and the job's permissions.

Thanks!

Alam-Perez commented 3 years ago

I'm having the same issue. Logs:

Caused by: java.io.IOException: Failed to connect to /172.37.210.78:38871 at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:245)

2021-02-16 22:58:22,531 ERROR [main] glue.ProcessLauncher (Logging.scala:logError(70)): Exception in User Class: java.lang.reflect.UndeclaredThrowableException
    org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1862)
    org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:64)
    org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:188)
    org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:281)
    org.apache.spark.executor.CoarseGrainedExecutorBackendPlugin$class.launch(CoarseGrainedExecutorBackendWrapper.scala:10)
    org.apache.spark.executor.CoarseGrainedExecutorBackendWrapper$$anon$1.launch(CoarseGrainedExecutorBackendWrapper.scala:15)
    org.apache.spark.executor.CoarseGrainedExecutorBackendWrapper.launch(CoarseGrainedExecutorBackendWrapper.scala:19)
    org.apache.spark.executor.CoarseGrainedExecutorBackendWrapper$.main(CoarseGrainedExecutorBackendWrapper.scala:5)
    org.apache.spark.executor.CoarseGrainedExecutorBackendWrapper.main(CoarseGrainedExecutorBackendWrapper.scala)
    sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    java.lang.reflect.Method.invoke(Method.java:498)
    com.amazonaws.services.glue.SparkProcessLauncherPlugin$class.invoke(ProcessLauncher.scala:38)
    com.amazonaws.services.glue.ProcessLauncher$$anon$1.invoke(ProcessLauncher.scala:67)
    com.amazonaws.services.glue.ProcessLauncher.launch(ProcessLauncher.scala:108)
    com.amazonaws.services.glue.ProcessLauncher$.main(ProcessLauncher.scala:21)
    com.amazonaws.services.glue.ProcessLauncher.main(ProcessLauncher.scala)

Thanks!

tomer-1 commented 2 years ago

@orf Lake Formation permissions for a Glue database? I'm not sure I understand how that is related... Actually, I don't think you can grant Lake Formation permissions per database; Lake Formation permissions relate to the data lake service.

Can you elaborate on how you managed to fix this issue?

arividar commented 2 years ago

Same issue here. Is there a workaround for this?

ggjuancamilo commented 2 years ago

Hello, I faced the same issue and spent several hours on it. If you are sure there is no permission issue with Lake Formation or IAM over your Glue catalog, then you may have hit the same problem I did. When running the job on Glue 2 for the first time, it fails with the mentioned error; if you switch to Glue 3, it works just fine, and if you then go back to Glue 2 with the table already existing, it works again. If you need Glue 2 because you need Spark 2.x, you can use glueContext.write_dynamic_frame.from_options to write directly to S3 without updating the catalog, create and run a crawler, and then change the job back to getSink with UPDATE_IN_DATABASE; that may work. That is roughly what did it for me.
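The workaround above can be sketched roughly as follows. This is a sketch, not a tested Glue job: `write_with_fallback`, its parameters, and the injected `table_exists` flag are illustrative names introduced here; only `glueContext.getSink` and `write_dynamic_frame.from_options` are real Glue APIs.

```python
def write_with_fallback(glue_ctx, frame, path, database, table, table_exists):
    """Write `frame` to S3, updating the catalog only when the target
    table already exists (getSink with UPDATE_IN_DATABASE has been
    observed to NPE on Glue 2 when the table is missing)."""
    if table_exists:
        # Normal path: write files and update the existing catalog table.
        sink = glue_ctx.getSink(
            connection_type="s3",
            path=path,
            enableUpdateCatalog=True,
            updateBehavior="UPDATE_IN_DATABASE",
        )
        sink.setCatalogInfo(catalogDatabase=database, catalogTableName=table)
        sink.setFormat("json")
        return sink.writeFrame(frame)
    # Fallback: write plain files and let a crawler register the table later.
    return glue_ctx.write_dynamic_frame.from_options(
        frame=frame,
        connection_type="s3",
        connection_options={"path": path},
        format="json",
    )
```

On Glue 2 you would call this with `table_exists=False` on the first run, run a crawler, and pass `table_exists=True` afterward.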

FahdW commented 2 years ago

I found the same issue using Glue Studio; it turned out to be the CSV optimization option in Studio.

fawad2 commented 1 year ago

> I found the same issue using glue studio, I found out it was the csv optimization option in the studio

How do you fix it?

Prabhu4452 commented 1 year ago

> I found the same issue using glue studio, I found out it was the csv optimization option in the studio

Can you please explain how to fix it?

Prabhu4452 commented 1 year ago

> I found the same issue using glue studio, I found out it was the csv optimization option in the studio

How do you fix it? Did you fix it?

MilesMartinez commented 1 year ago

I resolved this issue when I realized my Glue job and database were not in the same region. Make sure all your resources are in the same region!

o3bvv commented 1 year ago

It turned out that `enableUpdateCatalog=True` is the real cause: if the table does not already exist, the job fails. Tested with Glue 3 and Glue 4.

Glue 4 at least explicitly printed the reason for the failure: the target table does not exist. Along with a ton of useless stack traces, of course.

Running the job only to produce files and running a crawler afterward can be an option. This is extremely confusing, as other jobs were running without issues. Such a wasted day. And night.
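Building on the observation above, one mitigation is to check whether the table exists before choosing a write path. A minimal sketch, assuming a boto3-style Glue client is injected; `table_exists` is a helper name made up here, but `get_table` and `EntityNotFoundException` are the real boto3 Glue client API:

```python
def table_exists(glue_client, database, table):
    """Return True if `database.table` exists in the Glue Data Catalog.

    `glue_client` is expected to behave like boto3.client("glue"):
    get_table raises EntityNotFoundException for a missing table.
    """
    try:
        glue_client.get_table(DatabaseName=database, Name=table)
        return True
    except glue_client.exceptions.EntityNotFoundException:
        return False
```

With this check, a job can enable `enableUpdateCatalog=True` only when the table is already in the catalog, and otherwise fall back to a plain write plus a crawler run, as suggested earlier in the thread.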