awslabs / aws-glue-libs

AWS Glue Libraries are additions and enhancements to Spark for ETL operations.

Glue libs Docker image: Create dynamic frame from catalog does not work properly with Lake Formation #142

Closed simonvdk closed 1 year ago

simonvdk commented 2 years ago

I have an issue when using the GlueContext.create_dynamic_frame_from_catalog method to load a Glue Data Catalog table when Lake Formation is enabled on that Data Catalog.

Steps to reproduce: (reproduced with both images amazon/aws-glue-libs:glue_libs_2.0.0_image_01 and amazon/aws-glue-libs:glue_libs_3.0.0_image_01)

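from awsglue.context import GlueContext
from pyspark.context import SparkContext
sc = SparkContext.getOrCreate()  # reuses the shell's existing SparkContext if there is one
glue_context = GlueContext(sc)
df = glue_context.create_dynamic_frame_from_catalog(database="my_glue_db", table_name="my_glue_table")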

Result:

The create_dynamic_frame_from_catalog call fails with a 403 S3 access denied error (see the full traceback below).

Conclusions:

Note: From within the container, I was able to query the same Glue table successfully using Athena. This indicates that the error above is not caused by how I launched the container or by other configuration, since Lake Formation credential vending worked for the Athena query. The issue therefore comes from how the Glue libs handle Lake Formation credential vending when retrieving the table.
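The issue does not show how the Athena query was run. As a minimal sketch of such a sanity check, assuming boto3 is used from inside the container, the helper below starts a one-row query and polls until Athena reports a final state. The function name `run_athena_check`, its parameters, and the `"TIMED_OUT"` return value are illustrative and not part of any AWS API; only `start_query_execution` and `get_query_execution` are real Athena client calls.

```python
import time


def run_athena_check(athena, database, table, output_location,
                     poll_seconds=1.0, max_polls=60):
    """Run a one-row SELECT through Athena and return the final query state.

    `athena` is expected to behave like boto3.client("athena").
    Returns "SUCCEEDED", "FAILED" or "CANCELLED" as reported by Athena,
    or the local sentinel "TIMED_OUT" (not an Athena state) if polling gives up.
    """
    started = athena.start_query_execution(
        QueryString=f'SELECT * FROM "{table}" LIMIT 1',
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]

    for _ in range(max_polls):
        execution = athena.get_query_execution(QueryExecutionId=query_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return state
        time.sleep(poll_seconds)
    return "TIMED_OUT"
```

With real credentials it could be called as `run_athena_check(boto3.client("athena"), "my_glue_db", "my_glue_table", "s3://my-athena-results/")`, where the results bucket is a placeholder. A `SUCCEEDED` state alongside the 403 from create_dynamic_frame_from_catalog would match the observation above: the same principal can read the table through Athena but not through the Glue libs in the container.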

Full traceback:

22/07/06 17:22:26 WARN DataCatalogWrapper: Cell Filtering is not supported in local development.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/glue_user/aws-glue-libs/PyGlue.zip/awsglue/context.py", line 155, in create_dynamic_frame_from_catalog
  File "/home/glue_user/aws-glue-libs/PyGlue.zip/awsglue/data_source.py", line 36, in getFrame
  File "/home/glue_user/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/home/glue_user/spark/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/home/glue_user/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o96.getDynamicFrame.
: java.nio.file.AccessDeniedException: s3://bucket/key: getFileStatus on s3://bucket/key: com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden (Service: Amazon S3; Status Code: 403; Error Code: 403 Forbidden; Request ID: XXXXXXXXX; S3 Extended Request ID: XXXXXXXXX; Proxy: null), S3 Extended Request ID: XXXXXXXXX
    at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:158)
    at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:101)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:1568)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:117)
    at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1437)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.exists(S3AFileSystem.java:2040)
    at com.amazonaws.services.glue.util.FileSystemFolder.listFiles(FileLister.scala:227)
    at com.amazonaws.services.glue.hadoop.DefaultPartitionFilesLister$$anonfun$_partitions$1.apply(FileSystemBookmark.scala:83)
    at com.amazonaws.services.glue.hadoop.DefaultPartitionFilesLister$$anonfun$_partitions$1.apply(FileSystemBookmark.scala:81)
    at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
    at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
    at scala.collection.immutable.List.foreach(List.scala:392)
    at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
    at scala.collection.immutable.List.flatMap(List.scala:355)
    at com.amazonaws.services.glue.hadoop.DefaultPartitionFilesLister._partitions(FileSystemBookmark.scala:81)
    at com.amazonaws.services.glue.hadoop.DefaultPartitionFilesLister.partitions(FileSystemBookmark.scala:77)
    at com.amazonaws.services.glue.SparkSQLDataSource$$anonfun$getDynamicFrame$9.apply(DataSource.scala:724)
    at com.amazonaws.services.glue.SparkSQLDataSource$$anonfun$getDynamicFrame$9.apply(DataSource.scala:702)
    at com.amazonaws.services.glue.util.FileSchemeWrapper$$anonfun$executeWithQualifiedScheme$1.apply(FileSchemeWrapper.scala:89)
    at com.amazonaws.services.glue.util.FileSchemeWrapper$$anonfun$executeWithQualifiedScheme$1.apply(FileSchemeWrapper.scala:89)
    at com.amazonaws.services.glue.util.FileSchemeWrapper.executeWith(FileSchemeWrapper.scala:82)
    at com.amazonaws.services.glue.util.FileSchemeWrapper.executeWithQualifiedScheme(FileSchemeWrapper.scala:89)
    at com.amazonaws.services.glue.SparkSQLDataSource.getDynamicFrame(DataSource.scala:701)
    at com.amazonaws.services.glue.DataSource$class.getDynamicFrame(DataSource.scala:97)
    at com.amazonaws.services.glue.SparkSQLDataSource.getDynamicFrame(DataSource.scala:683)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)
Caused by: com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden (Service: Amazon S3; Status Code: 403; Error Code: 403 Forbidden; Request ID: XXXXXXXXX; S3 Extended Request ID: XXXXXXXXX; Proxy: null), S3 Extended Request ID: XXXXXXXXX
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1862)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleServiceErrorResponse(AmazonHttpClient.java:1415)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1384)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1154)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:811)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:779)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:753)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:713)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:695)
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:559)
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:539)
    at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5453)
    at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5400)
    at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:1372)
    at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:1346)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.getObjectMetadata(S3AFileSystem.java:904)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:1553)
    ... 33 more

moomindani commented 1 year ago

Lake Formation permissions are not supported in the aws-glue-libs Docker container. To use the Lake Formation permission integration, we recommend running the job on the Glue job system (Glue jobs or Glue Interactive Sessions).