Open doctormohamed opened 1 week ago
I'm not too familiar with Glue catalog + LakeFormation, but I want to summarize the issue above.
The table is iceberg_ns.iceberg_table, where iceberg_ns is the namespace and iceberg_table is the table.
The table root namespace is "gov-iceberg_ns". Is this a LakeFormation concept?
Loading the table as iceberg_ns.iceberg_table works:
table = catalog.load_table("iceberg_ns.iceberg_table")
but it errors when writing with table.overwrite(df).
Do you mind providing an example that I can use to reproduce the issue?
Here is an example:
import pyiceberg.catalog
import os
import pyarrow as pa
catalog_name = "AwsGlueCatalog"
catalog = pyiceberg.catalog.load_catalog(
catalog_name, **{"type": "glue"}
)
df = pa.Table.from_pylist(
[
{"col_1": 1, "col_2": "d"}
]
)
database = 'db'
table_name = 'table'
table = catalog.load_table(f'{database}.{table_name}')
# table.identifier = (catalog_name, database, table_name)
table.overwrite(df)
@kevinjqliu any news on this? please let me know if you need more clarification :)
To clarify, the example above does not work unless the table.identifier override is applied, right?
Non-working example:
database = 'db'
table_name = 'table'
table = catalog.load_table(f'{database}.{table_name}')
table.overwrite(df)
Working example:
database = 'db'
table_name = 'table'
table = catalog.load_table(f'{database}.{table_name}')
table.identifier = (catalog_name, database, table_name)
table.overwrite(df)
For Glue catalog, the catalog name is used as part of the table identifier https://github.com/apache/iceberg-python/blob/a6cd0cf325b87b360077bad1d79262611ea64424/pyiceberg/catalog/glue.py#L326
Maybe there's a mismatch between the glue name and the catalog name
Is there a way to reproduce this in a test case, test_glue.py?
@kevinjqliu
To clarify, the example above does not work unless the table.identifier override is applied, right? --> Correct
What I'm overriding here is the database name.
See this:
import pandas as pd
import pyarrow as pa
database = 'demo_fs'
table_name = 'demo_table_20240620_pyiceberg1'
df = pd.DataFrame({'col_1': [1, 2], 'col_2': ['a', 'b']})
df = pa.Table.from_pandas(df)
table = catalog.create_table(
identifier=f'{database}.{table_name}',
schema=df.schema
)
print(table.identifier)
Result:
('AwsGlueCatalog', 'gov-demo_fs', 'demo_table_20240620_pyiceberg1')
As you can see, the database name has changed: because an AWS Lake Formation resource link is active, the prefix "gov-" is added to the database name in Glue (I think that is the 'real' database).
Here is a screenshot of the error:
Is there a way to reproduce this in a test case, test_glue.py? --> I'm not sure how to deal with Lake Formation in a test case 😢
Oh! So the returned table name differs from the one specified in create_table.
And this assertion will fail.
database = 'demo_fs'
table_name = 'demo_table_20240620_pyiceberg1'
df = pd.DataFrame({'col_1': [1, 2], 'col_2': ['a', 'b']})
df = pa.Table.from_pandas(df)
original_table_name = f'{database}.{table_name}'
table = catalog.create_table(
identifier=original_table_name,
schema=df.schema
)
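# table.identifier is a tuple that starts with the catalog name, so compare the remaining parts
assert table.identifier[1:] == (database, table_name)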
original table name is demo_fs.demo_table_20240620_pyiceberg1
but table.identifier is ('AwsGlueCatalog', 'gov-demo_fs', 'demo_table_20240620_pyiceberg1')
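The comparison above can be sketched as a small helper. This is only an illustration: it assumes, as in the printed output earlier in this thread, that table.identifier is a tuple of the form (catalog_name, database, table). The name identifier_matches is hypothetical and is not part of pyiceberg.

```python
from typing import Tuple


def identifier_matches(requested: str, returned: Tuple[str, ...], catalog_name: str) -> bool:
    """Return True if the tuple returned by the catalog refers to the requested table.

    Assumption: `returned` looks like (catalog_name, database, table).
    """
    expected = (catalog_name, *requested.split("."))
    return tuple(returned) == expected


requested = "demo_fs.demo_table_20240620_pyiceberg1"
returned = ("AwsGlueCatalog", "gov-demo_fs", "demo_table_20240620_pyiceberg1")

# The database part differs ('gov-demo_fs' vs 'demo_fs'), so a mismatch is detected.
mismatch_detected = not identifier_matches(requested, returned, "AwsGlueCatalog")
```

A check like this compares tuple to tuple, so it fails only when a name actually differs rather than because a string is compared with a tuple.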
Do you know if there are docs on Glue/LakeFormation behavior that would suggest the addition of the gov- prefix?
When I asked ChatGPT about it, it says the following:
AWS Lake Formation may add a prefix to your database names in AWS Glue to help distinguish between databases created natively in AWS Glue and those managed by Lake Formation. This can occur due to certain configurations or default behaviors in Lake Formation.
Actually, the DB was created manually by our architect from the AWS Lake Formation console: first the gov database was created, then the non-prefixed DB, and then a resource link between the two. So it's a Lake Formation managed database, not a native Glue database.
To clarify, the "gov-demo_fs" database is Lake Formation managed; the other database, "demo_fs", is a Glue native database.
Very odd to see this behavior...
It seems like the issue is around create_table / load_table.
Can you try stepping through the calls and see where the gov- prefix is returned?
It's odd to me that we called glue with one namespace/database and another one is returned.
OK, I found the issue. It is in:
def _get_glue_table(self, database_name: str, table_name: str) -> TableTypeDef:
    try:
        load_table_response = self.glue.get_table(DatabaseName=database_name, Name=table_name)
        return load_table_response["Table"]
    except self.glue.exceptions.EntityNotFoundException as e:
        raise NoSuchTableError(f"Table does not exist: {database_name}.{table_name}") from e
It returns the following:
{'Name': 'demo_table_20240620_pyiceberg',
 'DatabaseName': 'gov-dev_demo_fs',
 'CreateTime': datetime.datetime(2024, 6, 21, 13, 6, 10, tzinfo=tzlocal()),
 'UpdateTime': datetime.datetime(2024, 6, 25, 14, 21, 36, tzinfo=tzlocal()),
 'Retention': 0,
 'StorageDescriptor': {'Columns': [{'Name': 'col_1',
                                    'Type': 'bigint',
                                    'Parameters': {'iceberg.field.current': 'true',
                                                   'iceberg.field.id': '1',
                                                   'iceberg.field.optional': 'true'}},
                                   {'Name': 'col_2',
                                    'Type': 'string',
                                    'Parameters': {'iceberg.field.current': 'true',
                                                   'iceberg.field.id': '2',
                                                   'iceberg.field.optional': 'true'}}],
                       'Location': '',
                       'Compressed': False,
                       'NumberOfBuckets': 0,
                       'SortColumns': [],
                       'StoredAsSubDirectories': False},
 'TableType': 'EXTERNAL_TABLE',
 'Parameters': {'metadata_location': '',
                'previous_metadata_location': '',
                'table_type': 'ICEBERG'},
 'CreatedBy': '',
 'IsRegisteredWithLakeFormation': False,
 'CatalogId': '',
 'VersionId': '3'}
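A minimal sketch of how the renamed database in a response like the one above could be detected. It compares the database name that was requested with the DatabaseName field Glue returned. The helper name database_name_changed is hypothetical, and the requested name 'dev_demo_fs' is taken from the load_table call used in this thread.

```python
from typing import Any, Mapping


def database_name_changed(requested_database: str, glue_table: Mapping[str, Any]) -> bool:
    """Return True when the Glue response names a different database than requested."""
    return glue_table.get("DatabaseName") != requested_database


# Trimmed copy of the response above; only the fields used here are kept.
glue_table = {
    "Name": "demo_table_20240620_pyiceberg",
    "DatabaseName": "gov-dev_demo_fs",
}

# Requested 'dev_demo_fs', but the response says 'gov-dev_demo_fs'.
changed = database_name_changed("dev_demo_fs", glue_table)
```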
What did you call the _get_glue_table function with?
self._get_glue_table(database_name=database_name, table_name=table_name)
Is the issue here that we're calling _get_glue_table with
database = 'demo_fs'
table_name = 'demo_table_20240620_pyiceberg1'
but Glue returns
'Name': 'demo_table_20240620_pyiceberg',
'DatabaseName': 'gov-dev_demo_fs',
Yes, here is my code:
import pyiceberg.catalog
glue_catalog = 'AwsGlueCatalog'
catalog = pyiceberg.catalog.load_catalog(
glue_catalog, **{"type": "glue"}
)
database = 'dev_demo_fs'
table_name = 'demo_table_20240620_pyiceberg'
table = catalog.load_table(f'{database}.{table_name}')
And I see that _get_glue_table is used in two functions:
glue.py: _commit_table()
glue.py: load_table()
self.glue.get_table(DatabaseName=database_name, Name=table_name)
get_table is a Glue function from botocore. I think this is an issue with Glue.
Is there a place to raise Glue related issues?
This could be a LakeFormation permission issue. The caller should technically have full table access if they made the create table request. But that requires them to be a data lake admin. Can you verify if the AWS user/role making this request has the right permissions in Lake Formation?
Furthermore, you should be able to grant the caller select/insert access on this table after its creation to mitigate the issue.
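As a rough sketch of the grant suggested above, the snippet below builds the arguments for the Lake Formation GrantPermissions call exposed by boto3 as grant_permissions. The role ARN is a made-up placeholder, and granting on the 'gov-' prefixed database rather than the linked one is an assumption that would need to be checked against the actual setup.

```python
from typing import Any, Dict, List


def build_grant_request(principal_arn: str, database: str, table: str,
                        permissions: List[str]) -> Dict[str, Any]:
    """Build keyword arguments for the Lake Formation GrantPermissions call."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {"Table": {"DatabaseName": database, "Name": table}},
        "Permissions": permissions,
    }


request = build_grant_request(
    principal_arn="arn:aws:iam::123456789012:role/example-writer",  # hypothetical role
    database="gov-dev_demo_fs",  # assumption: grant on the Lake Formation managed database
    table="demo_table_20240620_pyiceberg",
    permissions=["SELECT", "INSERT"],
)

# With AWS credentials configured, the request could be sent like this:
# import boto3
# boto3.client("lakeformation").grant_permissions(**request)
```

Building the request separately from sending it makes it possible to review exactly which principal and table are affected before anything is changed in Lake Formation.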
Apache Iceberg version
0.6.0 (latest release)
Please describe the bug 🐞
Hi,
I identified this bug while running some tests with our AWS Glue catalog, which uses Lake Formation. The catalog is governed by Lake Formation across multiple AWS accounts.
NoSuchTableError: Table does not exist: gov-iceberg_ns.iceberg_table
As a workaround, before each commit action I do:
table = catalog.load_table(f'{namespace}.{table_name}')
table.identifier = (catalog_name, namespace, table_name) # <--- add this line
table.overwrite(df) # <--- commit
Hope this helps to fix the issue.
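The workaround above could be wrapped in a small helper so it is applied the same way before every commit. This is a sketch only: the function name is hypothetical, and it assumes the table object lets its identifier attribute be reassigned, as the workaround in this issue does with pyiceberg 0.6.0.

```python
from typing import Any


def load_table_with_requested_identifier(catalog: Any, catalog_name: str,
                                         namespace: str, table_name: str) -> Any:
    """Load a table, then reset its identifier to the names that were requested."""
    table = catalog.load_table(f"{namespace}.{table_name}")
    # Workaround from this issue: replace the identifier resolved by Glue
    # (for example with a 'gov-' prefix) with the requested one.
    table.identifier = (catalog_name, namespace, table_name)
    return table


# Usage, assuming `catalog` and `df` are defined as in the examples in this thread:
# table = load_table_with_requested_identifier(catalog, "AwsGlueCatalog", "iceberg_ns", "iceberg_table")
# table.overwrite(df)
```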