apache / iceberg-python

Apache PyIceberg
https://py.iceberg.apache.org/
Apache License 2.0

(AWS Lake Formation shared resources) Iceberg tables in AWS Glue catalog have a different root namespace than the original #845

Open doctormohamed opened 1 week ago

doctormohamed commented 1 week ago

Apache Iceberg version

0.6.0 (latest release)

Please describe the bug 🐞

Hi,

I found this bug while running some tests against our AWS Glue catalog, which uses Lake Formation. The catalog is governed by Lake Formation across multiple AWS accounts.

  1. A table is created, say with the name "iceberg_table", in a namespace called "iceberg_ns".
  2. The table's root namespace is "gov-iceberg_ns" (created automatically by the shared resource).
  3. When I try to write to the table, do_commit looks for "gov-iceberg_ns" and throws the following exception in pyiceberg/catalog/glue.py:332, in the GlueCatalog._get_glue_table function:

NoSuchTableError: Table does not exist: gov-iceberg_ns.iceberg_table

As a workaround, before each commit action I do:

table = catalog.load_table(f'{namespace}.{table_name}')
table.identifier = (catalog_name, namespace, table_name)  # <--- add this line
table.overwrite(df)  # <--- commit

Hope this helps to fix the issue.
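
The workaround above can be wrapped in a small helper so it does not have to be repeated before every commit. This is only a sketch: the function name is hypothetical, `catalog` is assumed to behave like a PyIceberg catalog, and it relies on `table.identifier` being reassignable, which is what the report describes for 0.6.0.

```python
from typing import Any, Tuple


def load_table_with_requested_identifier(
    catalog: Any, catalog_name: str, namespace: str, table_name: str
) -> Any:
    """Load a table and pin its identifier to the namespace the caller asked for.

    Assumption: `catalog` exposes `load_table` like a PyIceberg catalog, and the
    returned table allows `identifier` to be reassigned, as in the report above.
    """
    table = catalog.load_table(f"{namespace}.{table_name}")
    requested: Tuple[str, str, str] = (catalog_name, namespace, table_name)
    if tuple(table.identifier) != requested:
        # Glue handed back a different database name (e.g. "gov-iceberg_ns");
        # restore the requested one so the commit looks up the same table.
        table.identifier = requested
    return table
```

Usage would then be `load_table_with_requested_identifier(catalog, catalog_name, namespace, table_name).overwrite(df)`.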

kevinjqliu commented 1 week ago

I'm not too familiar with Glue catalog + LakeFormation, but I want to summarize the issue above.

  1. An Iceberg table was created as iceberg_ns.iceberg_table, where iceberg_ns is the namespace and iceberg_table is the table.
  2. The table's root namespace is "gov-iceberg_ns". Is this a LakeFormation concept?
  3. The table can be loaded as iceberg_ns.iceberg_table:
    table = catalog.load_table("iceberg_ns.iceberg_table")

    but it errors when writing with table.overwrite(df).

Do you mind providing an example that I can use to reproduce the issue?

doctormohamed commented 1 week ago

Here is an example:

import pyiceberg.catalog
import pyarrow as pa

catalog_name = "AwsGlueCatalog"

catalog = pyiceberg.catalog.load_catalog(
    catalog_name, **{"type": "glue"}
)

df = pa.Table.from_pylist(
    [
        {"col_1": 1, "col_2": "d"}
    ]
)
database = 'db'
table_name = 'table'
table = catalog.load_table(f'{database}.{table_name}')
# table.identifier = (catalog_name, database, table_name)
table.overwrite(df)

doctormohamed commented 1 week ago

@kevinjqliu any news on this? Please let me know if you need more clarification :)

kevinjqliu commented 1 week ago

To clarify, the example above does not work unless the table.identifier override is applied, right?

Non-working example:

database = 'db'
table_name = 'table'
table = catalog.load_table(f'{database}.{table_name}')
table.overwrite(df)

Working example:

database = 'db'
table_name = 'table'
table = catalog.load_table(f'{database}.{table_name}')
table.identifier = (catalog_name, database, table_name)
table.overwrite(df)

For Glue catalog, the catalog name is used as part of the table identifier https://github.com/apache/iceberg-python/blob/a6cd0cf325b87b360077bad1d79262611ea64424/pyiceberg/catalog/glue.py#L326

Maybe there's a mismatch between the glue name and the catalog name

kevinjqliu commented 1 week ago

Is there a way to reproduce this in a test case, test_glue.py?

doctormohamed commented 1 week ago

@kevinjqliu

To clarify, the example above does not work unless the table.identifier override is applied, right? --> Correct

What I'm overriding here is the database name.

See this:

import pandas as pd
import pyarrow as pa

database = 'demo_fs'
table_name = 'demo_table_20240620_pyiceberg1'

df = pd.DataFrame({'col_1': [1, 2], 'col_2': ['a', 'b']})
df = pa.Table.from_pandas(df)

table = catalog.create_table(
    identifier=f'{database}.{table_name}',
    schema=df.schema
)
print(table.identifier)

Result: ('AwsGlueCatalog', 'gov-demo_fs', 'demo_table_20240620_pyiceberg1')

As you can see, the database name has changed: because an AWS Lake Formation resource link is active, the database in Glue carries a "gov-" prefix (I think that is the 'real' database).

Here is a screenshot of the error: image

doctormohamed commented 1 week ago

Is there a way to reproduce this in a test case, test_glue.py?

I'm not sure how to deal with Lake Formation in a test case 🔒

kevinjqliu commented 1 week ago

Oh! So the returned table name differs from the one specified in create_table. And this assertion will fail.

database = 'demo_fs'
table_name = 'demo_table_20240620_pyiceberg1'

df = pd.DataFrame({'col_1': [1, 2], 'col_2': ['a', 'b']})
df = pa.Table.from_pandas(df)

original_table_name = f'{database}.{table_name}'

table = catalog.create_table(
    identifier=original_table_name,
    schema=df.schema
)

# table.identifier is a (catalog, database, table) tuple, so compare its last two parts
assert table.identifier[-2:] == (database, table_name)

The original table name is demo_fs.demo_table_20240620_pyiceberg1, but table.identifier is ('AwsGlueCatalog', 'gov-demo_fs', 'demo_table_20240620_pyiceberg1').

Do you know if there are docs on Glue/LakeFormation behavior that would explain the addition of the gov- prefix?

doctormohamed commented 1 week ago

When I asked ChatGPT about it, it said the following:

AWS Lake Formation may add a prefix to your database names in AWS Glue to help distinguish between databases created natively in AWS Glue and those managed by Lake Formation. This can occur due to certain configurations or default behaviors in Lake Formation.

doctormohamed commented 1 week ago

Actually, the DB was created manually by our architect from the AWS Lake Formation console: first the gov database, then the non-prefixed DB, and then a resource link between the two. So it's a Lake Formation managed database, not a native Glue database.
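
If the non-prefixed database really is a resource link, Glue should report the link target when the database is described. The sketch below assumes `glue_client` is a boto3 Glue client (`boto3.client("glue")`) and reads the `TargetDatabase` field of the `get_database` response; the function name is hypothetical, and it has not been run against the setup in this issue.

```python
from typing import Any, Optional


def resource_link_target(glue_client: Any, database_name: str) -> Optional[str]:
    """Return the database a Glue resource link points at, or None if it is not a link.

    Assumption: `glue_client` is a boto3 Glue client. For a resource link, the
    `get_database` response carries the target under the `TargetDatabase` key.
    """
    database = glue_client.get_database(Name=database_name)["Database"]
    target = database.get("TargetDatabase")
    return target["DatabaseName"] if target else None
```

If `demo_fs` is the link, `resource_link_target(glue_client, "demo_fs")` would be expected to return the prefixed name, which would explain why Glue answers with a different database than the one requested.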

doctormohamed commented 1 week ago

To clarify, the "gov-demo_fs" database is Lake Formation managed; the other database, "demo_fs", is a Glue native database.

image

kevinjqliu commented 1 week ago

Very odd to see this behavior...

It seems like the issue is around create_table / load_table

https://github.com/apache/iceberg-python/blob/a6cd0cf325b87b360077bad1d79262611ea64424/pyiceberg/catalog/glue.py#L391-L405

https://github.com/apache/iceberg-python/blob/a6cd0cf325b87b360077bad1d79262611ea64424/pyiceberg/catalog/glue.py#L524

Can you try stepping through the calls and see where the gov- prefix is returned?

It's odd to me that we called glue with one namespace/database and another one is returned.
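
One way to narrow this down without a debugger is to call Glue directly and compare the database name in the response with the one that was sent. A minimal sketch, assuming `glue_client` is a boto3 Glue client; the helper name is hypothetical:

```python
from typing import Any, Optional


def returned_database_mismatch(
    glue_client: Any, database_name: str, table_name: str
) -> Optional[str]:
    """Call Glue `get_table` and return the database name from the response
    if it differs from the requested one, otherwise None."""
    table = glue_client.get_table(DatabaseName=database_name, Name=table_name)["Table"]
    returned = table.get("DatabaseName")
    return returned if returned != database_name else None
```

A non-None result would mean the renamed database comes from the Glue response itself rather than from PyIceberg's handling of it.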

doctormohamed commented 6 days ago

OK, I found the issue. It is in:

    def _get_glue_table(self, database_name: str, table_name: str) -> TableTypeDef:
        try:
            load_table_response = self.glue.get_table(DatabaseName=database_name, Name=table_name)
            return load_table_response["Table"]
        except self.glue.exceptions.EntityNotFoundException as e:
            raise NoSuchTableError(f"Table does not exist: {database_name}.{table_name}") from e

It returns the following:

{'Name': 'demo_table_20240620_pyiceberg',
 'DatabaseName': 'gov-dev_demo_fs',
 'CreateTime': datetime.datetime(2024, 6, 21, 13, 6, 10, tzinfo=tzlocal()),
 'UpdateTime': datetime.datetime(2024, 6, 25, 14, 21, 36, tzinfo=tzlocal()),
 'Retention': 0,
 'StorageDescriptor': {'Columns': [{'Name': 'col_1',
    'Type': 'bigint',
    'Parameters': {'iceberg.field.current': 'true',
     'iceberg.field.id': '1',
     'iceberg.field.optional': 'true'}},
   {'Name': 'col_2',
    'Type': 'string',
    'Parameters': {'iceberg.field.current': 'true',
     'iceberg.field.id': '2',
     'iceberg.field.optional': 'true'}}],
  'Location': '',
  'Compressed': False,
  'NumberOfBuckets': 0,
  'SortColumns': [],
  'StoredAsSubDirectories': False},
 'TableType': 'EXTERNAL_TABLE',
 'Parameters': {'metadata_location': '',
  'previous_metadata_location': '',
  'table_type': 'ICEBERG'},
 'CreatedBy': '',
 'IsRegisteredWithLakeFormation': False,
 'CatalogId': '',
 'VersionId': '3'}

kevinjqliu commented 6 days ago

What did you call the _get_glue_table function with?

self._get_glue_table(database_name=database_name, table_name=table_name)

Is the issue here that we're calling _get_glue_table with

database = 'demo_fs'
table_name = 'demo_table_20240620_pyiceberg1'

but glue returns back

'Name': 'demo_table_20240620_pyiceberg',
 'DatabaseName': 'gov-dev_demo_fs',

doctormohamed commented 6 days ago

Yes, here is my code:

import pyiceberg.catalog

glue_catalog = 'AwsGlueCatalog'
catalog = pyiceberg.catalog.load_catalog(
    glue_catalog, **{"type": "glue"}
)
database = 'dev_demo_fs'
table_name = 'demo_table_20240620_pyiceberg'

table = catalog.load_table(f'{database}.{table_name}')

doctormohamed commented 6 days ago

And I see that _get_glue_table is used in two functions:

glue.py: _commit_table()
glue.py: load_table()

kevinjqliu commented 6 days ago

self.glue.get_table(DatabaseName=database_name, Name=table_name)

get_table is a Glue function from botocore. I think this is an issue with Glue. Is there a place to raise Glue-related issues?

geruh commented 5 days ago

This could be a LakeFormation permission issue. The caller should technically have full table access if they made the create table request. But that requires them to be a data lake admin. Can you verify if the AWS user/role making this request has the right permissions in Lake Formation?

Furthermore, you should be able to grant the caller select/insert access on this table after its creation to mitigate the issue.
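
The grant suggested above could look roughly like the following. This is a sketch under assumptions: the helper name is hypothetical, the principal ARN and table names are whatever applies in your account, and the resulting dictionary is meant to be passed to the boto3 Lake Formation client's `grant_permissions` call.

```python
from typing import Any, Dict, Sequence


def build_table_grant(
    principal_arn: str,
    database_name: str,
    table_name: str,
    permissions: Sequence[str] = ("SELECT", "INSERT"),
) -> Dict[str, Any]:
    """Build keyword arguments for a Lake Formation `grant_permissions` call
    that gives one principal the listed permissions on one table."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {"Table": {"DatabaseName": database_name, "Name": table_name}},
        "Permissions": list(permissions),
    }


# Usage (requires boto3 and AWS credentials, so it is left commented out):
# import boto3
# boto3.client("lakeformation").grant_permissions(
#     **build_table_grant(principal_arn, "gov-demo_fs", "demo_table_20240620_pyiceberg1")
# )
```

Whether the grant should target the prefixed database or the linked one depends on how the resource link is set up, so treat the database name in the usage comment as a placeholder.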