Unable to load an iceberg table from aws glue catalog

arookieds commented 3 months ago

Question

PyIceberg version: 0.6.0 Python version: 3.11.1

Comments:

Iceberg tables are saved in an AWS Glue catalog
The catalog, the list of namespaces, and the list of tables are retrievable through the catalog API

Hi,

I am facing issues loading Iceberg tables from AWS Glue. The code I am using is as follows:

from opensea.resources.resources import *
import pyiceberg.catalog

profile_name = "saml2aws_profile_name"
catalog_name = "catalog name"
table_name = "table name"
aws_region = "aws region"

catalog = pyiceberg.catalog.load_catalog(
    catalog_name, **{"type": "glue", "profile_name": profile_name}
)

print(catalog.list_namespaces())

table = catalog.load_table((catalog_name, table_name))

The code allows me to:

list namespaces
list tables

But load_table throws the following error:

Traceback (most recent call last):
  File "/path/to/the/project/testing.py", line 15, in <module>
    table = catalog.load_table((catalog_name, table_name))
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/to/the/project/venv/lib/python3.11/site-packages/pyiceberg/catalog/glue.py", line 473, in load_table
    return self._convert_glue_to_iceberg(self._get_glue_table(database_name=database_name, table_name=table_name))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/to/the/project/venv/lib/python3.11/site-packages/pyiceberg/catalog/glue.py", line 296, in _convert_glue_to_iceberg
    metadata = FromInputFile.table_metadata(file)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/to/the/project/venv/lib/python3.11/site-packages/pyiceberg/serializers.py", line 112, in table_metadata
    with input_file.open() as input_stream:
         ^^^^^^^^^^^^^^^^^
  File "/path/to/the/project/venv/lib/python3.11/site-packages/pyiceberg/io/pyarrow.py", line 263, in open
    input_file = self._filesystem.open_input_file(self._path)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/_fs.pyx", line 780, in pyarrow._fs.FileSystem.open_input_file
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
OSError: When reading information for key 'path/to/s3/table/location/metadata/100000-458c8ffc-de06-4eb5-bc4a-b94c3034a548.metadata.json' in bucket 's3_bucket_name': AWS Error UNKNOWN (HTTP status 400) during HeadObject operation: No response body.

I have checked I have the proper accesses, but it wasn't the issue. I have tried a few other things but they were all unsuccessful.

using load_glue instead of load_catalog
providing access_key and secret_key directly in the load_catalog call

The table was created via Trino, with the following definition:

create table catalog_name.table_name (
          "timestamp" timestamp,
          "type" varchar(20),
          distribution int,
          service int,
          code varchar(20),
          base_id bigint,
          counter_id bigint,
          "category" varchar(50),
          volume double)
        with (
          format = 'PARQUET',
          partitioning = ARRAY['day(timestamp)'],
          location = 's3://s3_bucket/path/to/table/folder/'
        )

kevinjqliu commented 3 months ago

OSError: When reading information for key 'path/to/s3/table/location/metadata/100000-458c8ffc-de06-4eb5-bc4a-b94c3034a548.metadata.json' in bucket 's3_bucket_name': AWS Error UNKNOWN (HTTP status 400) during HeadObject operation: No response body.

This seems to be an issue with reading the metadata file. Specifically, this line https://github.com/apache/iceberg-python/blob/781096eb0c71fa540357e0e6e3b51104ad6469ee/pyiceberg/catalog/glue.py#L320

What is the metadata_location of the table in the Glue catalog?

arookieds commented 3 months ago

Glue points to that same file.

I have tried reading this table using PySpark, and it worked. Nevertheless, PySpark isn't the ideal solution for my case.

kevinjqliu commented 3 months ago

If it works in PySpark, the problem is probably not in the Glue configuration but in pyiceberg.

Can you double-check the AWS settings? Your AWS profile looks like it can access the Glue catalog and read its content. Does it have permission to read the underlying s3 file?

Secondly,

OSError: When reading information for key 'path/to/s3/table/location/metadata/100000-458c8ffc-de06-4eb5-bc4a-b94c3034a548.metadata.json' in bucket 's3_bucket_name': AWS Error UNKNOWN (HTTP status 400) during HeadObject operation: No response body.

That S3 path looks fishy to me. Esp the prefix path/to/s3/table/location/metadata/ and no s3://. We can also check if PyArrow FS is parsing the metadata_location correctly

arookieds commented 3 months ago

Can you double-check the AWS settings? Your AWS profile looks like it can access the Glue catalog and read its content. Does it have permission to read the underlying s3 file?

Yes, the profile I am using can access the underlying files in S3.

That S3 path looks fishy to me. Esp the prefix path/to/s3/table/location/metadata/ and no s3://. We can also check if PyArrow FS is parsing the metadata_location correctly

The path I am using starts, indeed, with s3://.
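
For reference, the missing s3:// in the error message is probably expected, since the message reports the bucket and the key separately rather than the full URI. A minimal sketch of that split, using only the standard library; the helper name split_s3_uri and the sample file name are hypothetical, and this is not the parsing code that pyiceberg or PyArrow actually runs:

```python
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> tuple[str, str]:
    """Split an s3:// URI into (bucket, key). Hypothetical helper for illustration."""
    parsed = urlparse(uri)
    if parsed.scheme not in ("s3", "s3a", "s3n"):
        raise ValueError(f"not an S3 URI: {uri!r}")
    # The bucket is the "host" part; the key is the path without its leading slash.
    return parsed.netloc, parsed.path.lstrip("/")


bucket, key = split_s3_uri(
    "s3://s3_bucket_name/path/to/s3/table/location/metadata/file.metadata.json"
)
print(bucket)
print(key)
```

If the bucket and key printed for the real metadata_location match the ones in the OSError, the location is being parsed as intended and the problem lies elsewhere (credentials or region).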

kevinjqliu commented 3 months ago

The load_table operation is doing a couple of different things. Let's verify each step.

Getting the "glue table" object, using the _get_glue_table function

import pyiceberg.catalog
from pyiceberg.exceptions import NoSuchTableError

catalog = pyiceberg.catalog.load_catalog(
    catalog_name, **{"type": "glue", "profile_name": profile_name}
)

# `identifier` is the same value that was passed to catalog.load_table(...).
identifier_tuple = catalog.identifier_to_tuple_without_catalog(identifier)
database_name, table_name = catalog.identifier_to_database_and_table(identifier_tuple, NoSuchTableError)
glue_table = catalog._get_glue_table(database_name=database_name, table_name=table_name)
print(glue_table)

Look at glue table metadata location

properties = glue_table["Parameters"]
METADATA_LOCATION = "metadata_location"
metadata_location = properties[METADATA_LOCATION]
print(metadata_location)

Load the metadata file, check the io implementation

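from pyiceberg.io import load_file_io
from pyiceberg.serializers import FromInputFile

# Uses `catalog` and `metadata_location` from the previous steps.
io = load_file_io(properties=catalog.properties, location=metadata_location)
print(io)
file = io.new_input(metadata_location)
print(file)
metadata = FromInputFile.table_metadata(file)
print(metadata)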

kevinjqliu commented 3 months ago

https://github.com/apache/iceberg/issues/6820

similar sounding issue

geruh commented 3 months ago

Your Glue calls look fine, but your S3 calls are the problem. I was able to reproduce the issue by having the incorrect region for my AWS profile in ~/.aws/config and passing a different region config when initializing the catalog.

aws_config

[test]
region = us-east-1

catalog init

catalog = pyiceberg.catalog.load_catalog(
    catalog_name, **{"type": "glue", "profile_name": "test", "region_name": "us-west-2"}
)

Which leads to this exception

File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
OSError: When reading information for key 'test/metadata/00000-c0fc4e45-d79d-41a1-ba92-a4122c09171c.metadata.json' in bucket 'test_bucket': AWS Error UNKNOWN (HTTP status 301) during HeadObject operation: No response body.

It looks like when we infer the FileIO, the PyArrow filesystem doesn't use the AWS profile config, so the S3 calls might be delegated to the default profile instead.

https://github.com/apache/iceberg-python/blob/7fcdb8d25dfa2498ba98a2b8e8d2b327d85fa7c9/pyiceberg/io/pyarrow.py#L339-L357

We might need to feed the credentials into the session properties before inferring the FileIO in the GlueCatalog, so that we actually use the correct profile when reading from S3. For now, you should be able to work around this by ensuring the profile's region is in sync with the config passed into the catalog, or by passing the s3.region property into the catalog.
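
A minimal sketch of that workaround: build the catalog properties from a single region value so that the Glue client and the S3 FileIO cannot drift apart. The helper name glue_properties is hypothetical; the property keys are the ones used elsewhere in this thread.

```python
def glue_properties(profile_name: str, region: str) -> dict[str, str]:
    """Build catalog properties that use one region for both Glue and S3."""
    return {
        "type": "glue",
        "profile_name": profile_name,
        "region_name": region,  # read when the Glue (boto) client is created
        "s3.region": region,  # read by the S3 FileIO
    }


props = glue_properties("test", "us-west-2")
print(props)
# With pyiceberg installed, the dict would be passed on like this:
# catalog = pyiceberg.catalog.load_catalog(catalog_name, **props)
```

This only addresses the region; the credentials used for S3 may still come from the default profile.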

Edit: I just saw the message above; the fix is also there.

kevinjqliu commented 3 months ago

@geruh thanks for the explanation! Would you say this is a bug in how pyiceberg configures S3? I'm not familiar with the AWS profile config. It seems like if a profile config is passed in, we don't want to override other S3 options, such as region in this case.

geruh commented 3 months ago

No Problem!!

This could potentially be a bug if we assume that the catalog and FileIO (S3) share the same AWS profile configs. On one hand, having a single profile configuration is convenient for the user's boto client, as it allows initializing all AWS clients with the correct credentials. On the other hand, we could argue that this configuration should only work at the catalog level, and that filesystems might require separate configurations. I'm inclined towards the first option. However, we are using pyarrow's S3FileSystem implementation, which has no concept of an AWS profile. Therefore, we will need to resolve these values through boto's session.get_credentials() and pass them to the filesystem.
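
A rough sketch of that idea, not the eventual pyiceberg fix: take a boto3 session, resolve its credentials, and translate them into the s3.* properties that the FileIO reads. The function name s3_properties_from_session is hypothetical. It relies on session.get_credentials(), get_frozen_credentials() and session.region_name from boto3/botocore.

```python
from typing import Any


def s3_properties_from_session(session: Any) -> dict[str, str]:
    """Translate a boto3 session's resolved credentials into s3.* properties."""
    credentials = session.get_credentials()
    if credentials is None:
        raise RuntimeError("the session could not resolve any credentials")
    # Frozen credentials are a consistent snapshot (access key, secret key, token).
    frozen = credentials.get_frozen_credentials()
    props = {
        "s3.access-key-id": frozen.access_key,
        "s3.secret-access-key": frozen.secret_key,
    }
    if frozen.token:  # only present for temporary credentials
        props["s3.session-token"] = frozen.token
    if session.region_name:
        props["s3.region"] = session.region_name
    return props
```

With boto3 installed, this could be called as s3_properties_from_session(boto3.Session(profile_name=profile_name)) and the result merged into the properties passed to load_catalog. Temporary credentials captured this way are not refreshed once they expire.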

I'll raise an issue for this

kevinjqliu commented 3 months ago

thank you! should we close this in favor of #570?

arookieds commented 2 months ago

I have tried both solutions, i.e.:

setting the env variable to the proper AWS region
providing it within the function call

But I am always getting the same error:

Traceback (most recent call last):
  File "/path/to/the/project/testing.py", line 15, in <module>
    table = catalog.load_table((catalog_name, table_name))
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/to/the/project/venv/lib/python3.11/site-packages/pyiceberg/catalog/glue.py", line 473, in load_table
    return self._convert_glue_to_iceberg(self._get_glue_table(database_name=database_name, table_name=table_name))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/to/the/project/venv/lib/python3.11/site-packages/pyiceberg/catalog/glue.py", line 296, in _convert_glue_to_iceberg
    metadata = FromInputFile.table_metadata(file)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/to/the/project/venv/lib/python3.11/site-packages/pyiceberg/serializers.py", line 112, in table_metadata
    with input_file.open() as input_stream:
         ^^^^^^^^^^^^^^^^^
  File "/path/to/the/project/venv/lib/python3.11/site-packages/pyiceberg/io/pyarrow.py", line 263, in open
    input_file = self._filesystem.open_input_file(self._path)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/_fs.pyx", line 780, in pyarrow._fs.FileSystem.open_input_file
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
OSError: When reading information for key 'path/to/s3/table/location/metadata/100000-458c8ffc-de06-4eb5-bc4a-b94c3034a548.metadata.json' in bucket 's3_bucket_name': AWS Error UNKNOWN (HTTP status 400) during HeadObject operation: No response body.

geruh commented 2 months ago

Interesting. Can you run aws sts get-caller-identity in the terminal to ensure the right identity is being used?

You can also explicitly configure the S3 FileIO by passing the S3 configuration properties into the catalog.

    catalog = pyiceberg.catalog.load_catalog(catalog_name,
                                             **{
                                                 "type": "glue",
                                                 "profile_name": profile_name,
                                                 "s3.access-key-id": "access-key",
                                                 "s3.secret-access-key": "secret-access-key",
                                                 "s3.region": "us-east-1"
                                             })

hamzaezzi commented 2 months ago

Interesting. Can you run aws sts get-caller-identity in the terminal to ensure the right identity is being used?

You can also explicitly configure the S3 FileIO by passing the S3 configuration properties into the catalog.

    catalog = pyiceberg.catalog.load_catalog(catalog_name,
                                             **{
                                                 "type": "glue",
                                                 "profile_name": profile_name,
                                                 "s3.access-key-id": "access-key",
                                                 "s3.secret-access-key": "secret-access-key",
                                                 "s3.region": "us-east-1"
                                             })

This worked for me once I also added the session token for S3:

catalog = load_catalog(
    "default",
    **{
        "type": "glue",
        "aws_access_key_id": "ASAXXXXXXXXXX",
        "aws_secret_access_key": "0VLxnXXXXXXXXXXX",
        "aws_session_token": "IQJb3JpXXXXXXXXXXXXXXXXXXXXXXXX",
        "s3.access-key-id": "ASAXXXXXXXXXX",
        "s3.secret-access-key": "0VLxnXXXXXXXXXXX",
        "s3.session-token": "IQJb3JpXXXXXXXXXXXXXXXXXXXXXXXX",
        "s3.region": "eu-west-1",
        "region_name": "eu-west-1",
    },
)

anatol-ju commented 2 months ago

We have the same problem here. My manager and I tried to get it to work in parallel and both ran into the same error. We assumed it was a permission issue, but even with admin credentials it didn't work. We used an access token, tried to set the region manually, and provided an AWS profile name and, alternatively, the access keys. No success.

My guess is that it has something to do with the s3fs package used to read the metadata file.

impproductions commented 5 days ago

We had the same problem within our Airflow deployment. The easy fix for us would have been to set the default aws credentials through environment variables:

AWS_ACCESS_KEY_ID=<aws access key>
AWS_DEFAULT_REGION=<aws region>
AWS_SECRET_ACCESS_KEY=<aws secret key>

This, however, wasn't feasible because of deployment issues. Long story short, we ended up with this solution:

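from pyiceberg.catalog import load_catalog
from pyiceberg.catalog.glue import GlueCatalog

# Replace each "<...>" placeholder with your own value.
glue_catalog_conf = {
    # Read by the S3 FileIO.
    "s3.region": "<aws region>",
    "s3.access-key-id": "<aws access key>",
    "s3.secret-access-key": "<aws secret key>",
    # Read when the boto client for Glue is created.
    "region_name": "<aws region>",
    "aws_access_key_id": "<aws access key>",
    "aws_secret_access_key": "<aws secret key>",
}

catalog: GlueCatalog = load_catalog(
    "some_name",
    type="glue",
    **glue_catalog_conf
)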

If you come from a Google search, please take everything that follows with a grain of salt, because we have no previous experience with either pyiceberg or Airflow. Anyway.

We came to this conclusion (that we needed to pass both formats) because it seems that the boto client initialization expects one format (the second set in the above snippet):

class GlueCatalog(Catalog):
    def __init__(self, name: str, **properties: Any):
        super().__init__(name, **properties)

        session = boto3.Session(
            profile_name=properties.get("profile_name"),
            region_name=properties.get("region_name"),
            botocore_session=properties.get("botocore_session"),
            aws_access_key_id=properties.get("aws_access_key_id"),
            aws_secret_access_key=properties.get("aws_secret_access_key"),
            aws_session_token=properties.get("aws_session_token"),
        )
        self.glue: GlueClient = session.client("glue")

And the same set of properties is passed to the load_file_io pyiceberg function, which, to the extent of our very limited understanding, expects the other format (s3.stuff):

io = load_file_io(properties=self.properties, location=metadata_location)
file = io.new_input(metadata_location)
metadata = FromInputFile.table_metadata(file)
return Table(
    identifier=(self.name, database_name, table_name),
    metadata=metadata,
    metadata_location=metadata_location,
    io=self._load_file_io(metadata.properties, metadata_location),
    catalog=self,
)

We might be completely off base here, of course, and what ultimately convinced us to adopt the above solution is just that it works, while passing either set of credentials without the other wouldn't work for us.

We're using:

aiobotocore==2.13.1
boto3==1.34.51
botocore==1.34.131
[...]
pyiceberg==0.6.1

We're still unclear on whether it's indeed a bug or we're just using the APIs improperly. Any help would be appreciated.

Have a nice day!

kevinjqliu commented 4 days ago

@impproductions Thanks for the detailed explanation. Great catch!

Looking through the code, there's indeed an expectation for both AWS credential formats:

s3.access-key-id vs aws_access_key_id
s3.secret-access-key vs aws_secret_access_key

This issue exists for both glue and dynamodb catalogs https://github.com/search?q=repo%3Aapache%2Ficeberg-python+aws_secret_access_key+path%3A.py+-path%3Atests&type=code
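
Until that is resolved, a user-side stopgap could be to mirror each credential under both spellings before calling load_catalog. A minimal sketch, assuming the key pairs named in this thread are the relevant ones; the names CREDENTIAL_KEY_PAIRS and mirror_credentials are hypothetical and not part of pyiceberg:

```python
# Pairs of (boto-style key, FileIO-style key) named in this thread.
CREDENTIAL_KEY_PAIRS = [
    ("aws_access_key_id", "s3.access-key-id"),
    ("aws_secret_access_key", "s3.secret-access-key"),
    ("aws_session_token", "s3.session-token"),
    ("region_name", "s3.region"),
]


def mirror_credentials(properties: dict[str, str]) -> dict[str, str]:
    """Return a copy where each credential is present under both spellings."""
    result = dict(properties)
    for boto_key, s3_key in CREDENTIAL_KEY_PAIRS:
        if boto_key in result and s3_key not in result:
            result[s3_key] = result[boto_key]
        elif s3_key in result and boto_key not in result:
            result[boto_key] = result[s3_key]
    return result


props = mirror_credentials(
    {
        "type": "glue",
        "region_name": "eu-west-1",
        "aws_access_key_id": "<aws access key>",
        "aws_secret_access_key": "<aws secret key>",
    }
)
print(props)
```

Values that are already set under both spellings are left untouched, so an S3 region that deliberately differs from the Glue region is preserved.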

kevinjqliu commented 4 days ago

Opened #892 to track the issue with AWS credential formats

apache / iceberg-python

Unable to load an iceberg table from aws glue catalog #515
