Open arookieds opened 3 months ago
OSError: When reading information for key 'path/to/s3/table/location/metadata/100000-458c8ffc-de06-4eb5-bc4a-b94c3034a548.metadata.json' in bucket 's3_bucket_name': AWS Error UNKNOWN (HTTP status 400) during HeadObject operation: No response body.
This seems to be an issue with reading the metadata file. Specifically, this line https://github.com/apache/iceberg-python/blob/781096eb0c71fa540357e0e6e3b51104ad6469ee/pyiceberg/catalog/glue.py#L320
What is the metadata_location of the table in the Glue catalog?
Glue points to that same file:
I have tried reading this table using PySpark, and it worked. Nevertheless, PySpark isn't the ideal solution for my case.
If it works in PySpark, the problem is probably not in the Glue configuration but in pyiceberg.
Can you double-check the AWS settings? Your AWS profile looks like it can access the Glue catalog and read its content. Does it have permission to read the underlying s3 file?
Secondly,
OSError: When reading information for key 'path/to/s3/table/location/metadata/100000-458c8ffc-de06-4eb5-bc4a-b94c3034a548.metadata.json' in bucket 's3_bucket_name': AWS Error UNKNOWN (HTTP status 400) during HeadObject operation: No response body.
That S3 path looks fishy to me, especially the prefix path/to/s3/table/location/metadata/ and no s3://. We can also check if PyArrow FS is parsing the metadata_location correctly.
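To illustrate the parsing question: the error message reports the bucket and the key separately, so a missing s3:// in the message does not by itself show that the stored location lacks the scheme. Below is a minimal sketch of how an s3:// location splits into those two parts. The helper name split_s3_uri is hypothetical, and this is plain urllib parsing for illustration, not the code path PyArrow or pyiceberg actually uses.

```python
from typing import Tuple
from urllib.parse import urlparse


def split_s3_uri(location: str) -> Tuple[str, str]:
    """Split an s3:// location into (bucket, key).

    Hypothetical helper for illustration only; it is not part of
    pyiceberg or PyArrow.
    """
    parsed = urlparse(location)
    if parsed.scheme not in ("s3", "s3a", "s3n"):
        raise ValueError(f"Not an S3 location: {location!r}")
    # The bucket is the network location; the key is the path without
    # its leading slash.
    return parsed.netloc, parsed.path.lstrip("/")
```

Running it on the metadata_location stored in Glue should give the same bucket and key that appear in the OSError; if it does, the location itself is being parsed as expected.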
Can you double-check the AWS settings? Your AWS profile looks like it can access the Glue catalog and read its content. Does it have permission to read the underlying s3 file?
Yes, the profile I am using can access the underlying files in S3.
That S3 path looks fishy to me, especially the prefix path/to/s3/table/location/metadata/ and no s3://. We can also check if PyArrow FS is parsing the metadata_location correctly.
The path I am using does indeed start with s3://.
The load_table operation is doing a couple of different things. Let's verify each step.
Getting the "glue table" object, using the _get_glue_table function
import pyiceberg.catalog
from pyiceberg.exceptions import NoSuchTableError

# catalog_name, profile_name and identifier are the values from your own script
catalog = pyiceberg.catalog.load_catalog(
    catalog_name, **{"type": "glue", "profile_name": profile_name}
)
identifier_tuple = catalog.identifier_to_tuple_without_catalog(identifier)
database_name, table_name = catalog.identifier_to_database_and_table(identifier_tuple, NoSuchTableError)
glue_table = catalog._get_glue_table(database_name=database_name, table_name=table_name)
print(glue_table)
Look at glue table metadata location
properties = glue_table["Parameters"]
METADATA_LOCATION = "metadata_location"
metadata_location = properties[METADATA_LOCATION]
print(metadata_location)
Load the metadata file, check the io implementation
from pyiceberg.io import load_file_io
from pyiceberg.serializers import FromInputFile

io = load_file_io(properties=catalog.properties, location=metadata_location)
print(io)
file = io.new_input(metadata_location)
print(file)
metadata = FromInputFile.table_metadata(file)
print(metadata)
https://github.com/apache/iceberg/issues/6820
similar sounding issue
Your Glue calls look fine, but your S3 calls are the problem. I was able to reproduce the issue by having the incorrect region for my AWS profile at ~/.aws/config and passing in the region config upon initializing the catalog.
aws_config
[test]
region = us-east-1
catalog init
catalog = pyiceberg.catalog.load_catalog(
catalog_name, **{"type": "glue", "profile_name": "test", "region_name": "us-west-2"}
)
Which leads to this exception
File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
OSError: When reading information for key 'test/metadata/00000-c0fc4e45-d79d-41a1-ba92-a4122c09171c.metadata.json' in bucket 'test_bucket': AWS Error UNKNOWN (HTTP status 301) during HeadObject operation: No response body.
It looks like when we infer the correct FileIO, the PyArrow FS doesn't use the AWS profile config, so the calls might be delegated to the default profile instead.
We might need to feed the credentials into the session properties before inferring the FileIO in the GlueCatalog, so that we actually use the correct profile when reading from S3. For now you should be able to work around this by ensuring the profile's region is in sync with the config passed into the catalog, or by passing the s3.region property into the catalog.
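The workaround comes down to making sure every region setting agrees. Here is a minimal sketch of a check for that. find_region_mismatches is a hypothetical helper, not a pyiceberg API, and profile_region is assumed to be looked up by the caller (for example from boto3.Session(profile_name=...).region_name).

```python
from typing import Dict, List, Optional


def find_region_mismatches(
    properties: Dict[str, str], profile_region: Optional[str]
) -> List[str]:
    """Return the configured regions if they disagree, else an empty list.

    Hypothetical helper: `properties` is the dict passed to load_catalog,
    `profile_region` is the region of the AWS profile in use.
    """
    candidates = {
        "profile": profile_region,
        "region_name": properties.get("region_name"),
        "s3.region": properties.get("s3.region"),
    }
    # Ignore settings that are not configured at all.
    configured = {source: region for source, region in candidates.items() if region}
    if len(set(configured.values())) <= 1:
        return []
    return [f"{source} = {region}" for source, region in configured.items()]
```

With the reproduction above (profile in us-east-1, region_name set to us-west-2) this returns both entries, which is the situation that produced the HTTP 301.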
edit: just saw the message above; the fix is also there
@geruh thanks for the explanation! Would you say this is a bug in how pyiceberg configures S3? I'm not familiar with the AWS profile config. It seems like if a profile config is passed in, we don't want to override other S3 options, such as region in this case.
No Problem!!
This could potentially be a bug if we assume that the catalog and FileIO (S3) share the same AWS profile configs. On one side, having a single profile configuration is convenient for the user's boto client, as it allows initializing all AWS clients with the correct credentials. On the other hand, we could argue that this configuration should only work at the catalog level, and that filesystems might require separate configurations. I'm inclined towards the first option. However, we are using pyarrow's S3FileSystem implementation, which has no concept of an AWS profile. Therefore, we will need to initialize these values through boto's session.get_credentials() and pass them to the filesystem.
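A rough sketch of that idea, done on the user side until it is handled in the library: resolve the profile with boto3 and hand the result to the catalog as s3.* properties. The two function names are hypothetical; the property keys are the ones used elsewhere in this thread, and this is only one possible approach, not what pyiceberg does internally.

```python
from typing import Dict, Optional


def s3_properties(
    access_key: str,
    secret_key: str,
    token: Optional[str] = None,
    region: Optional[str] = None,
) -> Dict[str, str]:
    """Map plain credential values onto s3.* catalog properties."""
    props = {
        "s3.access-key-id": access_key,
        "s3.secret-access-key": secret_key,
    }
    # Session token and region are optional; only set them when known.
    if token:
        props["s3.session-token"] = token
    if region:
        props["s3.region"] = region
    return props


def s3_properties_from_profile(profile_name: Optional[str] = None) -> Dict[str, str]:
    """Resolve an AWS profile with boto3 and return s3.* properties (hypothetical helper)."""
    import boto3  # imported lazily so s3_properties() works without boto3 installed

    session = boto3.Session(profile_name=profile_name)
    credentials = session.get_credentials()
    if credentials is None:
        raise RuntimeError(f"No credentials found for profile {profile_name!r}")
    frozen = credentials.get_frozen_credentials()
    return s3_properties(
        frozen.access_key, frozen.secret_key, frozen.token, session.region_name
    )
```

The result could then be merged into the catalog properties, e.g. load_catalog(catalog_name, **{"type": "glue", "profile_name": profile_name, **s3_properties_from_profile(profile_name)}). Note that temporary credentials resolved this way are a snapshot and will not refresh on their own.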
I'll raise an issue for this
thank you! should we close this in favor of #570?
I have tried both solutions, i.e.:
Traceback (most recent call last):
File "/path/to/the/project/testing.py", line 15, in <module>
table = catalog.load_table((catalog_name, table_name))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/path/to/the/project/venv/lib/python3.11/site-packages/pyiceberg/catalog/glue.py", line 473, in load_table
return self._convert_glue_to_iceberg(self._get_glue_table(database_name=database_name, table_name=table_name))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/path/to/the/project/venv/lib/python3.11/site-packages/pyiceberg/catalog/glue.py", line 296, in _convert_glue_to_iceberg
metadata = FromInputFile.table_metadata(file)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/path/to/the/project/venv/lib/python3.11/site-packages/pyiceberg/serializers.py", line 112, in table_metadata
with input_file.open() as input_stream:
^^^^^^^^^^^^^^^^^
File "/path/to/the/project/venv/lib/python3.11/site-packages/pyiceberg/io/pyarrow.py", line 263, in open
input_file = self._filesystem.open_input_file(self._path)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "pyarrow/_fs.pyx", line 780, in pyarrow._fs.FileSystem.open_input_file
File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
OSError: When reading information for key 'path/to/s3/table/location/metadata/100000-458c8ffc-de06-4eb5-bc4a-b94c3034a548.metadata.json' in bucket 's3_bucket_name': AWS Error UNKNOWN (HTTP status 400) during HeadObject operation: No response body.
Interesting. Can you run aws sts get-caller-identity in the terminal to ensure the right identity is being used?
You can also explicitly set the S3FileIO by passing the S3 configuration properties into the catalog.
catalog = pyiceberg.catalog.load_catalog(catalog_name,
    **{
        "type": "glue",
        "profile_name": profile_name,  # GlueCatalog reads the "profile_name" key
        "s3.access-key-id": "access-key",
        "s3.secret-access-key": "secret-access-key",
        "s3.region": "us-east-1"
    })
Interesting. Can you run aws sts get-caller-identity in the terminal to ensure the right identity is being used?
You can also explicitly set the S3FileIO by passing the S3 configuration properties into the catalog.
catalog = pyiceberg.catalog.load_catalog(catalog_name, **{ "type": "glue", "profile_name": profile_name, "s3.access-key-id": "access-key", "s3.secret-access-key": "secret-access-key", "s3.region": "us-east-1" })
This worked for me when I also added the session token for S3:
catalog = load_catalog(
    "default",
    **{
        "type": "glue",
        "aws_access_key_id": "ASAXXXXXXXXXX",
        "aws_secret_access_key": "0VLxnXXXXXXXXXXX",
        "aws_session_token": "IQJb3JpXXXXXXXXXXXXXXXXXXXXXXXX",
        "s3.access-key-id": "ASAXXXXXXXXXX",
        "s3.secret-access-key": "0VLxnXXXXXXXXXXX",
        "s3.session-token": "IQJb3JpXXXXXXXXXXXXXXXXXXXXXXXX",
        "s3.region": "eu-west-1",
        "region_name": "eu-west-1",
    },
)
We have the same problem here. My manager and I tried to get it to work in parallel and both ran into the same error. We assumed it was a permission issue, but even with admin credentials it didn't work. We used an access token, tried to set the region manually, and provided the AWS profile name and, alternatively, the access keys. No success.
My guess is that it has something to do with the s3fs package used to read the metadata file.
We had the same problem within our Airflow deployment. The easy fix for us would have been to set the default aws credentials through environment variables:
AWS_ACCESS_KEY_ID=<aws access key>
AWS_DEFAULT_REGION=<aws region>
AWS_SECRET_ACCESS_KEY=<aws secret key>
This, however, wasn't feasible because of deployment issues. Long story short, we ended up with this solution:
glue_catalog_conf = {
    "s3.region": <aws region>,
    "s3.access-key-id": <aws access key>,
    "s3.secret-access-key": <aws secret key>,
    "region_name": <aws region>,
    "aws_access_key_id": <aws access key>,
    "aws_secret_access_key": <aws secret key>,
}
catalog: GlueCatalog = load_catalog(
"some_name",
type="glue",
**glue_catalog_conf
)
If you come from a Google search, please take everything that follows with a grain of salt, because we have no previous experience with either pyiceberg or Airflow. Anyway.
We came to this conclusion (that we needed to pass both formats) because it seems that the boto client initialization expects one format (the second set in the above snippet):
class GlueCatalog(Catalog):
    def __init__(self, name: str, **properties: Any):
        super().__init__(name, **properties)
        session = boto3.Session(
            profile_name=properties.get("profile_name"),
            region_name=properties.get("region_name"),
            botocore_session=properties.get("botocore_session"),
            aws_access_key_id=properties.get("aws_access_key_id"),
            aws_secret_access_key=properties.get("aws_secret_access_key"),
            aws_session_token=properties.get("aws_session_token"),
        )
        self.glue: GlueClient = session.client("glue")
And the same set of properties is passed to the load_file_io pyiceberg function, which, to the extent of our very limited understanding, expects the other format (s3.stuff):
io = load_file_io(properties=self.properties, location=metadata_location)
file = io.new_input(metadata_location)
metadata = FromInputFile.table_metadata(file)
return Table(
    identifier=(self.name, database_name, table_name),
    metadata=metadata,
    metadata_location=metadata_location,
    io=self._load_file_io(metadata.properties, metadata_location),
    catalog=self,
)
We might be completely off base here, of course, and what ultimately convinced us to adopt the above solution is just that it works, while passing either set of credentials without the other wouldn't work for us.
We're using:
aiobotocore==2.13.1
boto3==1.34.51
botocore==1.34.131
[...]
pyiceberg==0.6.1
We're still unclear on whether it's indeed a bug or we're just using the APIs improperly, any help would be appreciated.
Have a nice day!
@impproductions Thanks for the detailed explanation. Great catch!
Looking through the code, there's indeed an expectation for both AWS credential formats.
s3.access-key-id vs aws_access_key_id
s3.secret-access-key vs aws_secret_access_key
This issue exists for both glue and dynamodb catalogs
https://github.com/search?q=repo%3Aapache%2Ficeberg-python+aws_secret_access_key+path%3A.py+-path%3Atests&type=code
Opened #892 to track the issue with AWS credential formats
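Until the two formats are reconciled, the duplication in the workaround above could be generated rather than typed twice. A minimal sketch, assuming the key pairs that appear in this thread; with_s3_properties and BOTO_TO_S3_KEYS are hypothetical names, not part of pyiceberg.

```python
from typing import Dict

# boto-style key -> FileIO-style key, as used in the snippets in this thread
BOTO_TO_S3_KEYS = {
    "aws_access_key_id": "s3.access-key-id",
    "aws_secret_access_key": "s3.secret-access-key",
    "aws_session_token": "s3.session-token",
    "region_name": "s3.region",
}


def with_s3_properties(properties: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of `properties` with missing s3.* keys filled in.

    Explicit s3.* values are left untouched, so a different S3 region or
    separate S3 credentials can still be configured on purpose.
    """
    merged = dict(properties)
    for boto_key, s3_key in BOTO_TO_S3_KEYS.items():
        if boto_key in merged and s3_key not in merged:
            merged[s3_key] = merged[boto_key]
    return merged
```

Usage would look like load_catalog("some_name", **with_s3_properties({"type": "glue", "region_name": ..., "aws_access_key_id": ..., "aws_secret_access_key": ...})).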
Question
PyIceberg version: 0.6.0 Python version: 3.11.1
Comments:
Hi,
I am facing issues loading Iceberg tables from AWS Glue. The code I am using is as follows:
The code allows me to:
But load_table throws the following error:
I have checked that I have the proper access, but that wasn't the issue. I have tried a few other things, but they were all unsuccessful.
The table definition is as follows and was created via Trino: