Hmm, that is a bit of a puzzle. The error occurs quite deep in the underlying object storage, so we will have to explore a bit to get to the root of this.
I am not too familiar with AWS settings, but could we start by simplifying these? First, omit AWS_ENDPOINT_URL, since it is mostly useful for non-standard endpoints, e.g. with MinIO or localstack. Essentially, try something like this for the storage options:
storage_options = {
"AWS_REGION": aws_region,
"AWS_ACCESS_KEY_ID": aws_access_key_id,
"AWS_SECRET_ACCESS_KEY": aws_secret_access_key,
}
The actual format of the endpoint URL also includes the scheme, i.e. "https://s3.<region>.amazonaws.com", so if that is the cause here, we should probably catch it sooner and raise a more informative error.
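If an explicit endpoint really is needed, this is a sketch of what the full options might look like with the scheme included (region and keys are placeholders; for plain AWS S3 the endpoint can usually be omitted entirely):
storage_options = {
    "AWS_REGION": aws_region,
    "AWS_ACCESS_KEY_ID": aws_access_key_id,
    "AWS_SECRET_ACCESS_KEY": aws_secret_access_key,
    # Only needed for non-standard endpoints; the scheme must be part of the URL.
    "AWS_ENDPOINT_URL": f"https://s3.{aws_region}.amazonaws.com",
}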
Hi @roeap, thanks for your reply. I tried to launch the code twice, the first time changing the URL scheme to "https://s3.<region>.amazonaws.com" and the second time removing AWS_ENDPOINT_URL altogether. In both cases I receive the following error:
Traceback (most recent call last):
File "test.py", line 17, in <module>
dt = DeltaTable(table_uri=path,
File "<local_path>/lib/python3.8/site-packages/deltalake/table.py", line 91, in __init__
self._table = RawDeltaTable(
deltalake.PyDeltaTableError: Failed to read delta log object: Generic S3 error: Error performing get request <correct_s3_path>/_delta_log/00000000000000000000.json: response error "No Body", after 0 retries: HTTP status client error (403 Forbidden) for url (https://s3.<aws_region>.amazonaws.com/<correct_s3_path>/_delta_log/00000000000000000000.json)
The response status says "403 Forbidden". The good news is that we are talking to the service and getting a response back, so that part of the config is correct now. As the status implies, something is off with the credentials. The failing request is a simple GET, so read rights should suffice. I'm not sure how AWS handles it, but "list" permission is also required and is sometimes assigned separately.
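One way to sanity-check this outside of delta-rs (just a sketch, assuming boto3 is installed; bucket and prefix are placeholders taken from your table URI):
import boto3

# Verify that the same static credentials can both list and get the first log file.
s3 = boto3.client(
    "s3",
    region_name=aws_region,
    aws_access_key_id=aws_access_key_id,
    aws_secret_access_key=aws_secret_access_key,
)
resp = s3.list_objects_v2(Bucket=bucket, Prefix=f"{prefix}/_delta_log/")
print([obj["Key"] for obj in resp.get("Contents", [])])
s3.head_object(Bucket=bucket, Key=f"{prefix}/_delta_log/00000000000000000000.json")
If either call fails with the same 403, the problem is with the credentials or bucket policy rather than with delta-rs.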
Hmm @roeap, not sure what is failing in terms of permissions then. It is true that ideally I want to access with a role, and setting my storage options as
storage_options = {
"AWS_REGION": aws_region,
"AWS_ROLE_ARN": aws_role_arn
}
gives me a
Traceback (most recent call last):
File "test.py", line 17, in <module>
dt = DeltaTable(table_uri=path,
File "<local_path>/lib/python3.8/site-packages/deltalake/table.py", line 91, in __init__
self._table = RawDeltaTable(
deltalake.PyDeltaTableError: Failed to load checkpoint: Failed to read checkpoint content: Generic S3 error: response error "request error", after 0 retries: error sending request for url (http://169.254.169.254/latest/api/token): error trying to connect: tcp connect error: No route to host (os error 113)
My role definitely has full read permissions, since I can read and download the files under my uri_path, for instance. It seems like it is trying to get the credentials from the metadata endpoint, but I'm running this locally for now.
I don't have much experience with AWS, but I checked the code in the underlying object store crate. The metadata API endpoint is hardcoded; the unit tests use a local instance to mock the metadata endpoint, but that is only used in tests of lower-level functions.
Not sure how using the metadata endpoint locally would work, but as of now we unfortunately cannot support it. One way I could think of, if you need to get it to work locally, is to edit your hosts file so that the IP resolves to whatever local instance you have running.
If you can get an access key, I guess that would be easier. Then again, I never really worked with AWS beyond tinkering :)
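Another thought, if you can assume the role yourself locally: fetch temporary credentials via STS and pass them through the storage options. This is an untested sketch on my side, assuming boto3 is available and that the bindings accept AWS_SESSION_TOKEN; aws_role_arn, aws_region and path are placeholders from your setup.
import boto3
from deltalake import DeltaTable

# Assume the role explicitly and hand the temporary credentials to delta-rs.
sts = boto3.client("sts", region_name=aws_region)
creds = sts.assume_role(RoleArn=aws_role_arn, RoleSessionName="delta-rs-local")["Credentials"]

storage_options = {
    "AWS_REGION": aws_region,
    "AWS_ACCESS_KEY_ID": creds["AccessKeyId"],
    "AWS_SECRET_ACCESS_KEY": creds["SecretAccessKey"],
    "AWS_SESSION_TOKEN": creds["SessionToken"],  # assumed to be honored by the S3 backend
}
dt = DeltaTable(table_uri=path, storage_options=storage_options)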
Are you able to list the S3 object <correct_s3_path>/_delta_log/00000000000000000000.json using the aws cli with the same AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables?
For the metadata endpoint failure (with role access), are you sure you ran the code inside an AWS environment such as an EC2 instance?
Hi @roeap @houqp. In my case it is not the account that has the permissions to list the objects in the bucket; it is the AWS role inside the account that has them, which is why the most natural thing for me would be for the code to work with the AWS_ROLE_ARN parameter. As I said, it seems like it is trying to get the credentials from the metadata endpoint instead of looking locally. I'm testing this locally, but I do have the AWS credentials file locally, with all the credentials that the usual AWS-compatible libraries use to access AWS. The idea would be to use this library inside an AWS Lambda for the moment, not EC2. For instance, the delta lake reader library was able to read and access the data without problems; I then had other issues, and also out of concern about future maintenance, I decided to try the library officially supported by Delta Lake.
I experience the same error on my Apple silicon (M1) Mac. I want to write a DeltaTable to a clean/empty AWS S3 bucket.
First, I tried to go for
import boto3
from deltalake import DeltaTable

# configuration
session = boto3.Session(profile_name='myprofile-with-read-write-access')
credentials = session.get_credentials()
s3_key = 'mykey'
s3_bucket = 'mybucket'
aws_region = 'eu-central-1'
storage_options = {
'AWS_ACCESS_KEY_ID': credentials.access_key,
'AWS_SECRET_ACCESS_KEY': credentials.secret_key,
'AWS_REGION': aws_region,
'AWS_ENDPOINT_URL': f'{s3_bucket}.s3-{aws_region}.amazonaws.com'
}
delta_table_config = dict(
table_uri=f's3://{s3_bucket}/delta/{s3_key}',
version=None,
storage_options=storage_options,
)
dt = DeltaTable(**delta_table_config)
and I got a similar error as in https://github.com/delta-io/delta-rs/issues/564:
PyDeltaTableError: Not a Delta table: No snapshot or version 0 found, perhaps s3://mybucket/delta/mykey is an empty dir?
Then, I tried to write a delta table to the S3 directory with the config variables from above:
import pandas as pd
from deltalake import write_deltalake

df = pd.DataFrame(
data={
'some_int': [1, 2, 3],
'some_string': ['one', 'two', 'three']
},
)
write_deltalake(
table_or_uri=delta_table_config['table_uri'],
data=df,
storage_options=storage_options,
)
where I receive the following error:
thread '<unnamed>' panicked at 'not stream', /Users/runner/.cargo/registry/src/github.com-1ecc6299db9ec823/object_store-0.5.0/src/aws/credential.rs:170:14
If I comment out AWS_ENDPOINT_URL, I receive a 403 when doing the multipart upload:
thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: Generic { store: "S3", source: CreateMultipartRequest { source: Error { retries: 0, message: "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<Error><Code>SignatureDoesNotMatch</Code><Message>The request signature we calculated does not match the signature you provided. Check your key and signing method.</Message><AWSAccessKeyId>REDACTED</AWSAccessKeyId><StringToSign>REDACTED</StringToSign><SignatureProvided>REDACTED</SignatureProvided><StringToSignBytes>REDACTED</StringToSignBytes><CanonicalRequest>POST\n/mybucket/delta/mykey/0-259f66b2-a287-4346-bf0d-35352ab4685e-0.parquet\nuploads=\nhost:s3.eu-central-1.amazonaws.com\nx-amz-content-sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855\nx-amz-date:20221108T133501Z\n\nhost;x-amz-content-sha256;x-amz-date\ne3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855</CanonicalRequest><CanonicalRequestBytes>REDACTED FOR BREVITY</CanonicalRequestBytes><RequestId>REDACTED</RequestId><HostId>REDACTED</HostId></Error>", source: reqwest::Error { kind: Status(403), url: Url { scheme: "https", cannot_be_a_base: false, username: "", password: None, host: Some(Domain("s3.eu-central-1.amazonaws.com")), port: None, path: "/mybucket/delta/mykey/0-259f66b2-a287-4346-bf0d-35352ab4685e-0.parquet", query: Some("uploads"), fragment: None } } } } }', python/src/filesystem.rs:404:71
I hope that my dump helps in debugging deltalake. I will now try to create the empty folders on S3 and see if this helps.
I just created the folders via the CLI
aws s3api put-object --profile myprofile-with-read-write-access --bucket mybucket --key delta/mykey
and got the same result as above.
EDIT: My current workaround is to create the DeltaTable locally and copy the files to S3 using the aws cli. You only have to do this when you need to change the schema.
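Roughly, the workaround looks like this (just a sketch; the local path and the bucket/key are placeholders):
import pandas as pd
from deltalake import write_deltalake

# Write the table to a local directory first...
df = pd.DataFrame({"some_int": [1, 2, 3], "some_string": ["one", "two", "three"]})
write_deltalake("./local_delta_table", df)
# ...then copy it to S3 out of band, e.g.
#   aws s3 cp ./local_delta_table s3://mybucket/delta/mykey --recursive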
I’m getting similar issues as well.
I am experiencing the same issue.
I am trying to test delta-rs 0.8.1 with a local k8s MinIO.
I get
thread '<unnamed>' panicked at 'not stream', /Users/runner/.cargo/registry/src/github.com-1ecc6299db9ec823/object_store-0.5.5/src/aws/credential.rs:173:27
File "/Users/simon/Library/Caches/pypoetry/virtualenvs/heureka-MBcEyvac-py3.9/lib/python3.9/site-packages/deltalake/table.py", line 122, in __init__
self._table = RawDeltaTable(
pyo3_runtime.PanicException: not stream
storage_options = {"AWS_ACCESS_KEY_ID": os.environ["AWS_ACCESS_KEY_ID"],
"AWS_SECRET_ACCESS_KEY": os.environ["AWS_SECRET_ACCESS_KEY"],
"AWS_REGION": aws_region,
"AWS_ENDPOINT_URL": "localhost:30000" # I also tried, trino-minio-svc.trino:9000, and host.docker.internal:30000
}
I know my delta table in MinIO works fine because I also use it with Trino; I was keen to test with MinIO here instead of AWS S3.
DeltaTable(
table_uri=table_uri,
storage_options=self.storage_options,
version=version,
)
I resolved this issue by defining my local MinIO service in K8s like this:
apiVersion: v1
kind: Service
metadata:
name: trino-minio-svc
namespace: trino
spec:
type: NodePort
ports:
- name: "9000"
port: 9000
targetPort: 9000
nodePort: 30000
- name: "9001"
port: 9001
targetPort: 9001
nodePort: 30001
selector:
app: minio
I then simply used:
@staticmethod
def _setup_storage_options(aws_region: str) -> Dict[str, str]:
os.environ["AWS_S3_ALLOW_UNSAFE_RENAME"] = "true"
os.environ["AWS_STORAGE_ALLOW_HTTP"] = "1"
return {
"AWS_ACCESS_KEY_ID": os.environ["AWS_ACCESS_KEY_ID"],
"AWS_SECRET_ACCESS_KEY": os.environ["AWS_SECRET_ACCESS_KEY"],
"AWS_REGION": aws_region,
"AWS_ENDPOINT_URL": "http://localhost:30000",
}
And this worked. It was also important to use port 9000 -> 30000, since this is MinIO's API port; 9001 -> 30001 is the UI port.
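For completeness, a sketch of reading a table through that NodePort endpoint with the same settings inlined (credentials, region, bucket and table path are placeholders):
import os
from deltalake import DeltaTable

# Allow plain HTTP and unsafe rename for the local MinIO setup, as above.
os.environ["AWS_STORAGE_ALLOW_HTTP"] = "1"
os.environ["AWS_S3_ALLOW_UNSAFE_RENAME"] = "true"

storage_options = {
    "AWS_ACCESS_KEY_ID": "minioadmin",        # placeholder
    "AWS_SECRET_ACCESS_KEY": "minioadmin",    # placeholder
    "AWS_REGION": "us-east-1",
    "AWS_ENDPOINT_URL": "http://localhost:30000",
}
dt = DeltaTable("s3://my-bucket/my-table", storage_options=storage_options)
print(dt.files())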
I'm facing this same issue. I'm not able to save the dataframe as a delta table in an AWS S3 bucket.
I have solved it; use this:
delta_table = DeltaTable(table_location) # table_location is s3://<table_location>
write_deltalake(
delta_table,
df,
mode="append",
storage_options=storage_options,
)
It won't work if you are trying to do something like this:
write_deltalake(
table_location, # String table location.
df,
mode="append",
storage_options=storage_options,
)
I guess there is some internal implementation issue.
Seems to be a very old issue, closing this one :) Please tag me if you are still facing it and I'll reopen.
Environment
Delta-rs version: 0.6.1
Binding: python
Environment:
Bug
What happened: Reading a delta table with:
DeltaTable(
    table_uri=self.path_to_table,
    storage_options={
        "AWS_REGION": aws_region,
        "AWS_ENDPOINT_URL": f"s3.{aws_region}.amazonaws.com",
        "AWS_ACCESS_KEY_ID": f"{aws_access_key_id}",
        "AWS_SECRET_ACCESS_KEY": f"{aws_secret_access_key}",
    },
)
yields the following error:
What you expected to happen: Read delta table successfully
How to reproduce it: see what happened
More details: