delta-io / delta-rs

A native Rust library for Delta Lake, with bindings into Python
https://delta-io.github.io/delta-rs/
Apache License 2.0

pyo3_runtime.PanicException: not stream while reading DeltaTable #809

Closed · AlessandroBiagi closed 7 months ago

AlessandroBiagi commented 2 years ago

Environment

Delta-rs version: 0.6.1

Binding: python


Bug

What happened: Reading a delta table with:

DeltaTable(
    table_uri=self.path_to_table,
    storage_options={
        "AWS_REGION": aws_region,
        "AWS_ENDPOINT_URL": f"s3.{aws_region}.amazonaws.com",
        "AWS_ACCESS_KEY_ID": f"{aws_access_key_id}",
        "AWS_SECRET_ACCESS_KEY": f"{aws_secret_access_key}",
    },
)

yields the following error:

thread  '<unnamed>' panicked at 'not stream', /root/.cargo/git/checkouts/arrow-rs-25d656fcab36794a/5f56754/object_store/src/aws/credential.rs:170:14
stack backtrace:
   0:     0x7f168654043d - <unknown>
   1:     0x7f16865682bc - <unknown>
   2:     0x7f168653b301 - <unknown>
   3:     0x7f1686541bb5 - <unknown>
   4:     0x7f16865418d6 - <unknown>
   5:     0x7f1686542146 - <unknown>
   6:     0x7f1686542037 - <unknown>
   7:     0x7f16865408f4 - <unknown>
   8:     0x7f1686541d69 - <unknown>
   9:     0x7f16859da073 - <unknown>
  10:     0x7f1686565021 - <unknown>
  11:     0x7f1686564fcb - <unknown>
  12:     0x7f16859d9ee6 - <unknown>
  13:     0x7f1685e383e3 - <unknown>
  14:     0x7f1685e5ad2b - <unknown>
  15:     0x7f1685e5f927 - <unknown>
  16:     0x7f1685b1a192 - <unknown>
  17:     0x7f1685b24980 - <unknown>
  18:     0x7f1685a05e16 - <unknown>
  19:     0x7f1685a0703c - <unknown>
  20:     0x7f1685a06627 - <unknown>
  21:     0x7f1685a56b2e - <unknown>
  22:     0x7f1685a2d33d - <unknown>
  23:     0x7f1685a3aad4 - <unknown>
  24:     0x7f1685a6492d - <unknown>
  25:     0x7f1685a73c92 - <unknown>
  26:     0x7f16859f1bcd - <unknown>
  27:     0x7f1685a79972 - <unknown>
  28:           0x5f3d03 - _PyObject_MakeTpCall
  29:           0x570af9 - _PyEval_EvalFrameDefault
  30:           0x56939a - _PyEval_EvalCodeWithName
  31:           0x5f6a13 - _PyFunction_Vectorcall
  32:           0x59bfb7 - <unknown>
  33:           0x5f3d7f - _PyObject_MakeTpCall
  34:           0x570af9 - _PyEval_EvalFrameDefault
  35:           0x56939a - _PyEval_EvalCodeWithName
  36:           0x68d047 - PyEval_EvalCode
  37:           0x67e351 - <unknown>
  38:           0x67e3cf - <unknown>
  39:           0x67e471 - <unknown>
  40:           0x67e817 - PyRun_SimpleFileExFlags
  41:           0x6b6fe2 - Py_RunMain
  42:           0x6b736d - Py_BytesMain
  43:     0x7f16879640b3 - __libc_start_main
  44:           0x5fa5ce - _start
  45:                0x0 - <unknown>

Traceback (most recent call last):
  File "test.py", line 15, in <module>
    dt = DeltaTable(table_uri=path,
  File "<local_path>/python3.8/site-packages/deltalake/table.py", line 91, in __init__
    self._table = RawDeltaTable(
pyo3_runtime.PanicException: not stream

What you expected to happen: the Delta table to be read successfully

How to reproduce it: run the snippet under "What happened" above

roeap commented 2 years ago

Hmm, that is a bit of a puzzle. The error occurs quite deep inside the underlying object store. We will have to explore a bit to get to the root of this.

I am not too familiar with AWS settings, but could we start by simplifying them? First, omit AWS_ENDPOINT_URL, since it is mostly useful for non-standard endpoints such as MinIO or LocalStack. Essentially, try something like this for the storage options:

storage_options = {
    "AWS_REGION": aws_region,
    "AWS_ACCESS_KEY_ID": aws_access_key_id,
    "AWS_SECRET_ACCESS_KEY":  aws_secret_access_key,
}

The actual format of the endpoint URL also includes the scheme, i.e. "https://s3.<region>.amazonaws.com". If a missing scheme is indeed the cause here, we should probably try to catch that sooner and raise a more informative error.
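
For reference, a minimal sketch with an explicit, scheme-prefixed endpoint (the region value is a placeholder; for standard AWS S3 the endpoint can usually be omitted entirely):

aws_region = "eu-central-1"  # placeholder

storage_options = {
    "AWS_REGION": aws_region,
    "AWS_ACCESS_KEY_ID": aws_access_key_id,
    "AWS_SECRET_ACCESS_KEY": aws_secret_access_key,
    # note the scheme - an endpoint like "s3.<region>.amazonaws.com"
    # without "https://" appears to be what triggers the panic
    "AWS_ENDPOINT_URL": f"https://s3.{aws_region}.amazonaws.com",
}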

AlessandroBiagi commented 2 years ago

Hi @roeap, thanks for your reply. I ran the code twice: first with the endpoint changed to "https://s3.<region>.amazonaws.com", then with AWS_ENDPOINT_URL removed entirely. In both cases I get the following error:

Traceback (most recent call last):
  File "test.py", line 17, in <module>
    dt = DeltaTable(table_uri=path,
  File "<local_path>/lib/python3.8/site-packages/deltalake/table.py", line 91, in __init__
    self._table = RawDeltaTable(
deltalake.PyDeltaTableError: Failed to read delta log object: Generic S3 error: Error performing get request <correct_s3_path>/_delta_log/00000000000000000000.json: response error "No Body", after 0 retries: HTTP status client error (403 Forbidden) for url (https://s3.<aws_region>.amazonaws.com/<correct_s3_path>/_delta_log/00000000000000000000.json)

roeap commented 2 years ago

The response status says "403 Forbidden". The good news is that we are talking to the service and getting a response back, so that part of the config is correct now. As the status implies, something is off with the credentials. The failing request is a simple GET, so read rights should suffice. I am not sure how AWS handles it, but "list" permission is also required and is sometimes assigned separately.
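
As a quick sanity check outside of deltalake, a boto3 sketch that exercises those same two permissions (bucket, key, and credential values are placeholders):

import boto3

aws_region = "eu-central-1"          # placeholders
aws_access_key_id = "AKIA..."
aws_secret_access_key = "..."

s3 = boto3.client(
    "s3",
    region_name=aws_region,
    aws_access_key_id=aws_access_key_id,
    aws_secret_access_key=aws_secret_access_key,
)

# s3:GetObject - the call that is currently returning 403
s3.get_object(Bucket="mybucket", Key="path/to/table/_delta_log/00000000000000000000.json")

# s3:ListBucket - also needed, and often granted separately
s3.list_objects_v2(Bucket="mybucket", Prefix="path/to/table/_delta_log/")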

AlessandroBiagi commented 2 years ago

Hmm, @roeap, I am not sure what is failing in terms of permissions then. Ideally I want to access the table with a role, and setting my storage options as

storage_options = {
    "AWS_REGION": aws_region,
    "AWS_ROLE_ARN": aws_role_arn
}

gives me:

Traceback (most recent call last):
  File "test.py", line 17, in <module>
    dt = DeltaTable(table_uri=path,
  File "<local_path>/lib/python3.8/site-packages/deltalake/table.py", line 91, in __init__
    self._table = RawDeltaTable(
deltalake.PyDeltaTableError: Failed to load checkpoint: Failed to read checkpoint content: Generic S3 error: response error "request error", after 0 retries: error sending request for url (http://169.254.169.254/latest/api/token): error trying to connect: tcp connect error: No route to host (os error 113)

My role definitely has full read permissions; I can read and download the files under my uri_path, for instance. It seems like it is trying to get the credentials from the instance metadata endpoint, but I am running this locally for now.

roeap commented 2 years ago

I don't have much experience with AWS, but I checked the code in the underlying object_store crate. The metadata API endpoint is hardcoded; the unit tests use a local instance to mock the metadata endpoint, but that is only used in tests of lower-level functions.

I am not sure how using the metadata endpoint locally works, but as of now we unfortunately cannot support it. One way I could think of, if you need to get it to work locally, is to edit your hosts file so that the IP gets resolved to whatever local instance you have running.

If you can get an access key, I guess that would be easier. Then again, I never really worked with AWS beyond tinkering :)

houqp commented 2 years ago

Are you able to list the S3 object <correct_s3_path>/_delta_log/00000000000000000000.json using the AWS CLI with the same AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables?

For the metadata endpoint failure (with role access), are you sure you ran the code inside an AWS environment such as an EC2 instance?

AlessandroBiagi commented 2 years ago

Hi @roeap @houqp. In my case it is not the account that has permission to list the objects in the bucket; it is the AWS role inside the account. That is why the most natural approach for me would be for the code to work with the AWS_ROLE_ARN parameter. As I said, it seems to be trying to get credentials from the instance metadata endpoint rather than looking locally. I am testing this locally, but I do have the local AWS credentials file that the usual AWS-compatible libraries use for access.

The idea is to use this library inside an AWS Lambda for the moment, not EC2. The delta lake reader library, for instance, was able to read and access the data without problems, but I ran into other issues with it, and out of concern for future maintenance I decided to try the library officially supported by Delta Lake.
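
One possible way to make role-based access work from a local machine, as a sketch (untested; it assumes the storage layer accepts an AWS_SESSION_TOKEN storage option, which the underlying object_store crate supports): assume the role explicitly with STS and pass the temporary credentials through:

import boto3
from deltalake import DeltaTable

aws_region = "eu-central-1"                                # placeholders
aws_role_arn = "arn:aws:iam::123456789012:role/my-reader"

# Assume the role via STS instead of relying on the instance metadata
# endpoint, which is unreachable outside of AWS.
sts = boto3.client("sts", region_name=aws_region)
creds = sts.assume_role(
    RoleArn=aws_role_arn,
    RoleSessionName="delta-rs-local",
)["Credentials"]

dt = DeltaTable(
    table_uri="s3://mybucket/path/to/table",
    storage_options={
        "AWS_REGION": aws_region,
        "AWS_ACCESS_KEY_ID": creds["AccessKeyId"],
        "AWS_SECRET_ACCESS_KEY": creds["SecretAccessKey"],
        "AWS_SESSION_TOKEN": creds["SessionToken"],  # required for temporary credentials
    },
)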

tafaust commented 1 year ago

I experience the same error on my Apple Silicon M1 Mac. I want to write a DeltaTable to a clean/empty AWS S3 bucket.

First, I tried the following:

# configuration
session = boto3.Session(profile_name='myprofile-with-read-write-access')
credentials = session.get_credentials()
s3_key = 'mykey'
s3_bucket = 'mybucket'
aws_region = 'eu-central-1'
storage_options = {
    'AWS_ACCESS_KEY_ID': credentials.access_key,
    'AWS_SECRET_ACCESS_KEY': credentials.secret_key,
    'AWS_REGION': aws_region,
    'AWS_ENDPOINT_URL': f'{s3_bucket}.s3-{aws_region}.amazonaws.com'
}
delta_table_config = dict(
    table_uri=f's3://{s3_bucket}/delta/{s3_key}',
    version=None,
    storage_options=storage_options,
)

dt = DeltaTable(**delta_table_config)

and I got an error similar to the one in https://github.com/delta-io/delta-rs/issues/564:

PyDeltaTableError: Not a Delta table: No snapshot or version 0 found, perhaps s3://mybucket/delta/mykey is an empty dir?

Then, I tried to write a delta table to the S3 directory with the config variables from above:

df = pd.DataFrame(
    data={
        'some_int': [1, 2, 3],
        'some_string': ['one', 'two', 'three']
    },
)
write_deltalake(
    table_or_uri=delta_table_config['table_uri'],
    data=df,
    storage_options=storage_options,
)

which gives me the following error:

thread '<unnamed>' panicked at 'not stream', /Users/runner/.cargo/registry/src/github.com-1ecc6299db9ec823/object_store-0.5.0/src/aws/credential.rs:170:14

If I comment out AWS_ENDPOINT_URL, I receive a 403 during the multipart upload:

thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: Generic { store: "S3", source: CreateMultipartRequest { source: Error { retries: 0, message: "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<Error><Code>SignatureDoesNotMatch</Code><Message>The request signature we calculated does not match the signature you provided. Check your key and signing method.</Message><AWSAccessKeyId>REDACTED</AWSAccessKeyId><StringToSign>REDACTED</StringToSign><SignatureProvided>REDACTED</SignatureProvided><StringToSignBytes>REDACTED</StringToSignBytes><CanonicalRequest>POST\n/mybucket/delta/mykey/0-259f66b2-a287-4346-bf0d-35352ab4685e-0.parquet\nuploads=\nhost:s3.eu-central-1.amazonaws.com\nx-amz-content-sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855\nx-amz-date:20221108T133501Z\n\nhost;x-amz-content-sha256;x-amz-date\ne3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855</CanonicalRequest><CanonicalRequestBytes>REDACTED FOR BREVITY</CanonicalRequestBytes><RequestId>REDACTED</RequestId><HostId>REDACTED</HostId></Error>", source: reqwest::Error { kind: Status(403), url: Url { scheme: "https", cannot_be_a_base: false, username: "", password: None, host: Some(Domain("s3.eu-central-1.amazonaws.com")), port: None, path: "/mybucket/delta/mykey/0-259f66b2-a287-4346-bf0d-35352ab4685e-0.parquet", query: Some("uploads"), fragment: None } } } } }', python/src/filesystem.rs:404:71

I hope this dump helps in debugging deltalake. I will try creating the empty folders on S3 now and see if that helps.
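
For what it is worth, the AWS_ENDPOINT_URL in the first snippet above has no scheme and embeds the bucket name. Per roeap's earlier note, a sketch of the same options with a scheme-prefixed, region-only endpoint:

storage_options = {
    'AWS_ACCESS_KEY_ID': credentials.access_key,
    'AWS_SECRET_ACCESS_KEY': credentials.secret_key,
    'AWS_REGION': aws_region,
    # Region-only endpoint with an explicit scheme; the bucket belongs in
    # the table URI, not in the endpoint. For standard AWS S3 this entry
    # can simply be dropped.
    'AWS_ENDPOINT_URL': f'https://s3.{aws_region}.amazonaws.com',
}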

tafaust commented 1 year ago

I just created the folders via the CLI:

aws s3api put-object --profile myprofile-with-read-write-access --bucket mybucket --key delta/mykey

and got the same result as above.


EDIT: My current workaround is to create the DeltaTable locally and copy the files to S3 using the AWS CLI, as sketched below. You only have to do this when you need to change the schema.
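
A sketch of that workaround (paths, bucket, and profile name are placeholders; the copy step shells out to the AWS CLI):

import subprocess

import pandas as pd
from deltalake import write_deltalake

df = pd.DataFrame({'some_int': [1, 2, 3], 'some_string': ['one', 'two', 'three']})

# Write the table to a local directory first...
write_deltalake('/tmp/mykey', df)

# ...then copy the files up with the AWS CLI.
subprocess.run(
    ['aws', 's3', 'sync', '/tmp/mykey', 's3://mybucket/delta/mykey',
     '--profile', 'myprofile-with-read-write-access'],
    check=True,
)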

tommydangerous commented 1 year ago

I’m getting similar issues as well.

Thelin90 commented 1 year ago

I am experiencing the same issue.

I am trying to test delta-rs 0.8.1 with a local k8s MinIO.

I get:

thread '<unnamed>' panicked at 'not stream', /Users/runner/.cargo/registry/src/github.com-1ecc6299db9ec823/object_store-0.5.5/src/aws/credential.rs:173:27
  File "/Users/simon/Library/Caches/pypoetry/virtualenvs/heureka-MBcEyvac-py3.9/lib/python3.9/site-packages/deltalake/table.py", line 122, in __init__
    self._table = RawDeltaTable(
pyo3_runtime.PanicException: not stream
storage_options = {"AWS_ACCESS_KEY_ID": os.environ["AWS_ACCESS_KEY_ID"],
"AWS_SECRET_ACCESS_KEY": os.environ["AWS_SECRET_ACCESS_KEY"],
"AWS_REGION": aws_region,
"AWS_ENDPOINT_URL": "localhost:30000"  # I also tried, trino-minio-svc.trino:9000, and host.docker.internal:30000
}

I know my delta table in MinIO works fine because I also use it with Trino; I was keen on testing with MinIO here instead of AWS S3.

DeltaTable(
    table_uri=table_uri,
    storage_options=self.storage_options,
    version=version,
)

Thelin90 commented 1 year ago

I resolved this issue by defining my local MinIO service in K8s like this:

apiVersion: v1
kind: Service
metadata:
  name: trino-minio-svc
  namespace: trino
spec:
  type: NodePort
  ports:
    - name: "9000"
      port: 9000
      targetPort: 9000
      nodePort: 30000
    - name: "9001"
      port: 9001
      targetPort: 9001
      nodePort: 30001
  selector:
    app: minio

I then simply used:

    @staticmethod
    def _setup_storage_options(aws_region: str) -> Dict[str, str]:
        os.environ["AWS_S3_ALLOW_UNSAFE_RENAME"] = "true"
        os.environ["AWS_STORAGE_ALLOW_HTTP"] = "1"

        return {
            "AWS_ACCESS_KEY_ID": os.environ["AWS_ACCESS_KEY_ID"],
            "AWS_SECRET_ACCESS_KEY": os.environ["AWS_SECRET_ACCESS_KEY"],
            "AWS_REGION": aws_region,
            "AWS_ENDPOINT_URL": "http://localhost:30000",
        }

And this worked. It was also important to map port 9000 -> 30000, since 9000 is MinIO's API port; 9001 -> 30001 is the UI port.
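
Putting it together, a minimal write sketch against that MinIO endpoint (bucket, key, and region are placeholders; the important parts are the http:// scheme in the endpoint and the two flags from the snippet above):

import os

import pandas as pd
from deltalake import write_deltalake

os.environ["AWS_S3_ALLOW_UNSAFE_RENAME"] = "true"  # MinIO has no locking backend
os.environ["AWS_STORAGE_ALLOW_HTTP"] = "1"         # endpoint is plain HTTP

storage_options = {
    "AWS_ACCESS_KEY_ID": os.environ["AWS_ACCESS_KEY_ID"],
    "AWS_SECRET_ACCESS_KEY": os.environ["AWS_SECRET_ACCESS_KEY"],
    "AWS_REGION": "us-east-1",                     # placeholder region
    "AWS_ENDPOINT_URL": "http://localhost:30000",  # scheme included
}

df = pd.DataFrame({"some_int": [1, 2, 3]})
write_deltalake("s3://mybucket/delta/mykey", df, storage_options=storage_options)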

LilMonk commented 9 months ago

I'm facing this same issue. Not able to save the dataframe to delta table in AWS S3 bucket.

LilMonk commented 9 months ago

I have solved it. Use this:

delta_table = DeltaTable(table_location) # table_location is s3://<table_location>
write_deltalake(
    delta_table,
    df,
    mode="append",
    storage_options=storage_options,
)

It won't work if you are trying to do something like this:

write_deltalake(
    table_location, # String table location.
    df,
    mode="append",
    storage_options=storage_options,
)

There is some internal implementation error, I guess.

ion-elgreco commented 7 months ago

Seems to be a very old issue, closing this one :) Please tag me if you are still facing it and I'll reopen.