delta-io / delta-rs

A native Rust library for Delta Lake, with bindings into Python
https://delta-io.github.io/delta-rs/
Apache License 2.0

[python] tcp connect error reading from a public s3 bucket with `{"anon": "true"}` #1554

Open j-bennet opened 1 year ago

j-bennet commented 1 year ago

Environment

Delta-rs version: 0.10.0

Binding: Python

Environment:


Bug

What happened:

Can't read a table from a public S3 bucket:

from deltalake import DeltaTable
storage_options = {"AWS_REGION": "us-east-2", "anon": "true"}
dt = DeltaTable("s3://coiled-datasets/h2o-delta/N_1e7_K_1e2/", storage_options=storage_options)

Error looks like this:

Traceback (most recent call last):
  File "/Users/jbennet/src/dask-deltatable/t7.py", line 10, in <module>
    dt = DeltaTable(uri, storage_options=storage_options)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jbennet/mambaforge/envs/dask-deltatable/lib/python3.11/site-packages/deltalake/table.py", line 238, in __init__
    self._table = RawDeltaTable(
                  ^^^^^^^^^^^^^^
OSError: Generic S3 error: response error "request error", after 0 retries: error sending request for url (http://169.254.169.254/latest/api/token): error trying to connect: tcp connect error: No route to host (os error 65)

Setting AWS_ENDPOINT_URL doesn't help.

What you expected to happen:

DeltaTable instance initialized.

How to reproduce it:

Code snippet above.

More details:

ognis1205 commented 1 year ago

Related issues: https://github.com/delta-io/delta-rs/issues/809

Related threads: https://delta-users.slack.com/archives/C013LCAEB98/p1688688536894189

rtyler commented 1 year ago

I can certainly confirm that this still exists. This isn't a problem in the Python or Rust layer, but in fact a problem with object_store. Here's an example that reproduces it:

use object_store::aws::AmazonS3Builder;
use object_store::ObjectStore;
use futures::stream::StreamExt;

#[tokio::main]
async fn main() -> deltalake::DeltaResult<()> {
    // s3://coiled-datasets/h2o-delta/N_1e7_K_1e2/
    let s3 = AmazonS3Builder::from_env()
        .with_bucket_name("coiled-datasets")
        .with_region("us-east-2")
        .build()?;

    let mut stream = s3.list(None).await?;
    println!("Reading list stream");

    while let Some(result) = stream.next().await {
        println!("listed: {result:?}");
    }

    Ok(())
}

Output

Reading list stream
listed: Err(Generic { store: "S3", source: Error { retries: 1, message: "request error", source: Some(reqwest::Error { kind: Request, url: Url { scheme: "http", cannot_be_a_base: false, username: "", password: None, host: Some(Ipv4(169.254.169.254)), port: None, path: "/latest/api/token", query: None, fragment: None }, source: hyper::Error(Connect, ConnectError("tcp connect error", Os { code: 110, kind: TimedOut, message: "Connection timed out" })) }), status: None } })

The error seems to originate from here. Basically, the object_store crate does not currently accept that credentials can be missing and that this can be okay, so the fix will need to be made upstream.
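
To make the failure mode concrete, here is a minimal sketch of the behaviour described above. It is a simplified, hypothetical model, not object_store's actual code. The function `resolve_credential_source` and its `allow_anonymous` flag are invented for illustration, and a real credential chain has more steps. The only details taken from this thread are the `anon` option and the metadata URL that appears in both error messages.

```python
from typing import Mapping

# Instance metadata token endpoint, copied from the error messages in this thread.
IMDS_TOKEN_URL = "http://169.254.169.254/latest/api/token"


def resolve_credential_source(options: Mapping[str, str], allow_anonymous: bool) -> str:
    """Hypothetical, simplified credential chain (illustration only)."""
    # 1. Explicit static keys win if both are present.
    if options.get("AWS_ACCESS_KEY_ID") and options.get("AWS_SECRET_ACCESS_KEY"):
        return "static"
    # 2. Only a chain that accepts "no credentials" can honour an anon request.
    if allow_anonymous and options.get("anon", "").lower() == "true":
        return "anonymous"
    # 3. Otherwise fall through to the instance metadata service, which is
    #    unreachable outside EC2 and produces the tcp connect error.
    return f"imds:{IMDS_TOKEN_URL}"


opts = {"AWS_REGION": "us-east-2", "anon": "true"}
print(resolve_credential_source(opts, allow_anonymous=False))
print(resolve_credential_source(opts, allow_anonymous=True))
```

With `allow_anonymous=False` the sketch models the reported behaviour: the `anon` option is ignored and the lookup ends at the metadata endpoint. With `allow_anonymous=True` it models what an upstream fix would presumably need to allow.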