GlareDB / glaredb

GlareDB: An analytics DBMS for distributed data
https://glaredb.com
GNU Affero General Public License v3.0

Unable to add gcs and s3 objects as external tables #1065

Closed scsmithr closed 2 months ago

scsmithr commented 1 year ago

Context

Outside google cloud:

2023-05-31T17:42:49.731656Z ERROR                 main ThreadId(01) testing::slt::cli: crates/testing/src/slt/cli.rs:244: Error while running test `sqllogictests/demos/replicant` error=test fail: statement failed: db error: ERROR: External table validation failed: Generic GCS error: Error performing get request tyrell/voight_kampff.csv: response error "{
  "error": {
    "code": 401,
    "message": "Invalid Credentials",
    "errors": [
      {
        "message": "Invalid Credentials",
        "domain": "global",
        "reason": "authError",
        "locationType": "header",
        "location": "Authorization"
      }
    ]
  }
}
", after 0 retries: HTTP status client error (401 Unauthorized) for url (https://storage.googleapis.com/storage/v1/b/glaredb%2Ddemos/o/tyrell%2Fvoight%5Fkampff%2Ecsv?alt=json)

Inside google cloud:

External table validation failed: Generic GCS error: Error performing get request tyrell/voight_kampff.csv: response error "{ "error": { "code": 412, "message": "The type of authentication token used for this request requires that Uniform Bucket Level Access be enabled.", "errors": [ { "message": "The type of authentication token used for this request requires that Uniform Bucket Level Access be enabled.", "domain": "global", "reason": "conditionNotMet", "locationType": "header", "location": "If-Match" } ] } } ", after 0 retries: HTTP status client error (412 Precondition Failed) for url (https://storage.googleapis.com/storage/v1/b/glaredb%2Ddemos/o/tyrell%2Fvoight%5Fkampff%2Ecsv?alt=json)

The object_store crate tries to dial the metadata service when it isn't given application default credentials or a service account. For publicly readable objects this is unnecessary, but the version we're using (0.5.6) has no way of disabling that dial.

Version 0.6 does allow slotting in a custom authenticator, which we could use. However, datafusion is stuck on 0.5.6 right now, so we're blocked for now.
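
To make the behaviour above concrete, here is a minimal sketch of the credential resolution being described, with an opt-out for public objects. Every name in it (`CredentialSource`, `GcsOptions`, `resolve_credentials`) is hypothetical and is not the object_store API. The ordering of service account before application default credentials is also an assumption made for illustration.

```rust
/// Where credentials for a GCS request come from.
/// Hypothetical type for illustration; not part of object_store.
#[derive(Debug, PartialEq)]
pub enum CredentialSource {
    /// An explicitly provided service account key.
    ServiceAccount(String),
    /// Application default credentials found on the local machine.
    ApplicationDefault,
    /// Fall back to dialing the metadata service.
    MetadataServer,
    /// No credentials at all, for publicly readable objects.
    Anonymous,
}

/// Hypothetical options a caller might pass when registering a GCS table.
pub struct GcsOptions {
    pub service_account_key: Option<String>,
    pub has_application_default: bool,
    /// Opt-out flag: never dial the metadata service.
    pub anonymous: bool,
}

/// Picks a credential source. Explicit credentials win (assumed order);
/// the anonymous flag only replaces the metadata-service fallback.
pub fn resolve_credentials(opts: &GcsOptions) -> CredentialSource {
    if let Some(key) = &opts.service_account_key {
        return CredentialSource::ServiceAccount(key.clone());
    }
    if opts.has_application_default {
        return CredentialSource::ApplicationDefault;
    }
    if opts.anonymous {
        return CredentialSource::Anonymous;
    }
    CredentialSource::MetadataServer
}
```

Without the `anonymous` flag, a caller with no credentials always ends up at `MetadataServer`, which is the failure mode seen in the logs above.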

greyscaled commented 12 months ago

vrongmeal commented 11 months ago

This works with the updated version. No changes are required, right?

glaredb=> create external table def from gcs options ( bucket = 'vrongmeal-public-test', location = 'userdata1.parquet' );
CREATE TABLE

scsmithr commented 11 months ago

If you're testing this locally, it's probably working because it's picking up application default credentials. Removing ~/.config/gcloud/application_default_credentials.json would probably make this fail with the failure to hit the metadata server.
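
For anyone reproducing this, a small sketch of how one might check whether that file is present before running the test. The function names are hypothetical; only the path comes from the comment above, and the check assumes a Unix-style home directory.

```rust
use std::path::{Path, PathBuf};

/// Returns the application default credentials path under the given home
/// directory (the path mentioned in the comment above).
pub fn adc_path(home: &Path) -> PathBuf {
    home.join(".config")
        .join("gcloud")
        .join("application_default_credentials.json")
}

/// True if the credentials file exists, in which case a local test may pass
/// only because those credentials are being picked up.
pub fn has_application_default_credentials(home: &Path) -> bool {
    adc_path(home).is_file()
}
```

Passing the home directory in explicitly (for example from the `HOME` environment variable) keeps the check easy to point at a scratch directory.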

vrongmeal commented 11 months ago

I added a NullCredentailProvider to resolve this.

There's one small issue, and it's specific to GCS. The object store always sends the Authorization: Bearer ... header (even when the token is empty), so the first time a table is accessed, the request fails. Once the table has been accessed without the header, subsequent requests pass. This is probably because GCS caches accessed objects, so we can't guarantee that it works every time.
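
A rough sketch of the behaviour that would avoid this: leave the Authorization header out entirely when there is no token, rather than sending an empty bearer value. The `Credential` enum and `request_headers` function are hypothetical and only illustrate the idea; they are not how object_store builds requests.

```rust
/// Hypothetical credential type for illustration; not the object_store API.
#[derive(Debug, Clone, PartialEq)]
pub enum Credential {
    /// No credentials, for publicly readable objects.
    Anonymous,
    /// An OAuth2 bearer token.
    Bearer(String),
}

/// Builds the headers for a GET on an object. The Authorization header is
/// added only when there is a non-empty token; anonymous access and empty
/// tokens produce no Authorization header at all.
pub fn request_headers(cred: &Credential) -> Vec<(String, String)> {
    let mut headers = vec![("Accept".to_string(), "application/json".to_string())];
    match cred {
        Credential::Bearer(token) if !token.trim().is_empty() => {
            headers.push(("Authorization".to_string(), format!("Bearer {}", token)));
        }
        // Anonymous, or a bearer token that is empty: send nothing.
        _ => {}
    }
    headers
}
```

Treating an empty token the same as `Anonymous` is a design choice in this sketch; a stricter version could reject empty tokens as a configuration error instead.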

vrongmeal commented 11 months ago

Created an issue (and submitted a PR) on object store: https://github.com/apache/arrow-rs/issues/4417

greyscaled commented 11 months ago

Waiting on DF updates. Right now we don't have a great way to manage these deps, so we're waiting. In the future we might want to think about it more in depth.

vrongmeal commented 10 months ago

We'll have to wait for another release, I guess. The latest object_store release doesn't have this.

vrongmeal commented 10 months ago

A similar problem exists with S3. We should fix that as well.

greyscaled commented 9 months ago

@vrongmeal is this one still blocked?

greyscaled commented 9 months ago

Moving off of current & next sprint (will come after 0.5.0)

tychoish commented 2 months ago

@vrongmeal @greyscaled Wanted to check in on this... I think this is probably good now?

vrongmeal commented 2 months ago

I'll check this today. I think GCS should be good. I'm doubtful about S3 (and Azure; this issue predates Azure support).

vrongmeal commented 2 months ago

GCS works (raising a PR for NULL Credentials):

> select * from 'gs://vrongmeal-public-test/data.csv';
┌───────┬─────────┐
│    id │ name    │
│    ── │ ──      │
│ Int64 │ Utf8    │
╞═══════╪═════════╡
│     1 │ vaibhav │
│     2 │ sean    │
│     3 │ grey    │
└───────┴─────────┘

Need to make a similar change upstream for S3 and Azure.