GlareDB / glaredb

GlareDB: An analytics DBMS for distributed data
https://glaredb.com
GNU Affero General Public License v3.0

Chore: Document how to connect to S3/GCS bucket from cli and python lib #1880

Open scsmithr opened 11 months ago

scsmithr commented 11 months ago

Description

I'm not sure what connecting to a bucket looks like right now.

import glaredb

# What do I put for storage options?
con = glaredb.connect('gs://bucket/path', storage_options=?)

Can just leave stuff in comments here and we can transfer over to docs whenever.

gruuya commented 11 months ago

Here is a brief tour of the new options (from an actual IPython session):

In [1]: import glaredb

In [2]: con_1 = glaredb.connect(location="memory://")  # connect to in-memory store

In [3]: con_1.sql("create table test_1 as values (1, 'one'), (2, 'two')")
Out[3]: Noop

In [4]: con_1.sql("select * from test_1").to_arrow()
Out[4]:
pyarrow.Table
column1: int64
column2: string
----
column1: [[1,2]]
column2: [["one","two"]]

In [5]: con_2 = glaredb.connect(location="../../data-dir")  # connect to local file system

In [6]: con_2.sql("create table test_2 as values (3, 'three'), (4, 'four')")
Out[6]: Noop

In [7]: con_2.sql("select * from test_2").to_arrow()
Out[7]:
pyarrow.Table
column1: int64
column2: string
----
column1: [[3,4]]
column2: [["three","four"]]

In [8]: !tree ../../data-dir
../../data-dir
└── databases
    └── 00000000-0000-0000-0000-000000000000
        ├── tables
        │   └── 20000
        │       ├── _delta_log
        │       │   ├── 00000000000000000000.json
        │       │   └── 00000000000000000001.json
        │       └── part-00001-f3852b16-ee29-403e-9926-e56d6689bbaa-c000.snappy.parquet
        ├── tmp
        │   └── 11cbfd08-fa8c-47fb-abc4-8a2bee166220
        └── visible
            ├── catalog.0
            ├── catalog.1
            ├── lease
            └── metadata

8 directories, 7 files

In [9]: con_3 = glaredb.connect(
    ...: location="gs://glaredb-test-bucket/path/to/some/folder",
    ...: storage_options=dict(service_account_path="/tmp/fake-gcs-creds.json")
    ...: ) # connect to a fake GCS server, and use a specific path

In [10]: con_3.sql("create table test_3 as values (5, 'five'), (6, 'six')")
Out[10]: Noop

In [11]: con_3.sql("select * from test_3").to_arrow()
Out[11]:
pyarrow.Table
column1: int64
column2: string
----
column1: [[5,6]]
column2: [["five","six"]]

In [12]: !curl -s --insecure http://0.0.0.0:4443/storage/v1/b/glaredb-test-bucket/o | jq .
{
  "kind": "storage#objects",
  "items": [
    {
      "kind": "storage#object",
      "name": "path/to/some/folder/databases/00000000-0000-0000-0000-000000000000/tables/20000/_delta_log/00000000000000000000.json",
      "id": "glaredb-test-bucket/path/to/some/folder/databases/00000000-0000-0000-0000-000000000000/tables/20000/_delta_log/00000000000000000000.json",
      "bucket": "glaredb-test-bucket",
      "size": "1223",
      "crc32c": "94JF+A==",
      "md5Hash": "g/35a5HOkx0W0uaZmmSIMw==",
      "etag": "\"g/35a5HOkx0W0uaZmmSIMw==\"",
      "timeCreated": "2023-10-19T09:06:30.710542Z",
      "updated": "2023-10-19T09:06:30.710552Z",
      "generation": "1697706390710661"
    },
    ...
  ]
}

In [13]: con_4 = glaredb.connect(
    ...: location="http://localhost:9000/glaredb-test-bucket/some/sub/directory", 
    ...: storage_options={"access_key_id": "glaredb", "secret_access_key": "glaredb_test"}
    ...: ) # connect to a MinIO server to test the S3 object store family

In [14]: con_4.sql("create table test_4 as values (7, 'seven'), (8, 'eight')")
Out[14]: Noop

In [15]: con_4.sql("select * from test_4").to_arrow()
Out[15]:
pyarrow.Table
column1: int64
column2: string
----
column1: [[7,8]]
column2: [["seven","eight"]]

The storage_options kwarg can be omitted, in which case the required parameters are inferred from the environment (GOOGLE_APPLICATION_CREDENTIALS for GCS, and AWS_ACCESS_KEY_ID/AWS_SECRET_ACCESS_KEY for S3).
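
For the docs, it might help to spell out how those environment variables line up with the explicit storage_options keys used in the session above. The following is only a rough sketch: explicit_storage_options is a hypothetical helper, not part of glaredb, and it assumes that an http(s):// location means the S3 family (as in the MinIO example) and that gs:// means GCS. It does not call glaredb at all; it just builds the dict you would otherwise pass by hand.

```python
import os

# Hypothetical helper, NOT part of glaredb. It only restates the mapping
# between storage_options keys (from the session above) and the environment
# variables named in the comment above.
GCS_ENV = {"service_account_path": "GOOGLE_APPLICATION_CREDENTIALS"}
S3_ENV = {
    "access_key_id": "AWS_ACCESS_KEY_ID",
    "secret_access_key": "AWS_SECRET_ACCESS_KEY",
}


def explicit_storage_options(location, env=None):
    """Build the storage_options dict that corresponds to the environment."""
    env = os.environ if env is None else env
    if location.startswith("gs://"):
        mapping = GCS_ENV
    elif location.startswith(("http://", "https://")):
        # Assumption: an HTTP endpoint is an S3-compatible store, as with MinIO above.
        mapping = S3_ENV
    else:
        # memory:// and local paths were used without any credentials above.
        return {}
    missing = [var for var in mapping.values() if var not in env]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return {option: env[var] for option, var in mapping.items()}


# Example with a stand-in environment instead of the real os.environ.
fake_env = {"GOOGLE_APPLICATION_CREDENTIALS": "/tmp/fake-gcs-creds.json"}
opts = explicit_storage_options("gs://glaredb-test-bucket/path/to/some/folder", fake_env)
print(opts)  # {'service_account_path': '/tmp/fake-gcs-creds.json'}
```

The resulting dict has the same shape as the storage_options argument passed to glaredb.connect in In [9] and In [13], so it could be handed over as-is when you prefer explicit credentials to relying on the environment.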