apache / iceberg

Apache Iceberg
https://iceberg.apache.org/
Apache License 2.0

Accessing Minio with Pyiceberg #10709

Open muniatl opened 4 months ago

muniatl commented 4 months ago

Query engine

No response

Question

I have code that works against an S3 endpoint with a SqlCatalog backed by SQLite. For testing, however, I want to run it against a MinIO deployment hosted on localhost. I have tried various options with no luck. Which parameters do I need to pass to SqlCatalog and create_table? My code looks like this:

catalog = SqlCatalog(
    "default",
    **{
        "uri": f"sqlite:///{warehouse_path}/pyiceberg_catalog.db",
        "uri" : f"postgresql+psycopg2://postgres:ph1@localhost:5433/template1",
        "warehouse": "s3://127.0.0.1:9000/iceberg", # have tried "s3://iceberg" "s3://127.0.0.1/iceberg" and completely commenting out warehouse
        "s3.endpoint" : "s3://127.0.0.1:9000",
        #"minio-root-user": "admin",
        #"minio-root-password": "password",
        #"minio-domain" : "minio",
        #"s3.access-key-id": "admin",
        #"s3.secret-access-key": "password",
    },
)

table = catalog.create_table(
    "default1.taxi_dataset",
    schema=df.schema,
)

The create_table call fails with:

OSError: When getting information for key 'iceberg/default1.db/taxi_dataset/metadata/00000-671ce9cf-73ff-49a2-a22e-408d8758625b.metadata.json' in bucket '127.0.0.1:9000': AWS Error NETWORKCONNECTION during HeadObject operation: curlCode: 6, Couldn't resolve host name.

I can access the MinIO server, log in, and even upload files. Any pointers on the valid properties to pass for MinIO would be much appreciated.

rggyanav commented 3 months ago

@muniatl - I think the MinIO endpoint configuration should not use the s3:// prefix. It should use the HTTP/HTTPS protocol instead, and the warehouse should be a plain S3 URI without the endpoint in it. For example:

    "warehouse": "s3://iceberg",  # S3 URI format, without the endpoint
    "s3.endpoint": "http://127.0.0.1:9000",  # MinIO endpoint over HTTP

Could you please try this?
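To illustrate why the original settings fail, here is a minimal sketch using only the standard library. It shows that the host part of an s3:// URI is read as the bucket name, which matches the bucket '127.0.0.1:9000' in the error above, and it builds a property dict in the shape suggested here. The helper names split_s3_uri and minio_catalog_properties are hypothetical, and the credentials are placeholders taken from the commented-out lines in the question, not known-good values.

```python
from typing import Dict, Tuple
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Split an s3:// URI into (bucket, key prefix)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3":
        raise ValueError(f"not an s3:// URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def minio_catalog_properties(
    endpoint: str,
    bucket: str,
    access_key: str,
    secret_key: str,
    catalog_uri: str,
) -> Dict[str, str]:
    """Build catalog properties for a MinIO-backed warehouse (hypothetical helper)."""
    # The endpoint is a web address, so it must be http:// or https://, not s3://.
    if urlparse(endpoint).scheme not in ("http", "https"):
        raise ValueError(f"s3.endpoint must start with http:// or https://: {endpoint!r}")
    return {
        "uri": catalog_uri,
        "warehouse": f"s3://{bucket}",  # bucket only, no host or port
        "s3.endpoint": endpoint,
        "s3.access-key-id": access_key,
        "s3.secret-access-key": secret_key,
    }


# The warehouse from the question: host:port ends up as the bucket name.
print(split_s3_uri("s3://127.0.0.1:9000/iceberg"))  # ('127.0.0.1:9000', 'iceberg')
# The suggested form: the bucket is just 'iceberg'.
print(split_s3_uri("s3://iceberg"))  # ('iceberg', '')

props = minio_catalog_properties(
    endpoint="http://127.0.0.1:9000",
    bucket="iceberg",
    access_key="admin",        # placeholder credentials
    secret_key="password",     # placeholder credentials
    catalog_uri="sqlite:///pyiceberg_catalog.db",
)
```

The resulting dict could then be passed the same way as in the question, i.e. SqlCatalog("default", **props). Whether further properties are needed depends on the MinIO setup.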

cfrancois7 commented 2 months ago

I tried something similar with my local config:

from pyiceberg.catalog.sql import SqlCatalog
from pyiceberg.catalog import load_catalog

warehouse_path = "local_s3"
catalog = SqlCatalog(
    "catalog_1",
    **{
        "uri": f"sqlite:///{warehouse_path}/catalog.db",
        "warehouse":"s3://iceberg",
        "s3.endpoint": "http://localhost:9001",
        "s3.access-key-id": "minio_user",
        "s3.secret-access-key": "minio1234",
    },
)
catalog.create_namespace_if_not_exists('test')

Then creating the table raises an error.

import pyarrow as pa

# Define the schema for the projects table
projects_schema = pa.schema([
    pa.field('id', pa.uint8(), nullable=False),
    pa.field('name', pa.string(), nullable=False),
    pa.field('description', pa.string()),
    pa.field('creation_date', pa.timestamp('s')),
    pa.field('modification_date', pa.timestamp('s'))
])
projects_table = catalog.create_table_if_not_exists(
    'test.projects', 
    schema=projects_schema,
)

The error:

OSError: When getting information for key 'test.db/projects/metadata/00000-5a3bb77f-7161-4bfe-a7af-b823f6f0cb71.metadata.json' in bucket 'iceberg': AWS Error UNKNOWN (HTTP status 400) during HeadObject operation: No response body.
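
One thing that may be worth checking for an HTTP 400 like this is whether the configured endpoint is really the MinIO S3 API port. This is an assumption, not a confirmed diagnosis: many MinIO setups serve the web console on 9001 and the S3 API on 9000, but nothing in this thread confirms how this deployment is configured. The sketch below probes MinIO's liveness path, /minio/health/live, with the standard library. The function names are hypothetical. A True result does not prove the port is the S3 API, but a False result is a hint that the endpoint is wrong or unreachable.

```python
import urllib.request


def health_url(endpoint: str) -> str:
    """Return the MinIO liveness URL for an endpoint such as http://localhost:9000."""
    return endpoint.rstrip("/") + "/minio/health/live"


def answers_liveness_probe(endpoint: str, timeout: float = 2.0) -> bool:
    """Return True if the endpoint answers the liveness path with HTTP 200."""
    try:
        with urllib.request.urlopen(health_url(endpoint), timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers connection refused, timeouts, and HTTP error statuses.
        return False


if __name__ == "__main__":
    # Candidate ports are assumptions; adjust them to the local deployment.
    for candidate in ("http://localhost:9000", "http://localhost:9001"):
        print(candidate, answers_liveness_probe(candidate))
```

If another port responds where the configured one does not, pointing "s3.endpoint" at that port would be the next thing to try.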