apache / iceberg-python

Apache PyIceberg
https://py.iceberg.apache.org/
Apache License 2.0

Docs: PostgreSql integration #78

Closed mobley-trent closed 10 months ago

mobley-trent commented 10 months ago

Apache Iceberg version

0.5.0 (latest release)

Please describe the bug 🐞

Python: 3.11, PostgreSQL: 16

I'm having issues setting up the initial connection to PostgreSQL. I'm trying to run the code in the docs, but I'm getting this error:

sqlalchemy.exc.ProgrammingError: (psycopg2.errors.UndefinedTable) relation "iceberg_tables" does not exist
LINE 2: FROM iceberg_tables
             ^

This is my code:

from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "demo",
    **{
        "type": "sql",
        "uri": "postgresql+psycopg2://[username]:[passwd]@localhost/demo"
    }
)

catalog.create_namespace("docs_example")

This is the .pyiceberg.yaml file:

catalog:
  demo:
    type: sql
    uri: postgresql+psycopg2://[username]:[passwd]@localhost/demo

demo is the name of the database I'm using; it has a table named students which I'm trying to access. I've attached the full stack trace for your review: stack_trace_iceberg.txt

gkaretka commented 10 months ago

👍 same issue

Fokko commented 10 months ago

@mobley-trent @gkaretka Thanks for reaching out here. The catalog tables are not created by default, but I think that might be the wrong behaviour, since you both expected them to be created automatically.

To get things working on your end, you can run:

from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "demo",
    **{
        "type": "sql",
        "uri": "postgresql+psycopg2://[username]:[passwd]@localhost/demo"
    }
)

catalog.create_tables()

gkaretka commented 10 months ago

Thanks 👍 was actually trying to investigate a little further and came to the same conclusion. Creating tables by calling:

catalog.create_tables()

solved my problem.

gkaretka commented 10 months ago

Maybe this issue can be relabeled as documentation rather than a bug. What do you think @mobley-trent?

Fokko commented 10 months ago

I have the same feeling. I think this should be done once, and not every time you initialize the catalog (if you run a lot of jobs in parallel using Airflow, these calls can add up). I think we can add this to the docs as well.
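Since `create_tables()` is a one-time setup step, a guard that checks whether the catalog table already exists before running any DDL keeps parallel jobs from repeating the work. A minimal, self-contained sketch of that pattern using stdlib `sqlite3` (the table name `iceberg_tables` comes from the error above; the column list here is illustrative, not PyIceberg's actual schema, and in a real setup the body of the `if` would be `catalog.create_tables()`):

```python
import sqlite3


def ensure_catalog_tables(conn: sqlite3.Connection) -> bool:
    """Run one-time setup only if the catalog table is missing.

    Returns True if the table was created on this call, False if it
    already existed (i.e. the call was a no-op).
    """
    exists = conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table' AND name='iceberg_tables'"
    ).fetchone()
    if exists:
        return False
    # Stand-in for catalog.create_tables(): create the table exactly once.
    conn.execute(
        "CREATE TABLE iceberg_tables ("
        "catalog_name TEXT, table_namespace TEXT, table_name TEXT)"
    )
    return True


conn = sqlite3.connect(":memory:")
print(ensure_catalog_tables(conn))  # True  (first call creates the table)
print(ensure_catalog_tables(conn))  # False (subsequent calls are no-ops)
```

The same check-before-create idea applies whichever backend you use; SQLAlchemy's inspector offers an equivalent existence check for PostgreSQL.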

Fokko commented 10 months ago

Are you interested in providing a PR? :)

mobley-trent commented 10 months ago

Yes I will look into this 👍 @Fokko

mobley-trent commented 10 months ago

@Fokko What should we place in the location parameter of catalog.create_table(...)? Looking at the code, we're supposed to specify a warehouse location, but I'm not quite sure what this means in the context of PostgreSQL.

Fokko commented 10 months ago

This should be a location to a bucket (or other storage path) where the metadata and the data itself will be stored.

I think this finds its origin in Hive: https://docs.cloudera.com/cdp-private-cloud-upgrade/latest/upgrade-cdh/topics/hive-khub-location-syntax.html but it is similarly relevant to the SQL catalog.
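Concretely, the warehouse location can be supplied in the catalog configuration alongside the connection URI, so it doesn't have to be passed per table. A hedged sketch extending the .pyiceberg.yaml from the issue above (the `warehouse` path is illustrative; any filesystem or object-store location you control should work):

```yaml
catalog:
  demo:
    type: sql
    uri: postgresql+psycopg2://[username]:[passwd]@localhost/demo
    warehouse: file:///tmp/warehouse   # or e.g. an s3:// bucket path
```

PostgreSQL only holds the catalog's pointer rows; the Iceberg metadata and data files themselves live under this warehouse location.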