apache / iceberg-python

Apache PyIceberg
https://py.iceberg.apache.org/
Apache License 2.0
309 stars 114 forks source link

Query on multiple tables using DuckDB #827

Closed ArijitSinghEDA closed 1 week ago

ArijitSinghEDA commented 2 weeks ago

Feature Request / Improvement

A way to load and scan multiple tables (across different namespaces), and use DuckDB query to perform select and join operations (for conditional join)

kevinjqliu commented 1 week ago

Hi @ArijitSinghEDA could you provide an example?

ArijitSinghEDA commented 1 week ago

Suppose we have a table table1 in namespace ns1 with columns employee_id, employee_name, department_id, and we have a table table2 in namespace ns2 with columns department_id, department_name. What I am thinking of is something like this:

from pyiceberg.catalog import load_catalog
catalog = load_catalog("example")
tables = catalog.load_tables(("ns1", "table1"), ("ns2", "table2"))
con = tables.scan().to_duck_db()
df = con.execute(
    """"""select * from ns1.table1 join n2.table2 using (department_id)""""""
).fetchdf()
print(df)
mike-luabase commented 1 week ago

@ArijitSinghEDA you can use the iceberg extension in duckdb

It support multiple tables, but you'll need to follow the tips here https://github.com/duckdb/duckdb_iceberg/issues/44