duckdb / duckdb_iceberg


Can't use the extension if my data catalog did not create a version-hint.text file #29

Open jacopotagliabue opened 7 months ago

jacopotagliabue commented 7 months ago

My s3 bucket with iceberg (picture below) cannot be queried with

iceberg_scan('s3://bucket/iceberg', ALLOW_MOVED_PATHS=true)

nor

iceberg_scan('s3://bucket/iceberg/*', ALLOW_MOVED_PATHS=true)

In particular the system is trying to find a very specific file (so the * pattern gets ignored):

duckdb.duckdb.Error: Invalid Error: HTTP Error: Unable to connect to URL https://bucket.s3.amazonaws.com/iceberg/metadata/version-hint.text

Unfortunately that file does not exist in my iceberg/ folder, nor in any of the iceberg/sub/metadata folders. Compared to the sample data zip in the DuckDB docs about Iceberg, it is clear my Iceberg tables are missing that file, which the current implementation depends on.

That said, version-hint.text seems like something we do not really need: that information could default to a version, or perhaps be an additional parameter, instead of the scan failing when the file is not found.
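One possible fallback (a sketch of my suggestion above, not part of the extension): since Iceberg metadata files are named with a leading version number, the latest one can be picked by sorting the filenames, with no version-hint.text needed. The function name and the listing step are hypothetical.

```python
import re

def latest_metadata(filenames):
    """Return the *.metadata.json filename with the highest leading version number.

    Assumes the Spark/Nessie naming convention 00000-<uuid>.metadata.json;
    the caller supplies the listing of the metadata/ folder.
    """
    pattern = re.compile(r"^(\d+)-.*\.metadata\.json$")
    versioned = []
    for name in filenames:
        m = pattern.match(name)
        if m:
            versioned.append((int(m.group(1)), name))
    if not versioned:
        raise FileNotFoundError("no *.metadata.json files found")
    # max() compares the (version, name) tuples by version first
    return max(versioned)[1]
```

The returned path could then be passed straight to iceberg_scan instead of relying on version-hint.text.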

Original discussion with @Alex-Monahan in the dbt Slack is here: note that I originally got pointed to this as a possible cause, so perhaps reading a table that is formally Iceberg is not really independent of the data catalog it belongs to?

(attached screenshot: s3_structure)

jacopotagliabue commented 7 months ago

Sorry, to be a bit clearer: even if we fix the version-hint problem, the fact that the system uses https://bucket.s3.amazonaws.com/iceberg/metadata/ as a base path does not seem aligned with the state of my data lake (see the picture above for the current layout, written by Spark + Nessie).

Happy to help debug this if there's something we can quickly try out.

harel-e commented 7 months ago

I ran into a similar issue on AWS with Glue as the catalog for Iceberg.

The metadata files stored in S3 are of the following pattern:

00000-0b4430d2-fbee-4b0d-90c9-725f013d6f82.metadata.json
00001-6e3b4909-7e6b-486f-bf81-b1331eba3ac8.metadata.json

I suspect Glue holds the pointer to the current metadata.

rustyconover commented 7 months ago

It does.

You can see the current pointer in the table properties if you call Glue's GetTable.
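That lookup can be sketched with boto3 (an assumption on my part, not something the extension does): Glue-managed Iceberg tables expose the current pointer in the table parameters under metadata_location. The function name is hypothetical; credentials are assumed to be configured in the environment.

```python
def glue_metadata_location(database, table, glue_client=None):
    """Return the current Iceberg metadata_location for a Glue table.

    `glue_client` can be injected for testing; otherwise a real boto3
    Glue client is created (requires AWS credentials).
    """
    if glue_client is None:
        import boto3  # assumed available when actually talking to AWS
        glue_client = boto3.client("glue")
    response = glue_client.get_table(DatabaseName=database, Name=table)
    return response["Table"]["Parameters"]["metadata_location"]
```

The returned s3:// path can then be fed directly to iceberg_scan.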


samansmink commented 7 months ago

Currently no iceberg catalog implementations are available in the iceberg extension. Without a version hint you will need to pass the direct path to the correct metadata file manually, check: https://github.com/duckdb/duckdb_iceberg/pull/18

jacopotagliabue commented 7 months ago

@samansmink thanks, but the work-around does not seem to work: I fetch s3://bucket/iceberg/taxi_fhvhv_bbb/metadata/aaa.metadata.json from my data catalog manually and pass it to my query:

SELECT PULocationID, DOLocationID, trip_miles, trip_time FROM iceberg_scan('s3://bucket/iceberg/taxi/metadata/aaa.metadata.json') WHERE pickup_datetime >= '2022-01-01T00:00:00-05:00' AND pickup_datetime < '2022-01-02T00:00:00-05:00'

I still get a 404 on the version file:

duckdb.duckdb.Error: Invalid Error: HTTP Error: Unable to connect to URL "....metadata.json/metadata/version-hint.text": 404 (Not Found)

As if it were trying to append metadata/version-hint.text to my JSON path. Am I doing something dumb?

jacopotagliabue commented 7 months ago

Small update: I needed to upgrade to 0.9.2 to scan a JSON file directly (posting here in case others stumble on this). The new error I get is No such file or directory, on a path the scan found:

"s3a://bucketiceberg/taxi_fhvhv/metadata/snap-aaaa.avro"

If I try with allow_moved_paths (the only thing that came to mind), I then get:

duckdb.duckdb.InvalidInputException: Invalid Input Error: Enabling allow_moved_paths is not enabled for directly scanning metadata files.

Any way around all of this?

jacopotagliabue commented 7 months ago

Small update 2: I think I know why the avro path resolution does not work, just from looking closely at:

duckdb.duckdb.IOException: IO Error: Cannot open file "s3a://.......avro": No such file or directory

A Nessie file system (written with Spark) uses s3a:// as the prefix, not s3:// as DuckDB presumably expects. In fact, if I manually change s3a://.......avro into s3://.......avro, I can find the file in my data lake!

A quick way to patch this would be to replace the s3a prefix with the standard s3 one for object storage paths (or add a flag that somehow toggles that behavior, etc.). A longer-term fix would be for Nessie to return general paths rather than Hadoop-specific ones.
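The short-term patch I mean is just a prefix rewrite before handing paths to DuckDB, something like this sketch (function name is mine; the s3n:// case is an extra assumption for older Hadoop paths):

```python
def normalize_s3_path(path):
    """Rewrite Hadoop-style s3a:// (and s3n://) prefixes to plain s3://."""
    for prefix in ("s3a://", "s3n://"):
        if path.startswith(prefix):
            return "s3://" + path[len(prefix):]
    return path  # already a plain s3:// path (or something else entirely)
```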

What do you think could be a short-term work-around @samansmink ?

samansmink commented 7 months ago

@jacopotagliabue s3a URLs are indeed not supported currently.

If s3a:// URLs are interoperable with s3:// URLs, which, as far as I can tell from a quick look, seems to be the case, we could consider adding support to DuckDB, which would solve this issue.

jacopotagliabue commented 7 months ago

That would be great and the easiest fix. I'll reach out to the Nessie folks anyway to let them know about this, but if you could make the change in DuckDB, that would (presumably?) solve the current issue.

samansmink commented 7 months ago

https://github.com/duckdb/duckdb/pull/9817

harel-e commented 6 months ago

For Java Iceberg users out there, I found a way to retrieve the latest metadata without having to query the catalog directly.

Once you load the table from the catalog, you can call the following method, which returns the latest metadata location. You can use that location with the iceberg_scan function.

import org.apache.iceberg.BaseTable;

public static String currentMetadataLocation(org.apache.iceberg.Table table) {
    return ((BaseTable) table).operations().current().metadataFileLocation();
}

I tested it on both Glue and Nessie.

It should make things somewhat easier, but I still hope there will be a cleaner solution in the extension later on.

jacopotagliabue commented 6 months ago

hi @harel-e, just making sure I understand.

If you pass the metadata JSON location you get back from a Nessie endpoint using the standard table API, and then issue something like:

SELECT PULocationID, DOLocationID, trip_miles, trip_time FROM iceberg_scan('s3://bucket/iceberg/taxi/metadata/aaa.metadata.json') WHERE pickup_datetime >= '2022-01-01T00:00:00-05:00' AND pickup_datetime < '2022-01-02T00:00:00-05:00'

you are able to get duckdb iceberg working?

harel-e commented 6 months ago

Yes, DuckDB 0.9.2 with Iceberg is working for me on the following setups:

a. AWS S3 + AWS Glue
b. MinIO + Nessie

wolfeidau commented 3 months ago

I was able to get this working by looking up the current metadata URL using the glue API/CLI, then used that URL to query iceberg.

select count(*) from iceberg_scan('s3://cfanalytics-abc123/cloudfront_logs_analytics/metadata/abf3a652-02cb-4a8e-8b6c-2089a2acfe6c.metadata.json');

Works for me at the moment.

teaguesterling commented 3 months ago

This also appears to be an issue with Iceberg tables created via the Iceberg quick start at https://iceberg.apache.org/spark-quickstart/#docker-compose (using duckdb 0.10.0).

There are a few other oddities and observations:

The "v" prefix expected when looking for the .metadata.json file seems the most burdensome part: it is not terribly difficult to maintain a version-hint.text file, but it would be difficult to rename the versioned metadata files.

jacopotagliabue commented 3 months ago

Confirming that

SELECT PULocationID, DOLocationID, trip_miles, trip_time FROM iceberg_scan('s3://bucket/iceberg/taxi/metadata/aaa.metadata.json') WHERE pickup_datetime >= '2022-01-01T00:00:00-05:00' AND pickup_datetime < '2022-01-02T00:00:00-05:00'

still does not work with a Dremio-created table on a Nessie catalog.

Error is: duckdb.duckdb.Error: Invalid Error: HTTP Error: Unable to connect to URL "https://bauplan-openlake-db87a23.s3.amazonaws.com/iceberg/taxi_fhvhv_partitioned/metadata/00000-136374fe-87d3-4cc6-8202-0a11f6af0b56.metadata.json/metadata/version-hint.text": 404 (Not Found)

Any chance we could make the version hint optional, given it is not part of the official Iceberg spec and many implementations seem to ignore it?

ekrata-main commented 2 months ago

Can confirm that this still does not work for Iceberg tables created with catalog.create_table().

query: f"SELECT * FROM iceberg_scan('{lakehouse_path}') WHERE id = {mock_team_id}"

error: duckdb.duckdb.HTTPException: HTTP Error: Unable to connect to URL "https://local-lakehousesta-locallakehousebuck-mnrnr57ascjc.s3.amazonaws.com/metadata/version-hint.text": 404 (Not Found)

Pyiceberg workaround: load the Iceberg table via a pyiceberg catalog (I'm using Glue), then use its metadata_location field for the scan:

from pyiceberg.catalog import load_catalog

lakehouse_catalog = load_catalog("glue", **{"type": "glue", "s3.region": "us-east-1"})

team_table = lakehouse_catalog.load_table("default.Team")

changed_team_record = conn.sql(
    f"SELECT * FROM iceberg_scan('{team_table.metadata_location}') WHERE id = {mock_team_id}"
).to_df()

mike-luabase commented 3 weeks ago

I'm using this work-around for a SQLite catalog:

import shutil
import os
import sqlite3

def create_metadata_for_tables(tables):
    """
    Iterate through all tables and create metadata files.

    Parameters:
        tables (list): A list of dictionaries, each representing an Iceberg table with a 'metadata_location'.
    """
    for table in tables:
        metadata_location = table['metadata_location'].replace('file://', '')
        metadata_dir = os.path.dirname(metadata_location)
        new_metadata_file = os.path.join(metadata_dir, 'v1.metadata.json')
        version_hint_file = os.path.join(metadata_dir, 'version-hint.text')

        # Ensure the metadata directory exists
        os.makedirs(metadata_dir, exist_ok=True)

        # Copy the metadata file to v1.metadata.json
        shutil.copy(metadata_location, new_metadata_file)
        print(f"Copied metadata file to {new_metadata_file}")

        # Create the version-hint.text file with content "1"
        with open(version_hint_file, 'w') as f:
            f.write('1')
        print(f"Created {version_hint_file} with content '1'")

def get_iceberg_tables(database_path):
    """
    Connect to the SQLite database and retrieve the list of Iceberg tables.

    Parameters:
        database_path (str): The path to the SQLite database file.

    Returns:
        list: A list of dictionaries, each representing an Iceberg table.
    """
    # Connect to the SQLite database
    con_meta = sqlite3.connect(database_path)
    con_meta.row_factory = sqlite3.Row

    # Create a cursor object to execute SQL queries
    cursor = con_meta.cursor()

    # Query to list all tables in the database
    query = 'SELECT * FROM "iceberg_tables" ORDER BY "catalog_name", "table_namespace", "table_name";'

    # Execute the query
    cursor.execute(query)

    # Fetch all results
    results = cursor.fetchall()

    # Convert results to list of dictionaries
    table_list = []
    for row in results:
        row_dict = {key: row[key] for key in row.keys()}
        table_list.append(row_dict)

    # Close the connection
    con_meta.close()

    return table_list

Usage:

database_path = "/your/path"

# Retrieve the list of Iceberg tables
tables = get_iceberg_tables(database_path)

# Create metadata for each table
create_metadata_for_tables(tables)

# Print the final tables list
for table in tables:
    print(table)

karakanb commented 2 weeks ago

I can confirm that the issue persists on duckdb v1.0.0 1f98600c2c with the getting-started examples from the Apache Iceberg docs, using a local MinIO. The file lives in the warehouse bucket with the full URI s3://warehouse/nyc/taxis/metadata/00002-fc696445-7a22-4653-bbca-fc95d070b71a.metadata.json, and I can confirm I can access the file with the AWS CLI using the same path.

Here's what I did in the duckdb CLI:

INSTALL iceberg;
LOAD iceberg;
INSTALL httpfs;
LOAD httpfs;

CREATE SECRET secret1 (
    TYPE S3,
    KEY_ID 'key-here',
    SECRET 'secret-here',
    REGION 'us-east-1',
    ENDPOINT '127.0.0.1:9000',
    USE_SSL 'false'
);

SELECT * FROM iceberg_scan('s3://warehouse/nyc/taxis/metadata/00002-fc696445-7a22-4653-bbca-fc95d070b71a.metadata.json');

> HTTP Error: Unable to connect to URL "http://warehouse.minio.iceberg.orb.local:9000/nyc/taxis/metadata/00002-fc696445-7a22-4653-bbca-fc95d070b71a.metadata.json": 404 (Not Found)
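One thing worth checking here (an assumption on my side, not confirmed in this thread): the failing URL is virtual-hosted style (warehouse.minio...:9000), while MinIO typically expects path-style addressing. DuckDB's S3 secrets accept a URL_STYLE option, so recreating the secret with URL_STYLE 'path' may fix the resolution:

```sql
CREATE SECRET secret1 (
    TYPE S3,
    KEY_ID 'key-here',
    SECRET 'secret-here',
    REGION 'us-east-1',
    ENDPOINT '127.0.0.1:9000',
    URL_STYLE 'path',      -- build http://127.0.0.1:9000/warehouse/... URLs
    USE_SSL 'false'
);
```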