duckdb / duckdb_iceberg


Can't use the extension if my data catalog did not create a version-hint.text file #29

Open jacopotagliabue opened 1 year ago

jacopotagliabue commented 1 year ago

My S3 bucket with Iceberg tables (picture below) cannot be queried with

iceberg_scan('s3://bucket/iceberg', ALLOW_MOVED_PATHS=true)

nor

iceberg_scan('s3://bucket/iceberg/*', ALLOW_MOVED_PATHS=true)

In particular the system is trying to find a very specific file (so the * pattern gets ignored):

duckdb.duckdb.Error: Invalid Error: HTTP Error: Unable to connect to URL https://bucket.s3.amazonaws.com/iceberg/metadata/version-hint.text

Unfortunately that file does not exist in my iceberg/ folder, nor in any of the iceberg/sub/metadata folders. Compared to the data zip in the duckdb docs about Iceberg, it is clear that "my" Iceberg tables are missing that file, which the current implementation depends on.

That said, the version hint seems like something we do not really need: that info could default to a version, or perhaps be an additional parameter, instead of the scan failing when the file is not found?

Original discussion with @Alex-Monahan in dbt Slack is here: note that I was originally pointed to this as a possible cause, so perhaps reading a table that is formally Iceberg is not really independent of the data catalog it belongs to?

[screenshot: s3_structure — bucket layout]

jacopotagliabue commented 1 year ago

Sorry, to be a bit clearer: even if we fix the version-hint problem, the fact that the system uses https://bucket.s3.amazonaws.com/iceberg/metadata/ as a base path does not match the state of my data lake (see the picture above for the current layout, written by Spark with a Nessie catalog).

Happy to help debug this if there's something we can quickly try out.

harel-e commented 12 months ago

I ran into a similar issue on AWS with Glue as the Iceberg catalog.

The metadata files stored in S3 are of the following pattern:

00000-0b4430d2-fbee-4b0d-90c9-725f013d6f82.metadata.json
00001-6e3b4909-7e6b-486f-bf81-b1331eba3ac8.metadata.json

I suspect Glue holds the pointer to the current metadata.

rustyconover commented 12 months ago

It does.

You can see the current pointer in table properties if you call Glue’s DescribeTable
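For example, a minimal boto3 sketch of that lookup (the database and table names here are hypothetical; for Iceberg tables, Glue exposes the pointer in the table's Parameters):

import boto3

# Ask Glue for the table definition; Glue-cataloged Iceberg tables carry
# the current metadata file location in their parameters.
glue = boto3.client("glue")
table = glue.get_table(DatabaseName="analytics", Name="events")["Table"]
metadata_location = table["Parameters"]["metadata_location"]
print(metadata_location)  # e.g. s3://bucket/path/metadata/00001-....metadata.json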


samansmink commented 12 months ago

Currently, no Iceberg catalog implementations are available in the iceberg extension. Without a version hint, you will need to pass the direct path to the correct metadata file manually; check: https://github.com/duckdb/duckdb_iceberg/pull/18

jacopotagliabue commented 12 months ago

@samansmink thanks, but the work-around does not seem to work though: I get s3://bucket/iceberg/taxi_fhvhv_bbb/metadata/aaa.metadata.json from my data catalog manually and pass it to my query:

SELECT PULocationID, DOLocationID, trip_miles, trip_time FROM iceberg_scan('s3://bucket/iceberg/taxi/metadata/aaa.metadata.json') WHERE pickup_datetime >= '2022-01-01T00:00:00-05:00' AND pickup_datetime < '2022-01-02T00:00:00-05:00'

I still get a 404 for the version file:

duckdb.duckdb.Error: Invalid Error: HTTP Error: Unable to connect to URL "....metadata.json/metadata/version-hint.text": 404 (Not Found)

As if it were trying to append metadata/version-hint.text to my JSON path. Am I doing something dumb?

jacopotagliabue commented 12 months ago

Small update: I needed to upgrade to 0.9.2 to scan a JSON file (posting here in case others stumble). The new error I get is No such file or directory on a path the scan found:

"s3a://bucketiceberg/taxi_fhvhv/metadata/snap-aaaa.avro"

If I try with allow_moved_paths (the only thing that came to mind), I then get:

duckdb.duckdb.InvalidInputException: Invalid Input Error: Enabling allow_moved_paths is not enabled for directly scanning metadata files.

Any way around all of this?

jacopotagliabue commented 12 months ago

Small update 2: I think I know why the Avro path resolution does not work, just from looking closely at:

duckdb.duckdb.IOException: IO Error: Cannot open file "s3a://.......avro": No such file or directory

A Nessie table written with Spark uses s3a:// as the prefix, not s3:// as duckdb presumably does. In fact, if I manually change s3a://.......avro into s3://.......avro, I can find the file in my data lake!

A quick way to patch this would be to replace the Nessie prefix with the standard s3 one for object storage paths (or to add a flag that toggles that behavior, etc.). A longer-term fix would be to have Nessie return general paths rather than Nessie-specific ones.

What do you think could be a short-term work-around @samansmink ?

samansmink commented 12 months ago

@jacopotagliabue s3a URLs are indeed not supported currently.

If s3a:// URLs are interoperable with s3:// URLs, which, as far as I can tell from a quick look, seems to be the case, we could consider adding support to duckdb, which would solve this issue.

jacopotagliabue commented 12 months ago

That would be great and the easiest fix. I'll reach out to the Nessie folks anyway to let them know about this, but if you could make the change in duckdb, that would (presumably?) solve the current issue.

samansmink commented 12 months ago

https://github.com/duckdb/duckdb/pull/9817

harel-e commented 11 months ago

For Java Iceberg users out there, I found a solution to retrieve the latest metadata without having to query the catalog directly.

Once you load the table from the catalog, you can call the following method, which returns the latest metadata location. You can use that location with the iceberg_scan function.

import org.apache.iceberg.BaseTable;
import org.apache.iceberg.Table;

public static String currentMetadataLocation(Table table) {
    return ((BaseTable) table).operations().current().metadataFileLocation();
}

I tested it on both Glue and Nessie.

It should make things somewhat easier, but I still hope there will be a cleaner solution in the extension later on.

jacopotagliabue commented 11 months ago

Hi @harel-e, just making sure I understand.

If you take the metadata JSON location you get back from a Nessie endpoint using the standard table API, and then issue something like:

SELECT PULocationID, DOLocationID, trip_miles, trip_time FROM iceberg_scan('s3://bucket/iceberg/taxi/metadata/aaa.metadata.json') WHERE pickup_datetime >= '2022-01-01T00:00:00-05:00' AND pickup_datetime < '2022-01-02T00:00:00-05:00'

you are able to get duckdb + Iceberg working?

harel-e commented 11 months ago

Yes, DuckDB 0.9.2 with Iceberg is working for me on the following setups:

a. AWS S3 + AWS Glue
b. MinIO + Nessie

wolfeidau commented 8 months ago

I was able to get this working by looking up the current metadata URL using the Glue API/CLI, then using that URL to query Iceberg.

select count(*) from iceberg_scan('s3://cfanalytics-abc123/cloudfront_logs_analytics/metadata/abf3a652-02cb-4a8e-8b6c-2089a2acfe6c.metadata.json');

Works for me at the moment.

teaguesterling commented 8 months ago

This appears to also be an issue with iceberg tables created using the Iceberg quick start at https://iceberg.apache.org/spark-quickstart/#docker-compose (using duckdb 0.10.0)

There are a few other oddities and observations:

The "v" prefix expected when looking for the .metadata.json file seems to be the most burdensome part: it's not terribly difficult to maintain a version-hint.text file, but it would be difficult to rename versions.

jacopotagliabue commented 7 months ago

Confirming that

SELECT PULocationID, DOLocationID, trip_miles, trip_time FROM iceberg_scan('s3://bucket/iceberg/taxi/metadata/aaa.metadata.json') WHERE pickup_datetime >= '2022-01-01T00:00:00-05:00' AND pickup_datetime < '2022-01-02T00:00:00-05:00'

still does not work with a Dremio-created table on a Nessie catalog.

Error is: duckdb.duckdb.Error: Invalid Error: HTTP Error: Unable to connect to URL "https://bauplan-openlake-db87a23.s3.amazonaws.com/iceberg/taxi_fhvhv_partitioned/metadata/00000-136374fe-87d3-4cc6-8202-0a11f6af0b56.metadata.json/metadata/version-hint.text": 404 (Not Found)

Any chance we could make the version hint optional, given that it is not part of the official Iceberg spec and many implementations seem to ignore it?

ekrata-main commented 7 months ago

Can confirm that this still does not work for iceberg tables created with catalog.create_table()

query: f"SELECT * FROM iceberg_scan('{lakehouse_path}') WHERE id = {mock_team_id}"

error: duckdb.duckdb.HTTPException: HTTP Error: Unable to connect to URL "https://local-lakehousesta-locallakehousebuck-mnrnr57ascjc.s3.amazonaws.com/metadata/version-hint.text": 404 (Not Found)

Pyiceberg workaround: load the Iceberg table using a pyiceberg catalog (I'm using Glue), then use the metadata_location field for the scan:

from pyiceberg.catalog import load_catalog

lakehouse_catalog = load_catalog("glue", **{"type": "glue", "s3.region": "us-east-1"})
team_table = lakehouse_catalog.load_table("default.Team")
changed_team_record = conn.sql(
    f"SELECT * FROM iceberg_scan('{team_table.metadata_location}') WHERE id = {mock_team_id}"
).to_df()

mike-luabase commented 5 months ago

I'm using this workaround for a SQLite catalog:

import shutil
import os
import sqlite3

def create_metadata_for_tables(tables):
    """
    Iterate through all tables and create metadata files.

    Parameters:
        tables (list): A list of dictionaries, each representing an Iceberg table with a 'metadata_location'.
    """
    for table in tables:
        metadata_location = table['metadata_location'].replace('file://', '')
        metadata_dir = os.path.dirname(metadata_location)
        new_metadata_file = os.path.join(metadata_dir, 'v1.metadata.json')
        version_hint_file = os.path.join(metadata_dir, 'version-hint.text')

        # Ensure the metadata directory exists
        os.makedirs(metadata_dir, exist_ok=True)

        # Copy the metadata file to v1.metadata.json
        shutil.copy(metadata_location, new_metadata_file)
        print(f"Copied metadata file to {new_metadata_file}")

        # Create the version-hint.text file with content "1"
        with open(version_hint_file, 'w') as f:
            f.write('1')
        print(f"Created {version_hint_file} with content '1'")

def get_iceberg_tables(database_path):
    """
    Connect to the SQLite database and retrieve the list of Iceberg tables.

    Parameters:
        database_path (str): The path to the SQLite database file.

    Returns:
        list: A list of dictionaries, each representing an Iceberg table.
    """
    # Connect to the SQLite database
    con_meta = sqlite3.connect(database_path)
    con_meta.row_factory = sqlite3.Row

    # Create a cursor object to execute SQL queries
    cursor = con_meta.cursor()

    # Query to list all tables in the database
    query = 'SELECT * FROM "iceberg_tables" ORDER BY "catalog_name", "table_namespace", "table_name";'

    # Execute the query
    cursor.execute(query)

    # Fetch all results
    results = cursor.fetchall()

    # Convert results to list of dictionaries
    table_list = []
    for row in results:
        row_dict = {key: row[key] for key in row.keys()}
        table_list.append(row_dict)

    # Close the connection
    con_meta.close()

    return table_list

Usage:

database_path = "/your/path"

# Retrieve the list of Iceberg tables
tables = get_iceberg_tables(database_path)

# Create metadata for each table
create_metadata_for_tables(tables)

# Print the final tables list
for table in tables:
    print(table)
karakanb commented 5 months ago

I can confirm that the issue persists on duckdb v1.0.0 1f98600c2c with the getting-started examples from the Apache Iceberg docs using local MinIO. The file lives in the warehouse bucket with the full URI s3://warehouse/nyc/taxis/metadata/00002-fc696445-7a22-4653-bbca-fc95d070b71a.metadata.json, and I can confirm I can access the file using the AWS CLI with the same path.

Here's what I did in the duckdb CLI:

INSTALL iceberg;
LOAD iceberg;
INSTALL httpfs;
LOAD httpfs;

CREATE SECRET secret1 (
    TYPE S3,
    KEY_ID 'key-here',
    SECRET 'secret-here',
    REGION 'us-east-1',
    ENDPOINT '127.0.0.1:9000',
    USE_SSL 'false'
);

SELECT * FROM iceberg_scan('s3://warehouse/nyc/taxis/metadata/00002-fc696445-7a22-4653-bbca-fc95d070b71a.metadata.json');

> HTTP Error: Unable to connect to URL "http://warehouse.minio.iceberg.orb.local:9000/nyc/taxis/metadata/00002-fc696445-7a22-4653-bbca-fc95d070b71a.metadata.json": 404 (Not Found)
lamb-russell commented 4 months ago

PyIceberg (as of version 0.6) does not create these version-hint.text files for the tables it creates. This makes it difficult to use Iceberg with duckdb unless you're reading files created with Spark or another engine that writes version-hint.text files. Being reliant on a Spark engine to do writes undermines one major advantage of duckdb: that it can do more with less. PyIceberg is a lightweight Iceberg library that could be a great pairing with duckdb and this extension.

Not sure if true, but version-hint.text files aren't part of the standard spec, as noted in this issue.

For an example of a notebook creating an Iceberg table that cannot be read by duckdb, see here.

teaguesterling commented 4 months ago

I'd be interested in taking a stab at resolving this issue but would want to get some agreement on what the appropriate logic flow should be. As I see it, there are at least a few questions:

  1. Would we want it to always check for a version-hint.txt, or allow it as an (on by default) option?
  2. Would we want to allow for handling both the v0000#... and the 0000#... formats? Do we check for the v000#... version first and then for the non-prefixed version, or vice versa?
  3. In the event that no version-hint.txt is present, what should the logic be to find the appropriate metadata file? Should we list them all out and select the one with the highest version #? If so, do we do that every time the table is read?

lamb-russell commented 3 months ago
  1. I would suggest using version-hint.text if it's there; otherwise the caller must pass in the latest metadata path. My reasoning is that it's less disruptive to keep supporting the existing version hint by default, but there should be clearer error messaging so users know what the issue is.
  2. Allow a default style (e.g. v0000#), but have an argument for a different prefix style. Default to whatever the current default is (e.g. v0000#).
  3. Make the user provide one explicitly. They can find the latest version in their catalog.
teaguesterling commented 3 months ago

@lamb-russell I agree. I was thinking about this a bit more, and I think we'd do more or less what you proposed, with a few adjustments:

  1. Agreed. Internally, I think it also makes sense to change some of the internal functions to depend on a specific metadata file instead of searching lower. That way we can identify if the error will occur much higher in the stack (and it reduces the overall complexity).
  2. I think we could go a step further and have a table_version_format printf-style string that merges the prefix, table version, compression codec, and extension (see the sketch after this list). It may be overkill, but it'd give more flexibility in the face of a feature with no standard. The current (and default) behavior would look something like "metadata/v%s%s.metadata.json". I could see this eventually being configurable instead of being a constant.
  3. Agreed. It may be nice to have a utility function to find the latest version with some sort of rule: either highest 00### value or latest created/modified data? Maybe that's overkill, so I am not going to touch that for now.
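
A rough sketch of how such a printf-style format might expand (illustrative only; version_name_format and metadata_compression_codec are the names discussed above, not necessarily the extension's internals):

def metadata_file_name(version: str,
                       version_name_format: str = "v%s%s.metadata.json",
                       metadata_compression_codec: str = "") -> str:
    # Gzip-compressed Iceberg metadata conventionally carries a ".gz" infix.
    suffix = ".gz" if metadata_compression_codec == "gzip" else ""
    return version_name_format % (version, suffix)

print(metadata_file_name("1"))  # v1.metadata.json
print(metadata_file_name("1", metadata_compression_codec="gzip"))  # v1.gz.metadata.json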
teaguesterling commented 3 months ago

FYI, I'll be working on this in https://github.com/duckdb/duckdb_iceberg/pull/63. It may take me a little bit to get my dev environment up and running, but I hope to wrap it up soon.

Suggestions for the possible solution (or changes in direction) welcome.

teaguesterling commented 3 months ago

This is in a working state now. It essentially adds two parameters to the iceberg functions:

teaguesterling commented 3 months ago

As this PR doesn't make it "just work" yet, are there any other changes we'd want to consider to support the absence of the version hint?

It would be heavy handed, but we could glob out all the metadata files and pick the largest version (or most recent)?
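
A rough client-side sketch of that idea (paths are hypothetical; it leans on Iceberg's zero-padded 00000-, 00001-, ... prefixes sorting lexicographically):

import glob

import duckdb

# Naively treat the lexicographically largest metadata file as the current version.
candidates = sorted(glob.glob("warehouse/default.db/test/metadata/*.metadata.json"))
latest = candidates[-1]
duckdb.sql("INSTALL iceberg; LOAD iceberg;")
print(duckdb.sql(f"SELECT count(*) FROM iceberg_scan('{latest}')"))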

ramonvermeulen commented 3 months ago

@teaguesterling

I just made a build from the latest main branch (including your changes) via make debug; however, when I try to use the version parameter I get the following error:

D SELECT count(*) FROM iceberg_scan("/home/ramon/Desktop/PROJECTS/personal/duckdb-iceberg-test/warehouse/default.db/test", version="00011-09891886-f8e1-403b-a5e9-a36cb11a41c0");
IO Error: Cannot open file "file:///home/ramon/Desktop/PROJECTS/personal/duckdb-iceberg-test/warehouse/default.db/test/metadata/snap-9095995686510048943-0-fdce8f11-65e4-4ae8-9c8e-e0f11cd48f2d.avro": No such file or directory

It definitely finds the metadata file; otherwise it couldn't have found the Avro path. The weird thing is, the Avro file does actually exist. For example, when I cat it without the file:// prefix:

cat /home/ramon/Desktop/PROJECTS/personal/duckdb-iceberg-test/warehouse/default.db/test/metadata/snap-9095995686510048943-0-fdce8f11-65e4-4ae8-9c8e-e0f11cd48f2d.avro      
Obj
   snapshot-id&9095995686510048943$parent-snapshot-id$866746840554898102sequence-number20format-version2avro.schema�{"type": "record", "fields": [{"name": "manifest_path", "field-id": 500, "type": "string", "doc": "Location URI with FS scheme"}, {"name": "manifest_length", "field-id": 501, "type": "long

Also, to be certain that it is not some kind of OS-level permission issue, I put 777 on the exact Avro file. Any idea what could be causing this? I created the Iceberg table via pyiceberg with basically the default example from their quickstart.

teaguesterling commented 2 months ago

I was able to reproduce this and also observed the same behavior when looking at the test suite files. I'm going to re-test on the original PR branch to see if I can trace it down. Oddly, it seems like the 1st filesystem action works and all subsequent actions fail.

teaguesterling commented 2 months ago

Upon further investigation, the file:// paths are coming from the metadata files themselves. I was originally mistaken in thinking this was also happening with the test suite files. It does seem to be caused by subsequent filesystem actions, but only because the paths in the metadata files don't work with duckdb_iceberg's expectations.

I suspect what you're observing is actually a different issue, having to do with the way pyiceberg represents relative paths being incompatible with the way the duckdb_iceberg extension handles them. This should probably be tracked as a separate issue. It seems we probably need to handle the file:// prefix when reading paths from the metadata files.

I don't know this part of the code well enough yet to make any other claims about it. I can show that it does seem to be related to paths with a few tests:

$ ./build/release/duckdb -c "FROM iceberg_metadata('test2/warehouse/default.db/taxi_dataset', version='00003-afdf5761-964a-4766-986d-c6d20f942df0')"
IO Error: Cannot open file "file://test2/warehouse/default.db/taxi_dataset/metadata/snap-4670253308400268764-0-71d373c0-e6b2-4eef-9904-c90c05e9c4c3.avro": No such file or directory
# This is the error reported above

$ grep -l file://test2/warehouse/default.db/taxi_dataset/metadata/snap-4670253308400268764-0-71d373c0-e6b2-4eef-9904-c90c05e9c4c3.avro test2/warehouse/default.db/taxi_dataset/metadata/*.json
test2/warehouse/default.db/taxi_dataset/metadata/00003-afdf5761-964a-4766-986d-c6d20f942df0.metadata.json
# We see that the file://...avro string is actually in the metadata.json file

$ sed -i "s|file://|$PWD/|g" test2/warehouse/default.db/taxi_dataset/metadata/00003-afdf5761-964a-4766-986d-c6d20f942df0.metadata.json
# We can remove the file:// prefix and make it a hard-coded absolute path in the JSON file
# Note that the same thing happens even if you make the path relative.

$ ./build/release/duckdb -c "FROM iceberg_metadata('test2/warehouse/default.db/taxi_dataset', version='00003-afdf5761-964a-4766-986d-c6d20f942df0')"
IO Error: Cannot open file "file://test2/warehouse/default.db/taxi_dataset/metadata/71d373c0-e6b2-4eef-9904-c90c05e9c4c3-m0.avro": No such file or directory
# We still get an error, but it's for a different file this time, not a `snap-....avro` file, but the `...-m0.avro` file

$ grep -l file://test2/warehouse/default.db/taxi_dataset/metadata/71d373c0-e6b2-4eef-9904-c90c05e9c4c3-m0.avro test2/warehouse/default.db/taxi_dataset/metadata/*
test2/warehouse/default.db/taxi_dataset/metadata/snap-4670253308400268764-0-71d373c0-e6b2-4eef-9904-c90c05e9c4c3.avro
# This path is actually hard-coded in the avro file, so naively changing it will cause corruption.

I suspect it's a bigger problem that pyiceberg is writing out the paths to files relative to the working directory (note test2/... in the tests above) at creation time, though. All of this will break if the table ends up getting accessed from a different working directory.

ramonvermeulen commented 2 months ago

> I was able to reproduce this and also observed the same behavior when looking at the test suite files. […]

At least glad to hear that you're able to reproduce it, then it's not just me!

> Upon further investigation, the file:// paths are coming from the metadata files themselves. […] I suspect it's a bigger problem that pyiceberg is writing out the paths to files relative to the working directory at creation time, though. All of this will break if the table ends up getting accessed from a different working directory.

That's interesting: in the example that I tested, the paths that pyiceberg wrote were absolute, not relative to the working directory. Apparently it depends on how you configure pyiceberg's uri and warehouse.

For example, with f"file://{Path(__file__).parent}/warehouse" it works fine and writes absolute paths, but with "file://./warehouse" it writes relative paths. Then I found out you can also just configure the uri without the file:// prefix that they use in their example, and everything seems to work fine. Currently building the extension to see if I am able to load a new (py)iceberg table that uses absolute paths without the file:// prefix.

UPDATE: Yes you can.

D SELECT count(*) FROM iceberg_scan("/home/ramon/Desktop/PROJECTS/personal/duckdb-iceberg-test/warehouse/default.db/test", version="00002-49aec47d-117b-494a-a22e-6198661826b0");
┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│      3066766 │
└──────────────┘
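
For reference, a minimal sketch of the pyiceberg setup described above (modeled on the pyiceberg quickstart; paths are hypothetical), using an absolute warehouse path with no file:// prefix so pyiceberg writes absolute locations into its metadata:

from pathlib import Path

from pyiceberg.catalog.sql import SqlCatalog

warehouse = Path("warehouse").resolve()  # absolute path, no file:// prefix
catalog = SqlCatalog(
    "default",
    uri=f"sqlite:///{warehouse}/pyiceberg_catalog.db",
    warehouse=str(warehouse),
)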

> As this PR doesn't make it "just work" yet, are there any other changes we'd want to consider to support the absence of the version hint? It would be heavy handed, but we could glob out all the metadata files and pick the largest version (or most recent)?

I think this could be a potential solution; at least you wouldn't have to provide the version flag yourself anymore. However, I don't know the internals of Iceberg well enough to say if this is a good solution.

teaguesterling commented 2 months ago

> Then I found out you can also just configure the uri without the file:// prefix that they use in their example, and everything seems to work fine. […] UPDATE: Yes you can.

That's great that it works without the file:// prefix! This would imply that we probably should add some more intelligent handling of URL schemes to the metadata loading. If file:// doesn't work, I don't know if other schemes would be expected to work either. It's probably not too hard to make this change, but it will touch a lot of places.

> I think this could be a potential solution; at least you wouldn't have to provide the version flag yourself anymore. […]

Neither do I, but I'm not opposed to it. Perhaps something where version can also be provided as a pattern, in addition to a version or hint file?

teaguesterling commented 2 months ago

Actually, I took another look at this and I think the issue is deeper than the iceberg extension. It seems duckdb won't work with file:// at all. It may be worth raising an issue upstream about this to create a subfilesystem for the "file://" protocol.

See this example:

from 'file://data/iceberg/generated_spec1_0_001/pyspark_iceberg_table/data/00000-5-bd694195-a731-4121-be17-0a6b13d4e9fb-00001.parquet';
IO Error: No files found that match the pattern "file://data/iceberg/generated_spec1_0_001/pyspark_iceberg_table/data/00000-5-bd694195-a731-4121-be17-0a6b13d4e9fb-00001.parquet"
kevinjqliu commented 2 months ago

Duckdb's delta_scan function currently handles the file:// prefix, as mentioned in https://github.com/duckdb/duckdb_iceberg/issues/38#issuecomment-2218300710

While I agree that supporting file:// natively in Duckdb should be the proper long-term fix, perhaps we can handle the prefix in this extension as a workaround.
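
Until then, one stopgap (a sketch, not extension behavior; the path is hypothetical) is to strip the scheme before handing the path to duckdb. Note this only fixes the path you pass in; file:// paths embedded inside the metadata files would still fail:

import duckdb

uri = "file:///home/user/warehouse/default.db/test/metadata/v1.metadata.json"
local_path = uri.removeprefix("file://")  # duckdb expects a plain local path
duckdb.sql("INSTALL iceberg; LOAD iceberg;")
print(duckdb.sql(f"SELECT count(*) FROM iceberg_scan('{local_path}')"))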

humaidkidwai commented 2 months ago

Has anyone faced this error? SQL Error: java.sql.SQLException: Binder Error: Table "iceberg_scan_deletes" does not have a column named "file_path"

This is what my duckdb setup looks like:

INSTALL iceberg;
LOAD iceberg;
INSTALL httpfs;
LOAD httpfs;

SET s3_access_key_id='key';
SET s3_secret_access_key='secretKey';
SET s3_region='us-east-1';
SET s3_use_ssl=true;
SET s3_url_style='path';

SELECT * FROM iceberg_scan('s3://my-bucket/observation/metadata/00004-bc91e4be-ee63-4922-89eb-f7730dbbee82.metadata.json');

kevinjqliu commented 2 months ago

That error is coming from the binder https://github.com/duckdb/duckdb/blob/main/src/planner/table_binding.cpp#L222-L225 https://duckdb.org/docs/internals/overview.html#binder

I don't think it's related to reading the Iceberg table.

humaidkidwai commented 2 months ago

Yeah, I realized the Iceberg extension cannot read tables that have had updates or deletes applied; it can only read tables that have been written once. So merge-on-read (MoR) tables cannot be read.

kevinjqliu commented 2 months ago

@humaidkidwai I see, that's a good point. Do you mind opening a new issue on this repo to track that, and perhaps including your example above? It's another issue we should tackle.

soumilshah1995 commented 2 weeks ago

Any solution for this?

bhaskar-pv commented 2 weeks ago

Sad to see that while so many libraries now support Iceberg, duckdb still can't fully support it because of this issue.

ramonvermeulen commented 2 weeks ago

> […] Perhaps something where version can also be provided as a pattern, in addition to a version or hint file?

Since demand for this issue/feature seems to be increasing, I was wondering if this could still be a good potential solution? I don't know the Iceberg format well enough to be really opinionated about this. If it is, I am willing to put some time into it and hopefully come up with a PR that resolves the issue for scenarios without a version or hint file.

teaguesterling commented 2 weeks ago

I took a look at the file:// protocol and convinced myself that it should be handled in duckdb directly. It's not that the Iceberg extension doesn't handle it; it's that duckdb doesn't at all. The upside is that other protocols are likely not broken. S3 works for me.

The option to infer the latest version from file path patterns is something I've wanted to do for a while now. Maybe it's time.

I held off from including this in the version PR as I wasn't sure if it would actually make things worse/more confusing by not following the expectation of having a data catalog to provide that information. However, in the spirit of "just give me my data", maybe we should go for it.

Given the number of possible implementations, perhaps we should map out a spec on how this would work in this thread? I know the details of the extension well enough now that I would be happy to advise/help with the implementation.

teaguesterling commented 1 week ago

I've taken some time this weekend to take a stab at finally finishing this. The new implementation adds one additional possible value you can provide to version that will try to guess the best table version. It also changes the default version from "version-hint.text" to "?", making this the default behavior. When version="?", the following steps are followed:

  1. Check to see if a version-hint.text file exists. If so, use it.
  2. Create a glob pattern using the version_name_format and metadata_compression_codec values. Right now, this just uses "*" as the version, but future versions could be more thoughtful about this.
  3. Glob the metadata directory to retrieve all matching metadata files.
  4. (Naively) sort the metadata files and pick the last one. This hopes that the latest version has the largest version number.

If you had a directory with the following files:

  • warehouse/default.db/taxi_dataset/metadata/00000-17a6f9e5-48bb-4367-aa00-20190eb3c16e.metadata.json
  • warehouse/default.db/taxi_dataset/metadata/00001-a73e34f8-1bd7-4e72-a357-641a92541ade.metadata.json
  • warehouse/default.db/taxi_dataset/metadata/00002-fdfafee9-a5a0-4d0b-bc09-1caf708346ab.metadata.json
  • warehouse/default.db/taxi_dataset/metadata/00003-afdf5761-964a-4766-986d-c6d20f942df0.metadata.json
  • warehouse/default.db/taxi_dataset/metadata/637f7b78-0006-4bda-b76a-8d7715fdcecb-m0.avro
  • warehouse/default.db/taxi_dataset/metadata/71d373c0-e6b2-4eef-9904-c90c05e9c4c3-m0.avro
  • warehouse/default.db/taxi_dataset/metadata/d8d5ba7b-9db1-447b-bc39-d94274d5e3a0-m0.avro
  • warehouse/default.db/taxi_dataset/metadata/snap-4670253308400268764-0-71d373c0-e6b2-4eef-9904-c90c05e9c4c3.avro
  • warehouse/default.db/taxi_dataset/metadata/snap-8163938669155459790-0-d8d5ba7b-9db1-447b-bc39-d94274d5e3a0.avro
  • warehouse/default.db/taxi_dataset/metadata/snap-88534494938550919-0-637f7b78-0006-4bda-b76a-8d7715fdcecb.avro

Loading with iceberg_scan("warehouse/default.db/test", version="?", metadata_compression_codec="", version_name_format="v%s%s.metadata.json,%s%s.metadata.json") (the default values) will do the following:

  1. Check for warehouse/default.db/taxi_dataset/metadata/version-hint.txt, which doesn't exist.
  2. Glob the directory using the pattern "v*.metadata.json" (since the metadata_compression_codec is not "gzip").
  3. No results are returned, so it will continue to the next pattern and glob for "*.metadata.json".
  4. It will retrieve all .metadata.json files and pass them to the PickTableVersion function.
  5. The file names will be sorted and the last one will be selected, resulting in table version 00003-afdf5761-964a-4766-986d-c6d20f942df0 being selected.

Note that I haven't finished testing this yet or pushed it, but I have put the implementation in place. Once we finish testing, review, and documentation, it should be sufficient to close this issue, as no version hint, version, or catalog will be required to load an Iceberg table the way we'd expect to.

ramonvermeulen commented 1 week ago

> I've taken some time this weekend to take a stab at finally finishing this. The new implementation adds one additional possible value you can provide to version that will try to guess the best table version. […] Once we finish testing, review, and documentation, it should be sufficient to close this issue, as no version hint, version, or catalog will be required to load an Iceberg table the way we'd expect to.

Hi @teaguesterling,

Thanks again for putting in this effort, let me know if you want someone to test this functionality. I can build from source and test on a couple of different iceberg tables I have locally (some created via pyiceberg and therefore not having a version / version-hint file).

Would be nice to have some of the main contributors involved in this to get a good, opinionated view on this implementation.

Any chance one of you has time to take a look at this? @samansmink @carlopi

samansmink commented 1 week ago

Happy to review a PR implementing this, I think it makes sense!