jacopotagliabue opened 1 year ago
Sorry, to be a bit clearer: even if we fix the version-hint
problem, the fact that the system is looking at https://bucket.s3.amazonaws.com/iceberg/metadata/
as a base path does not seem to be aligned with the state of my data lake (see the picture above for the current layout, written by Spark + Nessie).
Happy to help debug this if there's something we can quickly try out.
I ran into a similar issue using AWS with Glue as the catalog for Iceberg.
The metadata files stored in S3 are of the following pattern:
00000-0b4430d2-fbee-4b0d-90c9-725f013d6f82.metadata.json 00001-6e3b4909-7e6b-486f-bf81-b1331eba3ac8.metadata.json
I suspect Glue holds the pointer to the current metadata.
It does.
You can see the current pointer in table properties if you call Glue’s DescribeTable
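For instance, a minimal sketch with boto3 (database and table names here are hypothetical; the Iceberg Glue catalog keeps the pointer in the table parameters, and the underlying API call is get_table):
import boto3

# hypothetical database/table names; adjust to your catalog
glue = boto3.client("glue", region_name="us-east-1")
table = glue.get_table(DatabaseName="mydb", Name="taxi")["Table"]
# the Iceberg Glue catalog stores the current metadata pointer here
print(table["Parameters"]["metadata_location"])
That location can then be passed directly to iceberg_scan.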
Currently no iceberg catalog implementations are available in the iceberg extension. Without a version hint you will need to pass the direct path to the correct metadata file manually, check: https://github.com/duckdb/duckdb_iceberg/pull/18
@samansmink thanks, but the work-around does not seem to work though: I get s3://bucket/iceberg/taxi_fhvhv_bbb/metadata/aaa.metadata.json from my data catalog manually and pass it to my query:
SELECT PULocationID, DOLocationID, trip_miles, trip_time FROM iceberg_scan('s3://bucket/iceberg/taxi/metadata/aaa.metadata.json') WHERE pickup_datetime >= '2022-01-01T00:00:00-05:00' AND pickup_datetime < '2022-01-02T00:00:00-05:00'
I still get a 404 on the version file:
duckdb.duckdb.Error: Invalid Error: HTTP Error: Unable to connect to URL "....metadata.json/metadata/version-hint.text": 404 (Not Found)
It's as if it were appending metadata/version-hint.text to my JSON path. Am I doing something dumb?
Small update - I needed to update to 0.9.2 to scan a json file (posting here in case others stumble). The new error I get is No such file or directory on a path the scan found:
"s3a://bucketiceberg/taxi_fhvhv/metadata/snap-aaaa.avro"
If I try with allow_moved_paths (the only thing that came to mind), I then get:
duckdb.duckdb.InvalidInputException: Invalid Input Error: Enabling allow_moved_paths is not enabled for directly scanning metadata files.
Any way around all of this?
Small update 2 - I think I know why the avro path resolution does not work, just by looking closely at:
duckdb.duckdb.IOException: IO Error: Cannot open file "s3a://.......avro": No such file or directory
A Nessie file system (written with Spark) uses s3a:// as the prefix, not s3:// like duckdb presumably does. In fact, if I manually change s3a://.......avro into s3://.......avro, I can find the file in my data lake!
A quick way to patch this would be to replace the Nessie prefix with the standard s3 one for object storage paths (or allow a flag that somehow toggles that behavior, etc.). A longer-term fix would be to have Nessie return more general, non-Nessie-specific paths.
What do you think could be a short-term work-around, @samansmink?
@jacopotagliabue s3a urls are indeed not supported currently.
If s3a:// urls are interoperable with s3 urls, which, as far as I can tell from a quick look, seems to be the case, we could consider adding support to duckdb, which would solve this issue.
That would be great and the easiest fix - I'll reach out to the Nessie folks anyway to let them know about this, but if you could make the change in duckdb, that would (presumably?) solve the current issue.
For Java iceberg users out there, I found a solution to retrieve the latest metadata location without having to query the catalog API directly.
Once you load the table from the catalog, you can call the following method, which returns the latest metadata location. You can use that location with the iceberg_scan function.
import org.apache.iceberg.BaseTable;

public static String currentMetadataLocation(org.apache.iceberg.Table table) {
    return ((BaseTable) table).operations().current().metadataFileLocation();
}
I tested it on both Glue and Nessie.
It should make it somewhat easier, but I still hope there will be a cleaner solution in the extension later on
hi @harel-e, just making sure I understand.
If you pass the json you get back from a nessie endpoint using the standard API for the table, and then issue something like:
SELECT PULocationID, DOLocationID, trip_miles, trip_time FROM iceberg_scan('s3://bucket/iceberg/taxi/metadata/aaa.metadata.json') WHERE pickup_datetime >= '2022-01-01T00:00:00-05:00' AND pickup_datetime < '2022-01-02T00:00:00-05:00'
you are able to get duckdb iceberg working?
Yes, DuckDB 0.9.2 with Iceberg is working for me on the following setups:
a. AWS S3 + AWS Glue
b. MinIO + Nessie
I was able to get this working by looking up the current metadata URL using the Glue API/CLI, then using that URL to query iceberg.
select count(*) from iceberg_scan('s3://cfanalytics-abc123/cloudfront_logs_analytics/metadata/abf3a652-02cb-4a8e-8b6c-2089a2acfe6c.metadata.json');
Works for me at the moment.
This appears to also be an issue with iceberg tables created using the Iceberg quick start at https://iceberg.apache.org/spark-quickstart/#docker-compose (using duckdb 0.10.0)
There are a few other oddities and observations:
- If you manually create a version-hint.text file pointing to one of the existing metadata.json files, the iceberg scanner ends up looking for a file prefixed with a "v": e.g., 00000-d30b41d6-48c0-42db-b32e-29083b874a80 in version-hint.text looks for v00000-d30b41d6-48c0-42db-b32e-29083b874a80.metadata.json, but only 00000-d30b41d6-48c0-42db-b32e-29083b874a80.metadata.json exists in the directory.
- If you copy the .metadata.json file to the expected v....metadata.json path, everything works as expected.
- If the scanner ends up reading a binary minio lx.meta file as the .metadata.json file (as I did), you can crash DuckDB with a segfault --- which may be more of a security risk than anything else.
- If version-hint.text contains invalid characters for a path (e.g., a trailing newline), they will be directly included in the requested ...metadata.json path.
The prefixing of the "v" when looking for the .metadata.json seems to be the most burdensome, as it's not terribly difficult to maintain a version-hint.text file, but it would be difficult to rename versions.
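As a stop-gap, the copy workaround above can be scripted; a minimal sketch for a local table (paths hypothetical, and note printf avoids the trailing-newline pitfall mentioned above):
# hypothetical table location; adjust to your layout
cd warehouse/db/table/metadata
cp 00000-d30b41d6-48c0-42db-b32e-29083b874a80.metadata.json v1.metadata.json
# no trailing newline, or it leaks into the requested path
printf '1' > version-hint.text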
Confirming that
SELECT PULocationID, DOLocationID, trip_miles, trip_time FROM iceberg_scan('s3://bucket/iceberg/taxi/metadata/aaa.metadata.json') WHERE pickup_datetime >= '2022-01-01T00:00:00-05:00' AND pickup_datetime < '2022-01-02T00:00:00-05:00'
still does not work with a Dremio-created table on a Nessie catalog.
The error is:
duckdb.duckdb.Error: Invalid Error: HTTP Error: Unable to connect to URL "https://bauplan-openlake-db87a23.s3.amazonaws.com/iceberg/taxi_fhvhv_partitioned/metadata/00000-136374fe-87d3-4cc6-8202-0a11f6af0b56.metadata.json/metadata/version-hint.text": 404 (Not Found)
Any chance we could make the version hint optional, given that it is not part of the official Iceberg spec and many implementations seem to ignore it?
Can confirm that this still does not work for iceberg tables created with catalog.create_table().
query: f"SELECT * FROM iceberg_scan('{lakehouse_path}') WHERE id = {mock_team_id}"
error: duckdb.duckdb.HTTPException: HTTP Error: Unable to connect to URL "https://local-lakehousesta-locallakehousebuck-mnrnr57ascjc.s3.amazonaws.com/metadata/version-hint.text": 404 (Not Found)
For reference, the pyiceberg setup and the catalog-based metadata_location look-up:
from pyiceberg.catalog import load_catalog

lakehouse_catalog = load_catalog("glue", **{"type": "glue", "s3.region": "us-east-1"})
team_table = lakehouse_catalog.load_table("default.Team")
changed_team_record = conn.sql(
    f"SELECT * FROM iceberg_scan('{team_table.metadata_location}') WHERE id = {mock_team_id}"
).to_df()
I'm using this work-around for a sqlite catalog:
import shutil
import os
import sqlite3
def create_metadata_for_tables(tables):
"""
Iterate through all tables and create metadata files.
Parameters:
tables (list): A list of dictionaries, each representing an Iceberg table with a 'metadata_location'.
"""
for table in tables:
metadata_location = table['metadata_location'].replace('file://', '')
metadata_dir = os.path.dirname(metadata_location)
new_metadata_file = os.path.join(metadata_dir, 'v1.metadata.json')
version_hint_file = os.path.join(metadata_dir, 'version-hint.text')
# Ensure the metadata directory exists
os.makedirs(metadata_dir, exist_ok=True)
# Copy the metadata file to v1.metadata.json
shutil.copy(metadata_location, new_metadata_file)
print(f"Copied metadata file to {new_metadata_file}")
# Create the version-hint.text file with content "1"
with open(version_hint_file, 'w') as f:
f.write('1')
print(f"Created {version_hint_file} with content '1'")
def get_iceberg_tables(database_path):
"""
Connect to the SQLite database and retrieve the list of Iceberg tables.
Parameters:
database_path (str): The path to the SQLite database file.
Returns:
list: A list of dictionaries, each representing an Iceberg table.
"""
# Connect to the SQLite database
con_meta = sqlite3.connect(database_path)
con_meta.row_factory = sqlite3.Row
# Create a cursor object to execute SQL queries
cursor = con_meta.cursor()
# Query to list all tables in the database
query = 'SELECT * FROM "iceberg_tables" ORDER BY "catalog_name", "table_namespace", "table_name";'
# Execute the query
cursor.execute(query)
# Fetch all results
results = cursor.fetchall()
# Convert results to list of dictionaries
table_list = []
for row in results:
row_dict = {key: row[key] for key in row.keys()}
table_list.append(row_dict)
# Close the connection
con_meta.close()
return table_list
Usage:
database_path = "/your/path"
# Retrieve the list of Iceberg tables
tables = get_iceberg_tables(database_path)
# Create metadata for each table
create_metadata_for_tables(tables)
# Print the final tables list
for table in tables:
print(table)
I can confirm that the issue persists on duckdb v1.0.0 (1f98600c2c) with the getting started examples from the Apache Iceberg docs using local minio. The file lives in the bucket warehouse with the full URI s3://warehouse/nyc/taxis/metadata/00002-fc696445-7a22-4653-bbca-fc95d070b71a.metadata.json, and I can confirm I can access the file using the AWS CLI with the same path.
here's what I did in duckdb cli:
INSTALL iceberg;
LOAD iceberg;
INSTALL httpfs;
LOAD httpfs;
CREATE SECRET secret1 (
TYPE S3,
KEY_ID 'key-here',
SECRET 'secret-here',
REGION 'us-east-1',
ENDPOINT '127.0.0.1:9000',
USE_SSL 'false'
);
SELECT * FROM iceberg_scan('s3://warehouse/nyc/taxis/metadata/00002-fc696445-7a22-4653-bbca-fc95d070b71a.metadata.json');
> HTTP Error: Unable to connect to URL "http://warehouse.minio.iceberg.orb.local:9000/nyc/taxis/metadata/00002-fc696445-7a22-4653-bbca-fc95d070b71a.metadata.json": 404 (Not Found)
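Note that the 404 above shows the request going to a virtual-hosted-style URL (http://warehouse.minio.iceberg.orb.local:9000/...). That part may be unrelated to the version-hint issue: MinIO generally needs path-style addressing, which the DuckDB S3 secret supports via URL_STYLE. A sketch reusing the same placeholder credentials:
CREATE OR REPLACE SECRET secret1 (
    TYPE S3,
    KEY_ID 'key-here',
    SECRET 'secret-here',
    REGION 'us-east-1',
    ENDPOINT '127.0.0.1:9000',
    URL_STYLE 'path', -- send requests to 127.0.0.1:9000/warehouse/... instead of warehouse.<host>/...
    USE_SSL 'false'
);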
PyIceberg (as of version 0.6) does not create these version-hint.text files when it creates iceberg tables. This makes it difficult to use iceberg with duckdb unless you're reading files created with Spark or another engine that writes those version-hint.text files. Being reliant on a Spark engine to do writes undermines one major advantage of duckdb: that it can do more with less. PyIceberg is a lightweight iceberg library that can be a great pairing with duckdb and this extension.
Not sure if true, but version-hint.text files aren't part of the standard spec, as noted in this issue.
For an example of a notebook creating an iceberg table that cannot be read by duckdb see here
I'd be interested in taking a stab at resolving this issue but would want to get some agreement on what the appropriate logic flow should be. As I see it, there are at least a few questions:
1) Would we want it to always check for a version-hint.txt, or allow it as an (on by default) option?
2) Would we want to allow for handling both the v0000#... and the 0000#... formats? Do we check for the v000#... version first and then for the non-prefixed version, or vice versa?
3) In the event that no version-hint.txt is present, what should the logic be to find the appropriate metadata file? Should we list them all out and select the one with the highest version #? If so, do we do that every time the table is read?
My suggestion: check for v0000# but have an argument for a different prefix style. Default to whatever the current default is (e.g. v0000#).
@lamb-russell I agree. I was thinking about this a bit more and thinking we'd do more-or-less what you proposed, but with a few adjustments:
- A table_version_format printf-style string that merges the prefix, table version, compression codec, and extension. It may be overkill, but it'd give more flexibility in the face of a feature with no standard. The current (and default) would look something like "metadata/v%s%s.metadata.json". I could see this eventually being configurable instead of being a constant in the future.
- For picking a version without a hint: highest 00### value or latest created/modified date? Maybe that's overkill, so I am not going to touch that for now.
FYI, I'll be working on this in https://github.com/duckdb/duckdb_iceberg/pull/63. It may take me a little bit to get my dev environment up and running, but I hope to wrap it up soon.
Suggestions for the possible solution (or changes in direction) welcome.
This is in a working state now. It essentially adds two parameters to the iceberg functions:
- version='...': allows you to explicitly specify a version number for a table, or a text/txt file that contains the name of a version (default: version='version-hint.text'; examples: version='12', version='my-custom-version-file.txt')
- version_name_format='...': allows you to specify a comma-delimited list of format strings that should be tried to find the metadata file name based on the supplied version (default: version_name_format='v%s%s.metadata.json,%s%s.metadata.json'; example: version_name_format='version-%smetadata.json%s')
As this PR doesn't make it "just work" yet, are there any other changes we'd want to consider to support the absence of the version hint?
It would be heavy-handed, but we could glob out all the metadata files and pick the largest version (or most recent)?
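To illustrate, a hedged sketch of how the two parameters can be combined (table path hypothetical, defaults as described above):
-- explicitly pin version 12 and keep the default name formats
SELECT count(*)
FROM iceberg_scan('warehouse/default.db/test',
                  version='12',
                  version_name_format='v%s%s.metadata.json,%s%s.metadata.json');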
@teaguesterling I just made a build from the latest main branch (including your changes) via make debug; however, when I try to use the version parameter I get the following error:
D SELECT count(*) FROM iceberg_scan("/home/ramon/Desktop/PROJECTS/personal/duckdb-iceberg-test/warehouse/default.db/test", version="00011-09891886-f8e1-403b-a5e9-a36cb11a41c0");
IO Error: Cannot open file "file:///home/ramon/Desktop/PROJECTS/personal/duckdb-iceberg-test/warehouse/default.db/test/metadata/snap-9095995686510048943-0-fdce8f11-65e4-4ae8-9c8e-e0f11cd48f2d.avro": No such file or directory
It definitely finds the metadata file, otherwise it couldn't find the link to the avro path.
The weird thing is, the avro file does actually exist. For example, when I cat it without the file:// prefix:
cat /home/ramon/Desktop/PROJECTS/personal/duckdb-iceberg-test/warehouse/default.db/test/metadata/snap-9095995686510048943-0-fdce8f11-65e4-4ae8-9c8e-e0f11cd48f2d.avro
Obj
snapshot-id&9095995686510048943$parent-snapshot-id$866746840554898102sequence-number20format-version2avro.schema�{"type": "record", "fields": [{"name": "manifest_path", "field-id": 500, "type": "string", "doc": "Location URI with FS scheme"}, {"name": "manifest_length", "field-id": 501, "type": "long
Also, to be certain that it is not some kind of OS-level permission issue, I put 777 on the exact avro file. Any idea what could be causing it? I made the iceberg table via pyiceberg with basically their default example from the quickstart.
I was able to reproduce this and also observed the same behavior when looking at the test suite files. I'm going to re-test on the original PR branch to see if I can trace it down. Oddly, it seems like the 1st filesystem action works and all subsequent actions fail.
Upon further investigation, the file:// paths are coming from the metadata files themselves. I was originally mistaken in thinking this was also happening with the test suite files. It does seem to be caused by subsequent filesystem actions, but only because the paths in the metadata files don't work with duckdb_iceberg's expectations.
I suspect what you're observing is actually a different issue: the way pyiceberg represents relative paths is incompatible with the way the duckdb_iceberg extension handles them. This should probably be handled as a separate issue. It seems that we probably need to handle the file:// prefix coming from the metadata files.
I don't know this part of the code well enough yet to make any other claims about it. I can show that it does seem to be related to paths with a few tests:
$ ./build/release/duckdb -c "FROM iceberg_metadata('test2/warehouse/default.db/taxi_dataset', version='00003-afdf5761-964a-4766-986d-c6d20f942df0')"
IO Error: Cannot open file "file://test2/warehouse/default.db/taxi_dataset/metadata/snap-4670253308400268764-0-71d373c0-e6b2-4eef-9904-c90c05e9c4c3.avro": No such file or directory
# This is the error reported above
$ grep -l file://test2/warehouse/default.db/taxi_dataset/metadata/snap-4670253308400268764-0-71d373c0-e6b2-4eef-9904-c90c05e9c4c3.avro test2/warehouse/default.db/taxi_dataset/metadata/*.json
test2/warehouse/default.db/taxi_dataset/metadata/00003-afdf5761-964a-4766-986d-c6d20f942df0.metadata.json
# We see that the file://...avro string is actually in the metadata.json file
$ sed -i "s|file://|$PWD/|g" test2/warehouse/default.db/taxi_dataset/metadata/00003-afdf5761-964a-4766-986d-c6d20f942df0.metadata.json
# We can remove the file:// prefix and make it a hard-coded absolute path in the JSON file
# Note that the same thing happens even if you make the path relative.
$ ./build/release/duckdb -c "FROM iceberg_metadata('test2/warehouse/default.db/taxi_dataset', version='00003-afdf5761-964a-4766-986d-c6d20f942df0')"
IO Error: Cannot open file "file://test2/warehouse/default.db/taxi_dataset/metadata/71d373c0-e6b2-4eef-9904-c90c05e9c4c3-m0.avro": No such file or directory
# We still get an error, but it's for a different file this time, not a `snap-....avro` file, but the `...-m0.avro` file
$ grep -l file://test2/warehouse/default.db/taxi_dataset/metadata/71d373c0-e6b2-4eef-9904-c90c05e9c4c3-m0.avro test2/warehouse/default.db/taxi_dataset/metadata/*
test2/warehouse/default.db/taxi_dataset/metadata/snap-4670253308400268764-0-71d373c0-e6b2-4eef-9904-c90c05e9c4c3.avro
# This path is actually hard-coded in the avro file, so naively changing it will cause corruption.
I suspect it's a bigger problem that pyiceberg is writing out the paths to files relative to the working directory (note the test2/... in the tests above) at creation time, though. All of this will break if the table ends up getting accessed from a different working directory.
At least glad to hear that you're able to reproduce it, then it's not just me!
That's interesting: in the example that I tested, the paths that pyiceberg wrote were absolute and not relative to the working directory. Apparently it depends on how you configure pyiceberg's uri and warehouse.
For example, with f"file://{Path(__file__).parent}/warehouse" it works fine with absolute paths, but with "file://./warehouse" it writes relative paths. Then I found out you can also just configure uri without the file:// prefix that they use in their example, and everything seems to work fine. Currently building the extension to see if I am able to load a new (py)iceberg table that uses absolute paths without the file:// prefix.
UPDATE: Yes you can.
D SELECT count(*) FROM iceberg_scan("/home/ramon/Desktop/PROJECTS/personal/duckdb-iceberg-test/warehouse/default.db/test", version="00002-49aec47d-117b-494a-a22e-6198661826b0");
┌──────────────┐
│ count_star() │
│ int64 │
├──────────────┤
│ 3066766 │
└──────────────┘
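For reference, a minimal sketch of the pyiceberg configuration this implies, based on the quickstart's SqlCatalog (the warehouse path is a hypothetical absolute path, with no file:// scheme):
from pyiceberg.catalog.sql import SqlCatalog

# hypothetical absolute path; note there is no file:// prefix on the warehouse
warehouse_path = "/home/ramon/duckdb-iceberg-test/warehouse"
catalog = SqlCatalog(
    "default",
    uri=f"sqlite:///{warehouse_path}/pyiceberg_catalog.db",
    warehouse=warehouse_path,
)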
As this PR doesn't make it "just work" yet, are there any other changes we'd want to consider to support the absence of the version hint?
It would be heavy handed, but we could glob out all the metadata files and pick the largest version (or most recent)?
I think this could be a potential solution, at least you don't have to provide the version flag yourself anymore. However I don't know the internals of iceberg well enough to say if this is a good solution.
That's great that it works without the file:// prefix! This would imply that we should probably add some more intelligent handling of schemes into the metadata loading. If file:// doesn't work, I don't know if other schemes would be expected to work either. It's probably not too hard to make this change, but it will touch a lot of places.
Neither do I know the iceberg internals well enough, but I'm not opposed to it. Perhaps something where version can also be provided as a pattern, in addition to a version or hint file?
Actually, I took another look at this and I think the issue is deeper than the iceberg extension. It seems duckdb won't work with the file:// prefix at all. It may be worth raising an issue upstream about this to create a subfilesystem for the "file://" protocol.
See this example:
from 'file://data/iceberg/generated_spec1_0_001/pyspark_iceberg_table/data/00000-5-bd694195-a731-4121-be17-0a6b13d4e9fb-00001.parquet';
IO Error: No files found that match the pattern "file://data/iceberg/generated_spec1_0_001/pyspark_iceberg_table/data/00000-5-bd694195-a731-4121-be17-0a6b13d4e9fb-00001.parquet"
Duckdb's delta_scan function currently handles the file:// prefix, as mentioned in https://github.com/duckdb/duckdb_iceberg/issues/38#issuecomment-2218300710. While I agree that supporting file:// natively in Duckdb would be the proper long-term fix, perhaps we can handle the prefix in this extension as a workaround.
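A minimal sketch of that workaround idea (names illustrative, not the extension's actual code): normalize locations read from metadata before handing them to the filesystem.
# illustrative only: strip the file:// scheme from metadata-supplied locations
def normalize_location(location: str) -> str:
    prefix = "file://"
    return location[len(prefix):] if location.startswith(prefix) else location

assert normalize_location("file:///tmp/warehouse/t/metadata/x.avro") == "/tmp/warehouse/t/metadata/x.avro"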
Has anyone faced this error?
SQL Error: java.sql.SQLException: Binder Error: Table "iceberg_scan_deletes" does not have a column named "file_path"
This is what my duckdb setup looks like:
INSTALL iceberg;
LOAD iceberg;
INSTALL httpfs;
LOAD httpfs;

SET s3_access_key_id='key';
SET s3_secret_access_key='secretKey';
SET s3_region='us-east-1';
SET s3_use_ssl=true;
SET s3_url_style='path';

SELECT * FROM iceberg_scan('s3://my-bucket/observation/metadata/00004-bc91e4be-ee63-4922-89eb-f7730dbbee82.metadata.json');
That error is coming from the binder (https://github.com/duckdb/duckdb/blob/main/src/planner/table_binding.cpp#L222-L225, https://duckdb.org/docs/internals/overview.html#binder); I don't think it's related to reading the iceberg table.
Yeah, I realized the Iceberg extension cannot read tables that have had updates or deletes applied; it can only read tables that have been written once. So MoR (merge-on-read) tables cannot be read.
@humaidkidwai I see, that's a good point. Do you mind opening a new issue on this repo to track that, and perhaps include your example above? It's another issue we should tackle.
Any solution for this?
Sad to see that while so many libraries have added support for Iceberg, duckdb has yet to fully support it due to this issue.
Since demand on this issue/feature seems to be increasing, I was wondering if this (globbing out the metadata files and picking the latest) could still be a good potential solution? I don't know the iceberg format well enough to be really opinionated about this. If it is, I am willing to put some time into this and hopefully come up with a PR that resolves the issue for scenarios without a version or hint file.
I took a look at the file:// protocol and convinced myself that it should be handled in duckdb directly. It's not that the iceberg extension doesn't handle it; it's that duckdb doesn't at all. The upside of that is that other protocols are likely not broken: S3 works for me.
The option to infer the latest version from file path patterns is something I've wanted to do for a while now. Maybe it's time.
I held off from including this in the version PR as I wasn't sure if it would actually make things worse/more confusing by not following the expectation of having a data catalog to provide that information. However, in the spirit of "just give me my data", maybe we should go for it.
Given the number of possible implementations, perhaps we should map out a spec on how this would work in this thread? I know the details of the extension well enough now that I would be happy to advise/help with the implementation.
I've taken some time this weekend to take a stab at finally finishing this. The new implementation adds one additional possible value you can provide to version that will try to guess the best table version. It also changes the default version from "version-hint.text" to "?", making this the default behavior. When version="?", the following steps will be followed:
1) Check to see if a version-hint.text file exists. If so, use it.
2) Create a glob pattern using the version_name_format and metadata_compression_codec values. Right now, this just uses "*" as the version, but future versions could be more thoughtful about this.
3) Glob the metadata directory to retrieve all matching metadata files.
4) (Naively) sort the metadata files and pick the last one. This hopes that the latest version has the largest version number.
If you had a directory with the following files:
warehouse/default.db/taxi_dataset/metadata/00000-17a6f9e5-48bb-4367-aa00-20190eb3c16e.metadata.json
warehouse/default.db/taxi_dataset/metadata/00001-a73e34f8-1bd7-4e72-a357-641a92541ade.metadata.json
warehouse/default.db/taxi_dataset/metadata/00002-fdfafee9-a5a0-4d0b-bc09-1caf708346ab.metadata.json
warehouse/default.db/taxi_dataset/metadata/00003-afdf5761-964a-4766-986d-c6d20f942df0.metadata.json
warehouse/default.db/taxi_dataset/metadata/637f7b78-0006-4bda-b76a-8d7715fdcecb-m0.avro
warehouse/default.db/taxi_dataset/metadata/71d373c0-e6b2-4eef-9904-c90c05e9c4c3-m0.avro
warehouse/default.db/taxi_dataset/metadata/d8d5ba7b-9db1-447b-bc39-d94274d5e3a0-m0.avro
warehouse/default.db/taxi_dataset/metadata/snap-4670253308400268764-0-71d373c0-e6b2-4eef-9904-c90c05e9c4c3.avro
warehouse/default.db/taxi_dataset/metadata/snap-8163938669155459790-0-d8d5ba7b-9db1-447b-bc39-d94274d5e3a0.avro
warehouse/default.db/taxi_dataset/metadata/snap-88534494938550919-0-637f7b78-0006-4bda-b76a-8d7715fdcecb.avro
Loading with iceberg_scan("warehouse/default.db/test", version="?", metadata_compression_codec="", version_name_format="v%s%s.metadata.json,%s%s.metadata.json") (the default values) will do the following:
1) Check for warehouse/default.db/taxi_dataset/metadata/version-hint.txt, which doesn't exist.
2) Glob the directory using the pattern "v*.metadata.json" (since the metadata_compression_codec is not "gzip").
3) No results are returned, so it will continue to the next pattern and glob for "*.metadata.json".
4) It will retrieve all .metadata.json files and pass them to the PickTableVersion function.
5) The table names will be sorted and the last one will be selected, resulting in table_version 00003-afdf5761-964a-4766-986d-c6d20f942df0 being selected.
Note that I haven't finished testing this yet or pushed it, but I have put the implementation in place. Once we finish testing, review, and documentation, it should be sufficient to close the issue, as no version-hint, version, or catalog will be required to load an iceberg table the way we'd expect to.
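For readers following along, a rough Python sketch of the guessing behavior described above (the extension implements this in C++; names here are illustrative only):
import glob
import os

def guess_metadata_file(metadata_dir,
                        name_formats=("v%s%s.metadata.json", "%s%s.metadata.json"),
                        compression_codec=""):
    # try each name format with a wildcard version, then naively pick the
    # lexicographically last metadata file, hoping the latest sorts last
    suffix = ".gz" if compression_codec == "gzip" else ""
    for fmt in name_formats:
        matches = sorted(glob.glob(os.path.join(metadata_dir, fmt % ("*", suffix))))
        if matches:
            return matches[-1]
    raise FileNotFoundError(f"no metadata files found in {metadata_dir}")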
Hi @teaguesterling,
Thanks again for putting in this effort; let me know if you want someone to test this functionality. I can build from source and test on a couple of different iceberg tables I have locally (some created via pyiceberg and therefore not having a version / version-hint file).
It would be nice to have some of the main contributors involved in this to get a good, opinionated view on this implementation.
Any chance one of you has time to take a look at this? @samansmink @carlopi
Happy to review a PR implementing this, I think it makes sense!
My s3 bucket with iceberg (picture below) cannot be queried with
iceberg_scan('s3://bucket/iceberg', ALLOW_MOVED_PATHS=true)
nor
iceberg_scan('s3://bucket/iceberg/*', ALLOW_MOVED_PATHS=true)
In particular the system is trying to find a very specific file (so the * pattern gets ignored):
duckdb.duckdb.Error: Invalid Error: HTTP Error: Unable to connect to URL https://bucket.s3.amazonaws.com/iceberg/metadata/version-hint.text
Unfortunately that file does not exist in my iceberg/ folder, nor in any of the iceberg/sub/metadata folders. Compared to the data zip in the duckdb docs about iceberg, it is clear "my iceberg tables" are missing that file, which is important for the current implementation.
That said, the version-hint seems like something we do not really need, as that info could default to a version or perhaps be an additional parameter (instead of failing if the file is not found)?
Original discussion with @Alex-Monahan in dbt Slack is here; note that I originally got pointed to this as a possible cause, so perhaps reading a table that is formally Iceberg is not really independent from the data catalog it belongs to?