jacopotagliabue opened 7 months ago
Sorry, to be a bit clearer: even if we fix the version-hint
problem, the fact that the system is using https://bucket.s3.amazonaws.com/iceberg/metadata/
as a base path does not match the actual state of my data lake (see the picture above for the current layout, written by Spark + Nessie).
Happy to help debug this if there's something we can quickly try out.
I ran into a similar issue using AWS with Glue as the catalog for Iceberg.
The metadata files stored in S3 are of the following pattern:
00000-0b4430d2-fbee-4b0d-90c9-725f013d6f82.metadata.json
00001-6e3b4909-7e6b-486f-bf81-b1331eba3ac8.metadata.json
I suspect Glue holds the pointer to the current metadata.
It does.
You can see the current pointer in table properties if you call Glue’s DescribeTable
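To illustrate, a minimal sketch of reading that pointer with boto3, assuming AWS credentials are configured and a Glue-cataloged Iceberg table (the database/table names in the commented call are hypothetical); Glue stores the pointer as the `metadata_location` table property:

```python
# Sketch, assuming a Glue-cataloged Iceberg table (names hypothetical):
#
#   import boto3
#   glue = boto3.client("glue", region_name="us-east-1")
#   resp = glue.get_table(DatabaseName="default", Name="taxi")
#   print(current_metadata_location(resp))  # pass this to iceberg_scan()

def current_metadata_location(table_response: dict) -> str:
    """Extract the 'metadata_location' table property from a Glue
    GetTable response dict."""
    return table_response["Table"]["Parameters"]["metadata_location"]
```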
Currently no iceberg catalog implementations are available in the iceberg extension. Without a version hint you will need to pass the direct path to the correct metadata file manually, check: https://github.com/duckdb/duckdb_iceberg/pull/18
@samansmink thanks, but the work-around does not seem to work though: I manually get s3://bucet/iceberg/taxi_fhvhv_bbb/metadata/aaa.metadata.json
from my data catalog and pass it to my query:
SELECT PULocationID, DOLocationID, trip_miles, trip_time FROM iceberg_scan('s3://bucket/iceberg/taxi/metadata/aaa.metadata.json') WHERE pickup_datetime >= '2022-01-01T00:00:00-05:00' AND pickup_datetime < '2022-01-02T00:00:00-05:00'
I still get a 404 on the version file:
duckdb.duckdb.Error: Invalid Error: HTTP Error: Unable to connect to URL "....metadata.json/metadata/version-hint.text": 404 (Not Found)
as if it were trying to append metadata/version-hint.text
to my JSON path. Am I doing something dumb?
Small update - I needed to upgrade to 0.9.2 to scan a JSON file (posting here in case others stumble). The new error I get is No such file or directory
on a path the scan found:
"s3a://bucketiceberg/taxi_fhvhv/metadata/snap-aaaa.avro"
If I try with allow_moved_paths
(the only thing that came to mind), I then get:
duckdb.duckdb.InvalidInputException: Invalid Input Error: Enabling allow_moved_paths is not enabled for directly scanning metadata files.
Any way around all of this?
Small update 2 - I think I know why the avro path resolution does not work, just by looking closely at:
duckdb.duckdb.IOException: IO Error: Cannot open file "s3a://.......avro": No such file or directory
A Nessie file system (written with Spark) uses s3a://
as the prefix, not s3://
as duckdb presumably expects. In fact, if I manually change s3a://.......avro
into s3://.......avro
, I can find the file in my data lake!
A quick way to patch this would be to replace the Nessie prefix with the standard s3 one for object-storage paths (or a flag that somehow toggles that behavior, etc.). A longer-term fix would be for Nessie to return more general, non-Nessie-specific paths.
What do you think could be a short-term work-around @samansmink?
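In the meantime, a user-side stopgap for the paths you pass in yourself (e.g. the metadata.json argument; the avro paths embedded in manifests are resolved inside the extension, so this cannot reach them) could be a small prefix rewrite before handing the path to DuckDB. A sketch, with a hypothetical helper name:

```python
def normalize_s3a(path: str) -> str:
    """Rewrite Hadoop-style s3a:// URIs to the s3:// scheme that
    DuckDB's httpfs understands; other paths pass through unchanged."""
    if path.startswith("s3a://"):
        return "s3://" + path[len("s3a://"):]
    return path
```

This assumes (as discussed above) that s3a:// and s3:// URIs point at the same objects, which holds when both schemes address the same bucket/key layout.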
@jacopotagliabue s3a urls are indeed not supported currently.
If s3a:// urls are interoperable with s3 urls, which from a quick look seems to be the case, we could consider adding support to duckdb, which would solve this issue.
That would be great and the easiest fix - I'll reach out to the nessie folks anyway to let them know about this, but if you could make the change in duckdb that would (presumably?) solve the current issue.
For Java iceberg users out there, I found a solution to retrieve the latest metadata without having to query the catalog directly.
Once you load the table from the catalog, you can issue the following method that will return the latest metadata location. You can use that location with iceberg_scan function.
public static String currentMetadataLocation(org.apache.iceberg.Table table) {
return ((BaseTable) table).operations().current().metadataFileLocation();
}
I tested it on both Glue and Nessie.
It should make it somewhat easier, but I still hope there will be a cleaner solution in the extension later on
hi @harel-e, just making sure I understand.
If you pass the json you get back from a nessie endpoint using the standard API for the table, and then issue something like:
SELECT PULocationID, DOLocationID, trip_miles, trip_time FROM iceberg_scan('s3://bucket/iceberg/taxi/metadata/aaa.metadata.json') WHERE pickup_datetime >= '2022-01-01T00:00:00-05:00' AND pickup_datetime < '2022-01-02T00:00:00-05:00'
you are able to get duckdb iceberg working?
Yes, DuckDB 0.9.2 with Iceberg is working for me on the following setups:
a. AWS S3 + AWS Glue
b. MinIO + Nessie
I was able to get this working by looking up the current metadata URL using the Glue API/CLI, then using that URL to query iceberg.
select count(*) from iceberg_scan('s3://cfanalytics-abc123/cloudfront_logs_analytics/metadata/abf3a652-02cb-4a8e-8b6c-2089a2acfe6c.metadata.json');
Works for me at the moment.
This appears to also be an issue with iceberg tables created using the Iceberg quick start at https://iceberg.apache.org/spark-quickstart/#docker-compose (using duckdb 0.10.0)
There are a few other oddities and observations:
- Given a version-hint.text file pointing to one of the existing metadata.json files, the iceberg scanner ends up looking for a file prefixed with a "v" (e.g., 00000-d30b41d6-48c0-42db-b32e-29083b874a80 in version-hint.text looks for v00000-d30b41d6-48c0-42db-b32e-29083b874a80.metadata.json, but only 00000-d30b41d6-48c0-42db-b32e-29083b874a80.metadata.json exists in the directory).
- If you copy the .metadata.json to the expected v....metadata.json path, everything works as expected.
- If the scanner ends up reading a .metadata.json file that is actually a binary minio lx.meta file (as happened to me), you can crash DuckDB with a segfault, which may be more of a security risk than anything else.
- If version-hint.text contains invalid characters for a path (e.g., a trailing newline), they will be directly included in the requested ...metadata.json path.
The prefixing of the "v" when looking for the .metadata.json seems to be the most burdensome, as it's not terribly difficult to maintain a version-hint.text file, but it would be difficult to rename versions.
Confirming that
SELECT PULocationID, DOLocationID, trip_miles, trip_time FROM iceberg_scan('s3://bucket/iceberg/taxi/metadata/aaa.metadata.json') WHERE pickup_datetime >= '2022-01-01T00:00:00-05:00' AND pickup_datetime < '2022-01-02T00:00:00-05:00'
still does not work with a Dremio-created table on a Nessie catalog.
The error is:
duckdb.duckdb.Error: Invalid Error: HTTP Error: Unable to connect to URL "https://bauplan-openlake-db87a23.s3.amazonaws.com/iceberg/taxi_fhvhv_partitioned/metadata/00000-136374fe-87d3-4cc6-8202-0a11f6af0b56.metadata.json/metadata/version-hint.text": 404 (Not Found)
Any chance we could make the version hint optional, given that it is not part of the official Iceberg spec and many implementations seem to ignore it?
Can confirm that this still does not work for iceberg tables created with catalog.create_table().
query: f"SELECT * FROM iceberg_scan('{lakehouse_path}') WHERE id = {mock_team_id}"
error: duckdb.duckdb.HTTPException: HTTP Error: Unable to connect to URL "https://local-lakehousesta-locallakehousebuck-mnrnr57ascjc.s3.amazonaws.com/metadata/version-hint.text": 404 (Not Found)
lakehouse_catalog = load_catalog(
    "glue",
    **{"type": "glue", "s3.region": "us-east-1"},
)
team_table = lakehouse_catalog.load_table("default.Team")
changed_team_record = conn.sql(
    f"SELECT * FROM iceberg_scan('{team_table.metadata_location}') WHERE id = {mock_team_id}"
).to_df()
I'm using this work-around for a SQLite catalog:
import shutil
import os
import sqlite3
def create_metadata_for_tables(tables):
"""
Iterate through all tables and create metadata files.
Parameters:
tables (list): A list of dictionaries, each representing an Iceberg table with a 'metadata_location'.
"""
for table in tables:
metadata_location = table['metadata_location'].replace('file://', '')
metadata_dir = os.path.dirname(metadata_location)
new_metadata_file = os.path.join(metadata_dir, 'v1.metadata.json')
version_hint_file = os.path.join(metadata_dir, 'version-hint.text')
# Ensure the metadata directory exists
os.makedirs(metadata_dir, exist_ok=True)
# Copy the metadata file to v1.metadata.json
shutil.copy(metadata_location, new_metadata_file)
print(f"Copied metadata file to {new_metadata_file}")
# Create the version-hint.text file with content "1"
with open(version_hint_file, 'w') as f:
f.write('1')
print(f"Created {version_hint_file} with content '1'")
def get_iceberg_tables(database_path):
"""
Connect to the SQLite database and retrieve the list of Iceberg tables.
Parameters:
database_path (str): The path to the SQLite database file.
Returns:
list: A list of dictionaries, each representing an Iceberg table.
"""
# Connect to the SQLite database
con_meta = sqlite3.connect(database_path)
con_meta.row_factory = sqlite3.Row
# Create a cursor object to execute SQL queries
cursor = con_meta.cursor()
# Query to list all tables in the database
query = 'SELECT * FROM "iceberg_tables" ORDER BY "catalog_name", "table_namespace", "table_name";'
# Execute the query
cursor.execute(query)
# Fetch all results
results = cursor.fetchall()
# Convert results to list of dictionaries
table_list = []
for row in results:
row_dict = {key: row[key] for key in row.keys()}
table_list.append(row_dict)
# Close the connection
con_meta.close()
return table_list
Usage:
database_path = "/your/path"
# Retrieve the list of Iceberg tables
tables = get_iceberg_tables(database_path)
# Create metadata for each table
create_metadata_for_tables(tables)
# Print the final tables list
for table in tables:
print(table)
I can confirm that the issue persists on duckdb v1.0.0 (1f98600c2c)
with the getting-started examples from the Apache Iceberg docs using local MinIO. The file lives in the bucket warehouse
with the full URI s3://warehouse/nyc/taxis/metadata/00002-fc696445-7a22-4653-bbca-fc95d070b71a.metadata.json
, and I can confirm I can access the file using the AWS CLI with the same path.
here's what I did in duckdb cli:
INSTALL iceberg;
LOAD iceberg;
INSTALL httpfs;
LOAD httpfs;
CREATE SECRET secret1 (
TYPE S3,
KEY_ID 'key-here',
SECRET 'secret-here',
REGION 'us-east-1',
ENDPOINT '127.0.0.1:9000',
USE_SSL 'false'
);
SELECT * FROM iceberg_scan('s3://warehouse/nyc/taxis/metadata/00002-fc696445-7a22-4653-bbca-fc95d070b71a.metadata.json');
> HTTP Error: Unable to connect to URL "http://warehouse.minio.iceberg.orb.local:9000/nyc/taxis/metadata/00002-fc696445-7a22-4653-bbca-fc95d070b71a.metadata.json": 404 (Not Found)
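One thing to note about the 404 above: DuckDB built a virtual-hosted-style URL (the bucket name became a subdomain of the endpoint), which MinIO setups often do not serve. Forcing path-style requests in the secret may help here; URL_STYLE is the DuckDB S3 secret option for this, though whether it resolves this particular setup is an assumption:

```sql
CREATE SECRET secret1 (
    TYPE S3,
    KEY_ID 'key-here',
    SECRET 'secret-here',
    REGION 'us-east-1',
    ENDPOINT '127.0.0.1:9000',
    URL_STYLE 'path',   -- request http://endpoint/bucket/... instead of http://bucket.endpoint/...
    USE_SSL 'false'
);
```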
My s3 bucket with iceberg (picture below) cannot be queried with
iceberg_scan('s3://bucket/iceberg', ALLOW_MOVED_PATHS=true)
nor
iceberg_scan('s3://bucket/iceberg/*', ALLOW_MOVED_PATHS=true)
In particular the system is trying to find a very specific file (so the * pattern gets ignored):
duckdb.duckdb.Error: Invalid Error: HTTP Error: Unable to connect to URL https://bucket.s3.amazonaws.com/iceberg/metadata/version-hint.text
Unfortunately that file does not exist in my iceberg/ folder, nor in any of the iceberg/sub/metadata folders. Compared to the data zip in the duckdb docs about iceberg, it is clear "my iceberg tables" are missing that file, which the current implementation requires.
That said, version-hint seems like something we do not really need: that info could default to a version or perhaps be an additional parameter (instead of failing when the file is not found)?
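To make the "default to a version" idea concrete, a client-side sketch: resolve the newest metadata file yourself by its numeric sequence prefix and pass that to iceberg_scan. This assumes the common NNNNN-<uuid>.metadata.json naming, and the helper name is hypothetical:

```python
import re

def latest_metadata(filenames):
    """Return the metadata file with the highest sequence number,
    or None if nothing matches the NNNNN-....metadata.json pattern."""
    pattern = re.compile(r"^(\d+)-.*\.metadata\.json$")
    best, best_seq = None, -1
    for name in filenames:
        m = pattern.match(name)
        if m and int(m.group(1)) > best_seq:
            best, best_seq = name, int(m.group(1))
    return best
```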
Original discussion with @Alex-Monahan in dbt Slack is here: note that I originally got pointed to this as a possible cause, so perhaps reading a table that is formally Iceberg is not really independent of the data catalog it belongs to?