Can't get AWS Athena Iceberg metadata because there's no `metadata/version-hint.text` file

sacundim commented 11 months ago

I went ahead and gave this still-very-early extension a try, using my own Iceberg data set that I build with AWS Athena. No dice, I get this:

D SELECT *  FROM iceberg_metadata('s3://my-bucket/path/to/table/folder');
Error: Invalid Error: HTTP Error: Unable to connect to URL "https://my-bucket.s3.amazonaws.com/path/to/table/folder/metadata/version-hint.text": 404 (Not Found)

Looking at the current source code I see that it comes from here.

Searching the web a bit for relevant info I find this passage in this article:

Iceberg catalog

Within the catalog, there is a reference or pointer for each table to that table’s current metadata file. For example, in the diagram shown above, there are 2 metadata files. The value for the table’s current metadata pointer in the catalog is the location of the metadata file on the right.

What this data looks like is dependent on what Iceberg catalog is being used. A few examples:

With HDFS as the catalog, there’s a file called version-hint.text in the table’s metadata folder whose contents is the version number of the current metadata file.

With Hive metastore as the catalog, the table entry in the metastore has a table property which stores the location of the current metadata file.

With Nessie as the catalog, Nessie stores the location of the current metadata file for the table.

Obviously Athena is somehow storing this metadata in the AWS Glue Catalog (perhaps accessible with the same protocol as a Hive metastore?), and the extension doesn't integrate that yet.

I do realize this is an extremely early stage for the extension, just holler in this ticket when you think it might work (or just if you're puzzled whether it might) and I'm happy to test this out some other time.

samansmink commented 11 months ago

hey @sacundim thanks for trying out the extension! Indeed currently only the version-hint.txt method is supported, and we should definitely look into the others for this extensions to be practical

alanpaulkwan commented 10 months ago

I have this exact same issue, and I wrote mine via StarRocks + REST. They only write .json and .avro files.

ccollingwoodAutodesk commented 10 months ago

There's a related discussion going on about PyIceberg reading static metadata files over in the Iceberg project.

https://github.com/apache/iceberg/issues/7979 also links to https://github.com/apache/iceberg/issues/1496

Just to restate the problem...

Some services that manage Iceberg tables do not emit a metadata/version-hint.text file to track the latest version of the Iceberg table so that the latest manifest file path can be constructed for static table reads.

For example, Iceberg tables managed by AWS Athena/Glue do not have this metadata/version-hint.text available.

[!IMPORTANT] It is important to note that the metadata/version-hint.text file does not seem to be part of the Iceberg spec and as a result no guarantees are made about if, how or when an Iceberg manager service emits this helper file.

Mentioned across both posts are ideas to tackle this issue, and why they are good ideas or not. Everything mentioned below is related to static table access (metadata file), and not relevant when using the Iceberg manager's APIs where table state and versions are managed by the Iceberg service.

Instrument a way for the query-engine to LIST the contents of the Iceberg table's /metadata/ prefix and use the latest manifest file matching the *.metadata.json pattern.

Unsure if this is a deterministic way to start interacting with the table. The manifest might be written to object storage before the table changes are ready. The Iceberg spec will probably indicate whether assuming the presence of the latest manifest file guarantees the integrity of the metadata therein.

Object storage LIST operations could be expensive (not to mention just another operation adding additional overhead to the query) especially if the Iceberg table experiences lots of churn and is configured to retain lots of versions or time-travel over large periods of time.

Every query interaction will probably require a LIST operation especially if the Iceberg table has no time-travel, low version retention, frequent cleanup and lots of churn. A manifest that was used successfully in one request may not be there for the next request. Unsure if there is any caching in this extension anyways.
Allow the Iceberg table functions to take a specific manifest file key as an argument.

The onus is then on the caller to ensure that they're using the appropriate (latest) manifest file when interacting with the Iceberg table. Becomes a little more difficult to create database/schema objects like views or prepared statements that might use Iceberg table functions.

Personally I think that if a specific manifest file key or version is provided to the Iceberg table functions it should be respected. A couple solutions or ideas are:

Function option flag (metadata_file=true) to indicate that a specific file key has been provided.
Or invert the default logic so that the hint indicates how to assume version (hint=true).
Dedicated table function (ICEBERG_METADATA_FILE(...) or ICEBERG_METADATA_VERSION(...))
Function argument for version (version='E84-274-C02') or metadata file key (metadata_file='E84-274-C02.metadata.json').

Likewise, if doing a LIST on the metadata prefix is an acceptable idea or desired then perhaps

Provide a function argument for listing/guessing the latest version (assume_latest=true).
Additionally providing a list limit argument to prevent unlikely, but possible runaway S3 costs (list_limit=100). Remember, each interaction with the Iceberg table function could in theory LIST the metadata prefix every time.

Or maybe there should be some kind version strategy argument such as version_strategy='list|version|key?

Additional, much like the filename=true seen elsewhere in DuckDB read-file table functions, it might be a good idea to provide a function option (return_version=true) so that the version of the Iceberg table is returned along with the results used in a query so that if callers are running a series of interrelated queries against the table, they're doing so with the same underlying data (version). Theoretically, the table's data could change between queries yielding divergent results.

None of the above should be taken as criticism. I appreciate and respect all the hard work on this iceberg extension. And if I knew C++ or had time, I would dive in and start contributing myself.

ccollingwoodAutodesk commented 10 months ago

Reading through the code, I just noticed that it's assumed that version is numeric. Also not the case with AWS Athena/Glue.

erikcw commented 9 months ago

Reading through the code, I just noticed that it's assumed that version is numeric. Also not the case with AWS Athena/Glue.

Just adding an example to illustrate. My table was created using Trino (which is the same system that underlies AWS Athena), and the manifests on S3 are in the form: s3://bucket/path/metadata/00002-9df20621-4ab7-4ce3-a3b9-c7e72f4e7447.metadata.json.

Adding a version-hint.txt containing 00002-9df20621-4ab7-4ce3-a3b9-c7e72f4e7447 results in the following error from duckdb:

In [58]: duckdb.query(f"SELECT * FROM iceberg_snapshots('s3://BUCKET/path/table')")
Out[58]: ---------------------------------------------------------------------------
InvalidInputException                     Traceback (most recent call last)
File ~/.pyenv/versions/ret/lib/python3.11/site-packages/IPython/core/formatters.py:708, in PlainTextFormatter.__call__(self, obj)
    701 stream = StringIO()
    702 printer = pretty.RepresentationPrinter(stream, self.verbose,
    703     self.max_width, self.newline,
    704     max_seq_length=self.max_seq_length,
    705     singleton_pprinters=self.singleton_printers,
    706     type_pprinters=self.type_printers,
    707     deferred_pprinters=self.deferred_printers)
--> 708 printer.pretty(obj)
    709 printer.flush()
    710 return stream.getvalue()

File ~/.pyenv/versions/ret/lib/python3.11/site-packages/IPython/lib/pretty.py:410, in RepresentationPrinter.pretty(self, obj)
    407                         return meth(obj, self, cycle)
    408                 if cls is not object \
    409                         and callable(cls.__dict__.get('__repr__')):
--> 410                     return _repr_pprint(obj, self, cycle)
    412     return _default_pprint(obj, self, cycle)
    413 finally:

File ~/.pyenv/versions/ret/lib/python3.11/site-packages/IPython/lib/pretty.py:778, in _repr_pprint(obj, p, cycle)
    776 """A pprint that just redirects to the normal repr function."""
    777 # Find newlines and replace them with p.break_()
--> 778 output = repr(obj)
    779 lines = output.splitlines()
    780 with p.group():

InvalidInputException: Invalid Input Error: Attempting to execute an unsuccessful or closed pending query result
Error: Invalid Error: HTTP Error: Unable to connect to URL "https://BUCKET.s3.amazonaws.com/path/table/metadata/v2.metadata.json": 404 (Not Found)

xerial commented 9 months ago

This problem is not only present in AWS Athena, but also in Iceberg tables created with JDBCCatalog https://iceberg.apache.org/docs/latest/jdbc/. Usually, these table catalogs do not generate a version-hint.text file.

dioptre commented 9 months ago

100% if we provide a metadata.json - it should just use it.

samansmink commented 9 months ago

Thanks for the discussion here everyone! I have a simple PR up adding support for the workaround to directly scan the metadata file as suggested here.

Next step should be to start adding support for various catalogs. I think we can skip any attempts to read the table statically through S3 listing for now, that seems a bit wasteful on resources indeed and maybe only distracts from the fact that we just need to add support for catalogs.

I'll make sure this makes it into the upcoming bugfix release of DuckDB which is currently scheduled for next week.

ccollingwoodAutodesk commented 9 months ago

Thanks @samansmink. That totally makes sense from a resources/priorities perspective.

I'd just like to add that for my own personal use case, being able to use the manifest (static table) is sometimes quite useful, especially during the development/debugging phase of some of our work. Unsure how others feel.

I recognize that reading from the catalog is far more important and should have way more priority and allocated resources than my tenuous use case of reading a static table. I just wanted to say that it is useful sometimes.

samansmink commented 9 months ago

@ccollingwoodAutodesk That's a good point, we can definitely consider adding it at some point.

With the addition of directly passing a metadata version file, I will close this issue for now in favor of the currently opened issues asking for specific catalog support (currently REST and AWS Glue). Feel free to request others if you think they also deserve priority.

duckdb / duckdb_iceberg

Can't get AWS Athena Iceberg metadata because there's no `metadata/version-hint.text` file #10

Iceberg catalog