deephaven / deephaven-core

Deephaven Community Core
Other
257 stars 80 forks source link

fix!: ensure Iceberg layouts own the SeekableChannelsProvider #6371

Closed devinrsmith closed 6 days ago

devinrsmith commented 1 week ago

This greatly improves the efficiency of Iceberg reading. Previously, it was creating a SeekableChannelsProvider per URI, and now only one is created once per layout (/ Table).

To aid in this update, objects that were previously created on-demand in IcebergBaseLayout are now created once upon construction. To enable this, it was noted that only the URI scheme is relevant for discrimination, and not actually the full URI to the data files. Thus, we can use the URI scheme as provided via org.apache.iceberg.Table#location to do any up-front loading.

The various interfaces that take a URI have been update to take a URI scheme instead. While this change could technically have been made in a non-breaking fashion by delegating existing URI methods to URI scheme methods, the existence of the URI methods encourages the wrong mental model and is easy to misuse, so they have been removed.

One of the ParquetTableLocationKey constructors has been deprecated, marked for removal. A more appropriate constructor has been added.

BREAKING CHANGE:

malhotrashivam commented 1 week ago

Changeset is fine, modulo incredibly minor quibbles. I guess the real question is, what do we give up? It seems to me like having access to the full URI is potentially more powerful, and that we're preventing "hybrid" Iceberg catalogs with more than one URI scheme in use (which is (1) something I kind of want to build, and (2) not something that would totally surprise me in the wild).

Good catch regarding URIs with different schemes, we can either keep a set of channel providers in the Base Layout class or, for now, add an assertion that all the location keys have same URI scheme as the table location.

devinrsmith commented 1 week ago

I've added an explicit URI check; it's a good point. That said, I'm almost certain that we don't support multiple URI schemes today given the pain points of configuration (not necessarily speaking of DH).