apache / iceberg

Apache Iceberg
https://iceberg.apache.org/
Apache License 2.0
6.18k stars 2.15k forks source link

Support relative paths in Table Metadata #1617

Open pavibhai opened 3 years ago

pavibhai commented 3 years ago

Background

Iceberg specification captures file references for the following:

All the file references are absolute paths right now.

Challenge

Table copy to another location could arise in the following use cases:

Absolute file references require the file references to reflect the new target location of the table before the table can be consumed.

Solution Option

We could support relative paths as follows:

We might further consider splitting the Table metadata into two pieces:

This will ensure the following:

jackye1995 commented 3 years ago

I remember we discussed this use case some time ago in a sync up meeting, and the related issue was #1531. The general feedback was that it is better to expose this as a Spark procedure, and rewrite the manifests with new file path URIs to reflect new location after replication or backup. This avoids the trouble of redefining the Iceberg spec. Any thoughts?

rdblue commented 3 years ago

I don't think it is a good idea in general to use relative paths. We recently had an issue where using a hdfs location without authority caused a user's data to be deleted by the RemoveOrphanFiles action because the resolution of the table root changed. The main problem is that places in Iceberg would need to have some idea of "equivalent" paths and path resolution. Full URIs are much easier to work with and more reliable.

But there is still a way to do both. Catalogs and tables can inject their own FileIO implementation, which is what is used to open files. That can do any resolution that you want based on environment. So you could use an implementation that allows you to override a portion of the file URI and read it from a different underlying location. I think that works better overall because there are no mistakes about equivalent URIs, but you can still read a table copy without rewriting the metadata.

pavibhai commented 3 years ago

@jackye1995 and @rdblue Thanks for your feedback.

Our current thought process is as follows:

  1. Only the Table Path be an absolute path, with all other paths in the metadata files being relative paths
  2. The relative path should not be visible outside of the metadata component i.e. the relative path should be translated to an absolute path using the table path for other areas of Iceberg to consume, so only the metadata files shall store the relative path.
  3. The relative path references shall be confined to be within the table path i.e ../ shall not be allowed.

I don't think the scenario called out about a wrong HDFS location being cleaned will be worsened or improved as a result of this change. If the absolute path of the table is incorrectly interpreted to be something else then problems shall still happen.

We will explore the option of FileIO in parallel while we discuss further the complications of using relative paths.

omalley commented 3 years ago

Relative paths would help us too.

We certainly need to have support for a federated namespace with a virtual file system. We're moving to HDFS federation and moving our tables to be located in a virtual filesystem (eg: gridfs://cluster1/table ) so that we can move the data around without changing all of the metadata. The challenge here is that we need to make sure that the manifests don't contain the physical hdfs path, which is likely to break.

For data access control, it seems like a really good invariant that all data is under the table's path.

rdblue commented 3 years ago

I'll add this as a discussion topic for the next Iceberg sync.

anuragmantri commented 3 years ago

We've put together a design doc for this proposal. We thought through a few scenarios and added an implementation. We will be glad to consider other ideas as well.

jackye1995 commented 3 years ago

Thanks for the design doc! I wonder if this can be done through changes in Catalog and FileIO without touching the Iceberg specification, which can be much simpler to ship without conflicting with already quite a few changes for the format v2 specifications.

The key difficulty I see is that table property is a part of the table metadata file. Therefore we cannot know if a table should use any special file path override without actually reading that file. If we have a way to know it without using FileIO, then we can decide for a specific table, all files should be read through a region-specific client for replication, or a different URI scheme for backup and HDFS federation.

So here is my alternative proposal:

  1. add interface TablePropertyIO that is dedicated to read and write table properties.
  2. update catalog implementations (mostly in BaseMetastoreCatalog) to use TablePropertyIO when necessary, by adding it as a part of the TableOperations interface similar to FileIO io().
  3. add new table property table-io-impl, that can be customized to do things like replicaiton, backup, etc.
  4. implement those specific table FileIO classes.

Could this proposal satisfy your requirements?

anuragmantri commented 3 years ago

Thanks for your comments @jackye1995. We discussed your proposal and here are our thoughts.

With our approach,

For the reasons above, we prefer the approach laid out in the design doc. We understand the concerns around Iceberg spec with v2 format already changing a lot. We are open to ideas on shipping this with the least interference (even if it means we wait till v2 gets out).

Please let us know your thoughts.

rdblue commented 3 years ago

I read the doc through the Proposal and Approach sections and I am quite confused about what your proposal is. Could you clarify it a bit? It may be obvious from the following example sections what the changes are and why they are needed, but I think that information should be in the proposal/approach.

anuragmantri commented 3 years ago

Thanks for the initial feedback @rdblue. I rephrased the proposal and approach sections. Could you please take another look at the document?

Thanks, Anurag

aokolnychyi commented 3 years ago

@flyrain, could you take a look at what can be done here?

anuragmantri commented 3 years ago

@flyrain, in my latest comment to Ryan's comment in the design doc, I proposed starting with "supporting relative paths in metadata" which does not need changes to the spec. We can then look at refactoring the fields of metadata.json file. This will make it easier to implement and review the change. What are your thoughts?

flyrain commented 3 years ago

Thanks @anuragmantri for filing the PR and updating the design doc. It is great that there is no spec change in the latest design.

rdblue commented 3 years ago

I looked through the design doc. It seems to be much better than the last version, so thank you for the update.

Overall, I think it is still incorrect on a few points and is quite a bit longer than it will need to be in the end. The main confusion seems to be where the table location comes from. Iceberg table metadata tracks a location string that is written into all metadata.json files. This is the table's location. Iceberg does not require any location other than this one and there isn't a way to pass a different location in the TableOperations API; the only way to pass a table location is through TableMetadata.

There are also a few of table properties that can affect locations:

There are two broad classes of tables:

  1. Hadoop tables identify the table itself by location and enforce that the location in TableMetadata is identical to the location used to identify the table when it is created. For Hadoop tables, it makes sense that the table location used for relative paths is the location that gets passed to create HadoopTableOperations. But then there could be a problem that table metadata actually points to a different location unless it is modified.
  2. Metastore tables do not have a location that is external to table metadata. However, when a metastore table is tracked by Hive, there is a location that Hive tracks and we set that to the table's location.

For table metadata files, there is only an external location for Hadoop tables. Otherwise, the table location must come from the table's metadata.json file. That makes it difficult to make metadata_location and previous_metadata_location relative paths. I would probably not attempt to make these paths relative, but I'm open to ideas.

We could update the spec to allow relative metadata.json paths for Hadoop tables, but will need to state how to handle a different location in the metadata file. (Ignore?)

Metadata locations aside, I think that it is a good plan to make the relative paths transparent. Most of the library should continue to operate on absolute paths and relative paths should be made while writing and made absolute while reading. That makes it so the changes are fairly easily tested. If you do this, then I think there is a lot less to cover in the design doc.

rdblue commented 3 years ago

@flyrain and @anuragmantri, see my comment above. Thanks for working on this!

pvary commented 3 years ago

@rdblue: We would like to use relative path for replication of the Iceberg tables between on-prem Hive instances, or moving on-prem Iceberg tables to the cloud and sometimes even moving data from the cloud to on-prem.

While we can do this by rewriting the manifest and snapshot files, it would be much easier to synchronize only the files themselves and update the metadata_location / previous_metadata_location for the target table and be done with it.

Because of the use-cases above it would be useful to have relative paths in the Iceberg metadata for Metastore tables as well, and the absolute path could be generated using the Metastore table location. Maybe even the metadata_location / previous_metadata_location could be generated this way, so the tables on both end of the replication could have the same data and metadata too.

rdblue commented 3 years ago

@pvary, we can explore that, but it is fairly easy to write a new metadata file and use that for a table. At least that's a small operation, compared to rewriting all metadata and position delete files. My main point is that we need to have a well-documented plan for tracking the table location.

anuragmantri commented 3 years ago

Thanks @rdblue for explaining how locations are tracked in Iceberg tables. Based on your and @pvary's comments, I updated the design doc to include

What are your thoughts?

flyrain commented 3 years ago

For table metadata files, there is only an external location for Hadoop tables. Otherwise, the table location must come from the table's metadata.json file. That makes it difficult to make metadata_location and previous_metadata_location relative paths. I would probably not attempt to make these paths relative, but I'm open to ideas.

Hi @rdblue, we might leave metadata_location and previous_metadata_location as is, they are table properties in Metastore like HMS, are not affected when we just move files from the source table to the target. They are absolute path, but they still point to the right place without any change.

pvary commented 3 years ago

@pvary, we can explore that, but it is fairly easy to write a new metadata file and use that for a table. At least that's a small operation, compared to rewriting all metadata and position delete files.

@rdblue: I am not sure that I understand what are you proposing here. I might be mistaken, but AFAIK we store the locations of the files in the snapshot file, in the manifest files and also in v2 the position delete files as well. If we want to create a usable full copy of the table then currently we need to do the following:

Is there an easier way to create a full working replica of an Iceberg table where we do not use any files/data from the original table and the 2 tables (original and the new) can live independently after the creation of the replica?

My main point is that we need to have a well-documented plan for tracking the table location.

Totally agree with this 👍

Thanks, Peter

rdblue commented 3 years ago

@pvary, I'm talking only about the metadata.json files. I agree that using relative paths for manifest lists, manifests, and data files is a good idea. But we keep the table location in the root metadata file and so it is difficult to make its location relative.

I think we could do this for Hadoop tables by ignoring the location in metadata.json because it must match the table location. But Hadoop tables are mostly for testing and I don't recommend using them in practice. The bigger issue is how to track the table location for Hive or other metastore tables. In that case, I don't think it is unreasonable to use updateLocation to change the location of the table, but it is definitely inconvenient.

flyrain commented 3 years ago

Is there an easier way to create a full working replica of an Iceberg table where we do not use any files/data from the original table and the 2 tables (original and the new) can live independently after the creation of the replica?

@pvary , ideally, table replication doesn't involve data file rewrite and metadata(manifest-list, manifest, metadata.json) rewrite. The process would be as simple as that user copys all files needed, then changes the target table properties to get the new status. It isn't the case in reality though.

In this issue thread, we were talking about two ways to replicate a table. 1. relative path 2. rebuild the metadata files. Neither of them require data file rewrite. However, the relative-path approach requires the minimal metadata file rewrite, probably only metadata.json per our discussion. But metadata-rebuild approach involves rewrite of all three type of metadata files. They are metadata.json, manifest-list, and manifest. Every type of file stores table information cannot be recreated only by looking at the data files. For example, the partition spec in metadata.json and its id in manifest file, and the snapshot relative metadata.

To your question, both source and target tables should be able to live independently after the replication. That's relative easy to archive. The hard part is to enable incremental sync-up between them and bidirectional replication, which are quite common DR(Disaster recovery) use cases.

pvary commented 3 years ago

Thanks for the detailed answer @flyrain! This really helps to have a clear understanding of the tasks at hand.

I would like to share my thoughts about 2 points:

The hard part is to enable incremental sync-up between them and bidirectional replication, which are quite common DR(Disaster recovery) use cases.

I think that if we chose the relative path approach then replication becomes quite straightforward, since we do not have to handle the mix of different paths (source absolute path for the new metadata files, destination absolute path for the old metadata files). We just need to copy the metadata files for one directional replication. The bidirectional replication is a different kettle of fish because of the commit resolution complexity, but I think it is also easier since we do not have to care about manifests and manifest-lists

However, the relative-path approach requires the minimal metadata file rewrite, probably only metadata.json per our discussion.

What do we want to change in the json? Is it only the path of the table, or we have to rewrite something else as well? Could we use something like the LocationProvider (which generates new datafile locations) to make the path resolution pluggable and store only the config in the table?

Thanks, Peter

jackye1995 commented 3 years ago

Just trying to catch up with the thread, and reread the design doc.

For the design doc, I think we should not make HiveTable a separated case to discuss, all tables that are retrieved through a Catalog implementation should operate the same if we introduce relative path, we should not try to break the abstraction layer and treat Hive tables separately.

To have a brief summary (and also potentially answer Peter's question), from my understanding we are leading towards a solution that following items to keep track of the true table roots, and should be absolute, while all other paths can be relative:

  1. location
  2. write.metadata.path
  3. write.folder-storage.path
  4. write.object-storage.path

Then these 4 properties are used to get 2 absolute root paths:

  1. root path to write data: the LocationProvider would read some fields in this list to determine the base root path, and generate absolute paths to write data files.
  2. root path to write metadata: similarly, the TableOperations.metadataFileLocation reads some fields above to determine the base root path, and generate paths for metadata files.

These probably should be exposed as new methods to make sure the same location derivation logic is used everywhere.

Once the content of a file is written, the paths are written to one metadata layer above with/without absolute path based on a flag proposed in the design doc, and the writer of this path needs to know what are the true root paths used in the 2 places above. On the read side, the absolute path is created based on the 2 absolute root paths. Those operations are done inside each specific IO method by passing the relative/absolute path, the correct absolute root path, and the boolean flag, to make sure operations are transparent above the IO level.

All updates to the 4 locations on the top can be done in a UpdateProperties + UpdateLocation transaction, and when the metadata path is updated, the metadata_location in catalog will be automatically updated with the new file root path to write and then read the latest metadata json file that is now in anther root path. Therefore, we need to make sure that when this value is updated, the metadata json file writer uses the value in the update instead of the existing value. This should be able to switch a reader of Iceberg table from one replica to another explicitly, or implement try-and-error logic to switch to the replica after certain retries. I am not sure if there is a use case such as choosing a random replica to read file, but I think that should be solved by directly registering all replicas as different tables and round-robin on the reader side.

I think if we go with this route then it sounds good to me, and we can try to hash out some more details in PRs. Please let me know if this summary is accurate, or if I have any misunderstanding.

pvary commented 3 years ago

Thanks @jackye1995 for highlighting the properties which we need to handle!

I have not know about the write.object-storage.path before. Do I understand correctly that this is a LocationProvider specific parameter for ObjectStoreLocationProvider which is more or less the same than the write.object-storage.path for the DefaultLocationProvider? If so it highlights how tricky could it be if we want to make sure that every LocationProvider specific parameter is handled correctly when we are updating the locations in the metadata json. Maybe we should delegate this path generation to the LocationProvider altogether, or something similar.

All updates to the 4 locations on the top can be done in a UpdateProperties + UpdateLocation transaction

The process you have described above sounds useful to me if we have a place where both of the replicas are accessible, like a central node where we can initiate the copy of the data and then call UpdateProperties + UpdateLocation, so the replica could have it's own metadata json which holds the correct info.

In our cases for the on-prem to cloud migration it is possible that the file copy is done by a different method and then we need to read and update the metadata json file, in which the locations are not accessible. So it would be good if we can run UpdateProperties + UpdateLocation on an "invalid" table.

jackye1995 commented 3 years ago

If so it highlights how tricky could it be if we want to make sure that every LocationProvider specific parameter is handled correctly when we are updating the locations in the metadata json

Yes, that is why I was proposing if we decide to go with relative path, LocationProvider should expose a method that describes the true file path root it is using, maybe something like LocationProvider.root(), and any plugin implementation can override this method if necessary. But LocationProvider only governs the path for data files, and metadata file paths are governed by TableOperations.metadataFileLocation(String fileName), so we should also have a new method likeTableOperations.metadataRoot()` to describe the true root. Those 2 roots should be sufficient for any reader and writer to use if they stick with the Iceberg core IO library.

On Trino side it's quite more complicated because some paths are hard coded and directly generated based on Hive conventions, but I think that is an issue to fix on Trino side: https://github.com/trinodb/trino/blob/master/plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergPageSink.java#L303-L308

But just keep in mind that there is always a risk for a compute engine to not stick with those things, and it will be hard to enforce.

flyrain commented 3 years ago

@pvary, yes, the relative path approach looks more elegant than the metadata rebuild one. As I mentioned, it is close to the ideal case, which doesn't rewrite the metadata files.

What do we want to change in the json? Is it only the path of the table, or we have to rewrite something else as well? Could we use something like the LocationProvider (which generates new datafile locations) to make the path resolution pluggable and store only the config in the table?

Need to change the location in metadata.json, or a combination of three table properties mentioned by @rdblue and @jackye1995.

Thanks @jackye1995 's proposal. It makes sense to have a centralized place to get the root for data and metadata since there are multiple ways to specify their locations. We can also hide the logic of how to prioritize them. cc @anuragmantri.

pvary commented 3 years ago

Why just get the root for the data/metadata from the LocationProvider?

Why not expose these methods instead:

// initialize the provider with the root paths and the `useRelativePath` config,
// or have different implementations like:
// - AbsoluteLocationProvider,
// - RelativeLocationProvider?
LocationProvider(Properties tableProperties);

// generate the absolute path for the metadata files
Path metaDataLocation(String file);
// generate the absolute path for the metadata files
Path dataLocation(String file);

// generate the value we put into the avro files - absolute or relative path
String metaDataKey(Path metadataFile);
// generate the value we put into the avro files - absolute or relative path
String dataKey(Path metadataFile);
anuragmantri commented 3 years ago

Catching up on the comments from past couple of days. Thanks everyone for providing inputs.

@flyrain - Although it will increase the scope of this design to include bi-directional replication, we should consider covering that here.

Overall, I agree with @jackye1995's proposal. Thanks for the detailed explanation.

Then these 4 properties are used to get 2 absolute root paths:

root path to write data: the LocationProvider would read some fields in this list to determine the base root path, and generate absolute paths to write data files. root path to write metadata: similarly, the TableOperations.metadataFileLocation reads some fields above to determine the base root path, and generate paths for metadata files.

In the initial design doc, my thinking was the writes mentioned above need to change at all. However, reading the comments from @pvary above, we may indeed have to store a root location somewhere for cases where we want to read inaccessible paths. Whether we want to have explicit method to set this root before accessing a table or have a mechanism for replicated tables to register themselves as replicas and then use a retry logic can be discussed.

My takeaways from this are:

  1. Metadata writers need to know a) Relative path boolean b) True root of the table c) Metadata path if set Additionally, they will also convert paths to relative in metadata files.

  2. Data writers need to know a) Data path if set b) True root of the table

An update transaction will change all the above except relative path boolean. Please correct me if my understanding is incorrect.

anuragmantri commented 3 years ago

Why not expose these methods instead:

// initialize the provider with the root paths and the useRelativePath config, // or have different implementations like: // - AbsoluteLocationProvider, // - RelativeLocationProvider? LocationProvider(Properties tableProperties);

// generate the absolute path for the metadata files Path metaDataLocation(String file); // generate the absolute path for the metadata files Path dataLocation(String file);

// generate the value we put into the avro files - absolute or relative path String metaDataKey(Path metadataFile); // generate the value we put into the avro files - absolute or relative path String dataKey(Path metadataFile);

@pvary - This may make the code change larger. Other than that, I don't see why this cannot be done unless I'm missing something.

pvary commented 3 years ago

One more use-case which also means that we have to manipulate path in the snapshot/metadata files is when we would like to enable temporary access for Iceberg tables stored on blobstores. In this case we have to create presigned URLs to access the specific files and we have to use this URL in the metadata of the table

anuragmantri commented 2 years ago

The design doc has been updated to add initial support for relative paths and multiple root locations for a table. Please review the design.

asheeshgarg commented 2 years ago

can we use migrate_table stored procedure for this?

flyrain commented 2 years ago

@asheeshgarg, no, the procedure is mainly for migrating existing Hive or Spark tables to Iceberg, https://iceberg.apache.org/docs/latest/spark-procedures/#table-migration.

ksmatharoo commented 1 year ago

I don't think it is a good idea in general to use relative paths. We recently had an issue where using a hdfs location without authority caused a user's data to be deleted by the RemoveOrphanFiles action because the resolution of the table root changed. The main problem is that places in Iceberg would need to have some idea of "equivalent" paths and path resolution. Full URIs are much easier to work with and more reliable.

But there is still a way to do both. Catalogs and tables can inject their own FileIO implementation, which is what is used to open files. That can do any resolution that you want based on environment. So you could use an implementation that allows you to override a portion of the file URI and read it from a different underlying location. I think that works better overall because there are no mistakes about equivalent URIs, but you can still read a table copy without rewriting the metadata.

@rdblue We tried injecting own FileIO via spark config which will replace the table/metadata path prefix with new location, this works till reading table metadata but while reading parquet files it struck in BatchDataReader.java in following function, Please provide your thoughts on this if there is some other way of achieving this.

protected CloseableIterator open(FileScanTask task) { String filePath = task.file().path().toString(); LOG.debug("Opening data file {}", filePath);

// update the current file for Spark's filename() function
InputFileBlockHolder.set(filePath, task.start(), task.length());

Map<Integer, ?> idToConstant = constantsMap(task, expectedSchema());

/* below given code line is causing issue because its searching name given in metadata in the map which will be replaced by custom FileIO, changing this line to InputFile inputFile = table.io().newInputFile(filePath); is making it work but this remove the encryption logic, in short we couldn't make it work with only Custom FileIO /

InputFile inputFile = getInputFile(filePath);
Preconditions.checkNotNull(inputFile, "Could not find InputFile associated with FileScanTask");
SparkDeleteFilter deleteFilter =
    task.deletes().isEmpty()
        ? null
        : new SparkDeleteFilter(filePath, task.deletes(), counter());

return newBatchIterable(
        inputFile,
        task.file().format(),
        task.start(),
        task.length(),
        task.residual(),
        idToConstant,
        deleteFilter)
    .iterator();

}

jotarada commented 11 months ago

Hey any news related to this subject? This is relevant for us and would guess for other users when trying to introduce alluxio FS.

We would like to have a load process that would write to object storage (table A pointing to fs://location) and then create a new table table pointing to alluxio fs mounted in the object storage (table A1 pointing to alluxio://location) and read from the same files. The problem today is even if we get the new table location in alluxio the metadata files and manifest files will be pointing to the writing FS an not use the relative path from alluxio to get the files via alluxio and leverage the cache. Instead the files will be read from the FS ignoring alluxio, only using it to get the 1st metadata file.

Does this makes a case for the relative path for manifests and metadata files?

Or do you think there is another way to achieve such architecture?

abmo-x commented 10 months ago

@jotarada yes, relative path should help with this use case

However, you don't really need relative paths if you use the same paths in alluxio. you can just configure the s3 endpoint config to use alluxio endpoint and read from alluvia, you don't even have to create table A1

table A fs: fs://location/db/tableA/data/metadata alluxio: fs://location/db/tableA/data/metadata

set s3 endpoint to alluxio for read, in spark set 'spark.fs.s3a.endpoint' and that should read from alluxio instead of s3, no need for new table or relative paths.

jotarada commented 10 months ago

@abmo-x Can i do that on trino as well?

jotarada commented 7 months ago

Any updates on this?

anuragmantri commented 7 months ago

Hi @jotarada, unfortunately, I had to move on to other priorities and could not continue working on this issue. If you are interested, feel free to contribute, I will be happy to help review changes and provide any other support.

ksmatharoo commented 5 months ago

@jotarada We encountered a similar issue when we switched S3 vendors, necessitating the copying of all data to another bucket, resulting in unreadable iceberg tables. Currently, we have two options: either attempt to rewrite metadata with custom code or inject custom file IO. We have experimented with both approaches, and I can provide you with our code for reference. https://github.com/ksmatharoo/IcebergCustomFileIO https://github.com/ksmatharoo/IcebergMetadataRewrite

buremba commented 1 month ago

I'm trying to use Cloudflare R2 Sippy in front of S3, and I also run into this issue. Snowflake uses a full absolute path to S3 when generating manifest and when I try querying the Iceberg table on Duckdb via its own R2 integration, it still goes to S3.

I also tried downloading the files from S3 locally and query them in DuckDB and Clickhouse and they both try to pull the data from S3 while the copies exist locally. Unfortunately rewriting the metadata is not an option as we don't write the data but provide a read-only solution.

flyrain commented 1 month ago

I also tried downloading the files from S3 locally and query them in DuckDB and Clickhouse and they both try to pull the data from S3 while the copies exist locally.

That's right. The metadata files have absolute paths pointing to s3 locations. We need to revisit the relative path approach for your use case.

akizminet commented 1 month ago

We also face the same issue.

lightmelodies commented 1 week ago

For anyone who is interested: https://github.com/lightmelodies/iceberg-relative-io