pavibhai opened this issue 4 years ago
I remember we discussed this use case some time ago in a sync up meeting, and the related issue was #1531. The general feedback was that it is better to expose this as a Spark procedure, and rewrite the manifests with new file path URIs to reflect new location after replication or backup. This avoids the trouble of redefining the Iceberg spec. Any thoughts?
I don't think it is a good idea in general to use relative paths. We recently had an issue where using an `hdfs` location without authority caused a user's data to be deleted by the `RemoveOrphanFiles` action because the resolution of the table root changed. The main problem is that places in Iceberg would need to have some idea of "equivalent" paths and path resolution. Full URIs are much easier to work with and more reliable.
But there is still a way to do both. Catalogs and tables can inject their own `FileIO` implementation, which is what is used to open files. That can do any resolution that you want based on environment. So you could use an implementation that allows you to override a portion of the file URI and read it from a different underlying location. I think that works better overall because there are no mistakes about equivalent URIs, but you can still read a table copy without rewriting the metadata.
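As a rough illustration of that `FileIO`-based resolution, a wrapper could rewrite a configured URI prefix before delegating to the underlying storage. This is only a sketch (the class name is made up; a real version would implement `org.apache.iceberg.io.FileIO` and delegate `newInputFile`/`newOutputFile` to a wrapped `FileIO`):

```java
// Hypothetical sketch: rewrite a URI prefix before opening files.
// A real implementation would implement org.apache.iceberg.io.FileIO
// and apply this rewrite inside newInputFile/newOutputFile.
class PrefixRewrite {
    private final String fromPrefix;
    private final String toPrefix;

    PrefixRewrite(String fromPrefix, String toPrefix) {
        this.fromPrefix = fromPrefix;
        this.toPrefix = toPrefix;
    }

    // Rewrite the location if it starts with the configured source prefix;
    // unrelated locations pass through untouched.
    String rewrite(String location) {
        if (location.startsWith(fromPrefix)) {
            return toPrefix + location.substring(fromPrefix.length());
        }
        return location;
    }
}
```

Because the rewrite happens at open time, the metadata keeps its original absolute URIs and only the environment-specific mapping changes per deployment.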
@jackye1995 and @rdblue Thanks for your feedback.
Our current thought process is as follows:
- Relative paths that traverse above the table location (`../`) shall not be allowed.
- I don't think the scenario called out about a wrong HDFS location being cleaned will be worsened or improved as a result of this change. If the absolute path of the table is incorrectly interpreted to be something else, then problems will still happen.
- We will explore the option of `FileIO` in parallel while we discuss further the complications of using relative paths.
Relative paths would help us too.
We certainly need to have support for a federated namespace with a virtual file system. We're moving to HDFS federation and moving our tables to be located in a virtual filesystem (e.g. `gridfs://cluster1/table`) so that we can move the data around without changing all of the metadata. The challenge here is that we need to make sure that the manifests don't contain the physical HDFS path, which is likely to break.
For data access control, it seems like a really good invariant that all data is under the table's path.
I'll add this as a discussion topic for the next Iceberg sync.
We've put together a design doc for this proposal. We thought through a few scenarios and added an implementation. We will be glad to consider other ideas as well.
Thanks for the design doc! I wonder if this can be done through changes in Catalog and FileIO without touching the Iceberg specification, which can be much simpler to ship without conflicting with already quite a few changes for the format v2 specifications.
The key difficulty I see is that table property is a part of the table metadata file. Therefore we cannot know if a table should use any special file path override without actually reading that file. If we have a way to know it without using FileIO, then we can decide for a specific table, all files should be read through a region-specific client for replication, or a different URI scheme for backup and HDFS federation.
So here is my alternative proposal:

- Introduce a `TablePropertyIO` that is dedicated to read and write table properties.
- Update catalogs (e.g. `BaseMetastoreCatalog`) to use `TablePropertyIO` when necessary, by adding it as a part of the `TableOperations` interface, similar to `FileIO io()`.
- Add a catalog property, `table-io-impl`, that can be customized to do things like replication, backup, etc., separately from the `FileIO` classes.

Could this proposal satisfy your requirements?
Thanks for your comments @jackye1995. We discussed your proposal and here are our thoughts.
Allowing `FileIO` implementations to internally interpret them differently can cause confusion since the same path could mean totally different locations based on the underlying implementation. With our approach, the interpretation of a path does not depend on the `FileIO` implementation.
For the reasons above, we prefer the approach laid out in the design doc. We understand the concerns around Iceberg spec with v2 format already changing a lot. We are open to ideas on shipping this with the least interference (even if it means we wait till v2 gets out).
Please let us know your thoughts.
I read the doc through the Proposal and Approach sections and I am quite confused about what your proposal is. Could you clarify it a bit? It may be obvious from the following example sections what the changes are and why they are needed, but I think that information should be in the proposal/approach.
Thanks for the initial feedback @rdblue. I rephrased the proposal and approach sections. Could you please take another look at the document?
Thanks, Anurag
@flyrain, could you take a look at what can be done here?
@flyrain, in my latest comment to Ryan's comment in the design doc, I proposed starting with "supporting relative paths in metadata" which does not need changes to the spec. We can then look at refactoring the fields of metadata.json file. This will make it easier to implement and review the change. What are your thoughts?
Thanks @anuragmantri for filing the PR and updating the design doc. It is great that there is no spec change in the latest design.
I looked through the design doc. It seems to be much better than the last version, so thank you for the update.
Overall, I think it is still incorrect on a few points and is quite a bit longer than it will need to be in the end. The main confusion seems to be where the table location comes from. Iceberg table metadata tracks a location string that is written into all metadata.json files. This is the table's location. Iceberg does not require any location other than this one and there isn't a way to pass a different location in the `TableOperations` API; the only way to pass a table location is through `TableMetadata`.
There are also a few table properties that can affect locations:

- `write.metadata.path`, which is defaulted to the `metadata` folder under the table location
- `write.folder-storage.path`, which is defaulted to the `data` folder under the table location
- `write.object-storage.path`
There are two broad classes of tables: Hadoop tables, where the location in `TableMetadata` is identical to the location used to identify the table when it is created, and metastore tables, where the location comes from the table metadata. For Hadoop tables, it makes sense that the table location used for relative paths is the location that gets passed to create `HadoopTableOperations`. But then there could be a problem that table metadata actually points to a different location unless it is modified.

For table metadata files, there is only an external location for Hadoop tables. Otherwise, the table location must come from the table's metadata.json file. That makes it difficult to make `metadata_location` and `previous_metadata_location` relative paths. I would probably not attempt to make these paths relative, but I'm open to ideas.
We could update the spec to allow relative metadata.json paths for Hadoop tables, but will need to state how to handle a different `location` in the metadata file. (Ignore?)
Metadata locations aside, I think that it is a good plan to make the relative paths transparent. Most of the library should continue to operate on absolute paths and relative paths should be made while writing and made absolute while reading. That makes it so the changes are fairly easily tested. If you do this, then I think there is a lot less to cover in the design doc.
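The transparent conversion described here could be sketched as a pair of helpers applied on write and read respectively (illustrative names; real code would also need to handle URI normalization and schemes carefully):

```java
// Sketch of transparent path conversion: paths are made relative to the
// table location on write and resolved back to absolute on read.
class RelativePaths {
    // On write: strip the table root so only the relative part is stored.
    static String toRelative(String tableLocation, String absolutePath) {
        String root = tableLocation.endsWith("/") ? tableLocation : tableLocation + "/";
        return absolutePath.startsWith(root) ? absolutePath.substring(root.length()) : absolutePath;
    }

    // On read: resolve a stored relative path against the current table root.
    static String toAbsolute(String tableLocation, String storedPath) {
        if (storedPath.contains("://")) {
            return storedPath; // already an absolute URI, keep as-is
        }
        String root = tableLocation.endsWith("/") ? tableLocation : tableLocation + "/";
        return root + storedPath;
    }
}
```

Because the rest of the library only ever sees absolute paths, a round-trip through these two helpers is easy to test in isolation.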
@flyrain and @anuragmantri, see my comment above. Thanks for working on this!
@rdblue: We would like to use relative path for replication of the Iceberg tables between on-prem Hive instances, or moving on-prem Iceberg tables to the cloud and sometimes even moving data from the cloud to on-prem.
While we can do this by rewriting the manifest and snapshot files, it would be much easier to synchronize only the files themselves and update the metadata_location / previous_metadata_location for the target table and be done with it.
Because of the use-cases above it would be useful to have relative paths in the Iceberg metadata for Metastore tables as well, and the absolute path could be generated using the Metastore table location. Maybe even the `metadata_location` / `previous_metadata_location` could be generated this way, so the tables on both ends of the replication could have the same data and metadata too.
@pvary, we can explore that, but it is fairly easy to write a new metadata file and use that for a table. At least that's a small operation, compared to rewriting all metadata and position delete files. My main point is that we need to have a well-documented plan for tracking the table location.
Thanks @rdblue for explaining how locations are tracked in Iceberg tables. Based on your and @pvary's comments, I updated the design doc to include:

For `HadoopTables`:
- The location passed to create `HadoopTableOperations` will be used as the base table location.
- If the metadata file has a different `location`, we will ignore it. (Do you see any issues?)

For metastore tables:
- For `HiveTable`, use the `updateLocation()` API to create a new metadata.json file and use the `location` attribute in that as the base table location.
- Otherwise, use the `location` attribute of the metadata.json file as the base table location.

Path conversion is transparent to APIs. I removed sections of the doc that refer to APIs that do not change.

What are your thoughts?
For table metadata files, there is only an external location for Hadoop tables. Otherwise, the table location must come from the table's metadata.json file. That makes it difficult to make metadata_location and previous_metadata_location relative paths. I would probably not attempt to make these paths relative, but I'm open to ideas.
Hi @rdblue, we might leave metadata_location and previous_metadata_location as is; they are table properties in a Metastore like HMS and are not affected when we just move files from the source table to the target. They are absolute paths, but they still point to the right place without any change.
@pvary, we can explore that, but it is fairly easy to write a new metadata file and use that for a table. At least that's a small operation, compared to rewriting all metadata and position delete files.
@rdblue: I am not sure that I understand what you are proposing here. I might be mistaken, but AFAIK we store the locations of the files in the `snapshot` file, in the `manifest` files, and also (in v2) in the `position delete` files as well. If we want to create a usable full copy of the table, then currently we need to rewrite the file references in the metadata (`snapshot`, `manifest` files and `position delete` files).

Is there an easier way to create a full working replica of an Iceberg table where we do not use any files/data from the original table and the 2 tables (original and the new) can live independently after the creation of the replica?
My main point is that we need to have a well-documented plan for tracking the table location.
Totally agree with this 👍
Thanks, Peter
@pvary, I'm talking only about the metadata.json files. I agree that using relative paths for manifest lists, manifests, and data files is a good idea. But we keep the table location in the root metadata file and so it is difficult to make its location relative.
I think we could do this for Hadoop tables by ignoring the location in metadata.json because it must match the table location. But Hadoop tables are mostly for testing and I don't recommend using them in practice. The bigger issue is how to track the table location for Hive or other metastore tables. In that case, I don't think it is unreasonable to use `updateLocation` to change the location of the table, but it is definitely inconvenient.
Is there an easier way to create a full working replica of an Iceberg table where we do not use any files/data from the original table and the 2 tables (original and the new) can live independently after the creation of the replica?
@pvary, ideally, table replication doesn't involve data file rewrites or metadata (manifest-list, manifest, metadata.json) rewrites. The process would be as simple as: the user copies all the files needed, then changes the target table properties to get the new status. It isn't the case in reality though.

In this issue thread, we were talking about two ways to replicate a table: 1. relative paths, 2. rebuilding the metadata files. Neither of them requires data file rewrites. However, the relative-path approach requires minimal metadata file rewrites, probably only metadata.json per our discussion. But the metadata-rebuild approach involves rewriting all three types of metadata files: metadata.json, manifest-list, and manifest. Every type of file stores table information that cannot be recreated only by looking at the data files. For example, the partition spec in metadata.json and its id in the manifest file, and the snapshot-related metadata.

To your question, both source and target tables should be able to live independently after the replication. That's relatively easy to achieve. The hard part is to enable incremental sync-up between them and bidirectional replication, which are quite common DR (disaster recovery) use cases.
Thanks for the detailed answer @flyrain! This really helps to have a clear understanding of the tasks at hand.
I would like to share my thoughts about 2 points:
The hard part is to enable incremental sync-up between them and bidirectional replication, which are quite common DR(Disaster recovery) use cases.
I think that if we chose the relative path approach then replication becomes quite straightforward, since we do not have to handle the mix of different paths (source absolute path for the new metadata files, destination absolute path for the old metadata files). We just need to copy the metadata files for one directional replication. The bidirectional replication is a different kettle of fish because of the commit resolution complexity, but I think it is also easier since we do not have to care about manifests and manifest-lists
However, the relative-path approach requires the minimal metadata file rewrite, probably only metadata.json per our discussion.
What do we want to change in the json? Is it only the path of the table, or we have to rewrite something else as well? Could we use something like the LocationProvider (which generates new datafile locations) to make the path resolution pluggable and store only the config in the table?
Thanks, Peter
Just trying to catch up with the thread, and reread the design doc.
For the design doc, I think we should not make `HiveTable` a separate case to discuss; all tables that are retrieved through a `Catalog` implementation should operate the same if we introduce relative paths. We should not try to break the abstraction layer and treat Hive tables separately.
To have a brief summary (and also potentially answer Peter's question), from my understanding we are leaning towards a solution where the following items keep track of the true table roots and should be absolute, while all other paths can be relative:

- `location`
- `write.metadata.path`
- `write.folder-storage.path`
- `write.object-storage.path`

Then these 4 properties are used to get 2 absolute root paths:

- root path to write data: the `LocationProvider` would read some fields in this list to determine the base root path, and generate absolute paths to write data files.
- root path to write metadata: similarly, `TableOperations.metadataFileLocation` reads some fields above to determine the base root path, and generates paths for metadata files.

These probably should be exposed as new methods to make sure the same location derivation logic is used everywhere.
Once the content of a file is written, the paths are written to one metadata layer above with/without absolute path based on a flag proposed in the design doc, and the writer of this path needs to know what are the true root paths used in the 2 places above. On the read side, the absolute path is created based on the 2 absolute root paths. Those operations are done inside each specific IO method by passing the relative/absolute path, the correct absolute root path, and the boolean flag, to make sure operations are transparent above the IO level.
All updates to the 4 locations on the top can be done in a UpdateProperties + UpdateLocation transaction, and when the metadata path is updated, the `metadata_location` in the catalog will be automatically updated with the new file root path to write and then read the latest metadata json file that is now in another root path. Therefore, we need to make sure that when this value is updated, the metadata json file writer uses the value in the update instead of the existing value. This should be able to switch a reader of an Iceberg table from one replica to another explicitly, or implement trial-and-error logic to switch to the replica after a certain number of retries. I am not sure if there is a use case such as choosing a random replica to read files, but I think that should be solved by directly registering all replicas as different tables and round-robin on the reader side.
I think if we go with this route then it sounds good to me, and we can try to hash out some more details in PRs. Please let me know if this summary is accurate, or if I have any misunderstanding.
Thanks @jackye1995 for highlighting the properties which we need to handle!
I had not known about `write.object-storage.path` before. Do I understand correctly that this is a `LocationProvider`-specific parameter for the `ObjectStoreLocationProvider`, which is more or less the same as `write.folder-storage.path` for the `DefaultLocationProvider`? If so, it highlights how tricky it could be to make sure that every `LocationProvider`-specific parameter is handled correctly when we are updating the locations in the metadata json. Maybe we should delegate this path generation to the `LocationProvider` altogether, or something similar.
All updates to the 4 locations on the top can be done in a UpdateProperties + UpdateLocation transaction
The process you have described above sounds useful to me if we have a place where both of the replicas are accessible, like a central node where we can initiate the copy of the data and then call UpdateProperties + UpdateLocation, so the replica could have its own metadata json which holds the correct info.
In our cases for the on-prem to cloud migration it is possible that the file copy is done by a different method and then we need to read and update the metadata json file, in which the locations are not accessible. So it would be good if we can run UpdateProperties + UpdateLocation on an "invalid" table.
If so it highlights how tricky could it be if we want to make sure that every LocationProvider specific parameter is handled correctly when we are updating the locations in the metadata json
Yes, that is why I was proposing if we decide to go with relative path, LocationProvider
should expose a method that describes the true file path root it is using, maybe something like LocationProvider.root()
, and any plugin implementation can override this method if necessary. But LocationProvider
only governs the path for data files, and metadata file paths are governed by TableOperations.metadataFileLocation(String fileName), so we should also have a new method like
TableOperations.metadataRoot()` to describe the true root. Those 2 roots should be sufficient for any reader and writer to use if they stick with the Iceberg core IO library.
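A root-describing method of this kind could look roughly like the following sketch. The interface name, the default method, and the trivial implementation are all hypothetical; they are not part of Iceberg's API and only illustrate how a pluggable provider could expose its true root:

```java
// Hypothetical sketch of a location provider that exposes its true root,
// as proposed above. Names and signatures are illustrative only.
interface RootAwareLocationProvider {
    // The true absolute root path used for data files.
    String root();

    // Generate an absolute data file location under the root.
    default String newDataLocation(String filename) {
        return root() + "/data/" + filename;
    }
}

// A trivial implementation that roots data files under the table location.
class TableRootProvider implements RootAwareLocationProvider {
    private final String tableLocation;

    TableRootProvider(String tableLocation) {
        this.tableLocation = tableLocation;
    }

    @Override
    public String root() {
        return tableLocation;
    }
}
```

A writer could then relativize any generated path against `root()` instead of guessing at the table location, which is the point of exposing the method.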
On Trino side it's quite more complicated because some paths are hard coded and directly generated based on Hive conventions, but I think that is an issue to fix on Trino side: https://github.com/trinodb/trino/blob/master/plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergPageSink.java#L303-L308
But just keep in mind that there is always a risk for a compute engine to not stick with those things, and it will be hard to enforce.
@pvary, yes, the relative path approach looks more elegant than the metadata rebuild one. As I mentioned, it is close to the ideal case, which doesn't rewrite the metadata files.
What do we want to change in the json? Is it only the path of the table, or we have to rewrite something else as well? Could we use something like the LocationProvider (which generates new datafile locations) to make the path resolution pluggable and store only the config in the table?
Need to change the location in metadata.json, or a combination of three table properties mentioned by @rdblue and @jackye1995.
Thanks for @jackye1995's proposal. It makes sense to have a centralized place to get the root for data and metadata since there are multiple ways to specify their locations. We can also hide the logic of how to prioritize them. cc @anuragmantri.
Why just get the root for the data/metadata from the `LocationProvider`? Why not expose these methods instead:

```java
// initialize the provider with the root paths and the `useRelativePath` config,
// or have different implementations like:
// - AbsoluteLocationProvider,
// - RelativeLocationProvider?
LocationProvider(Properties tableProperties);

// generate the absolute path for the metadata files
Path metaDataLocation(String file);

// generate the absolute path for the data files
Path dataLocation(String file);

// generate the value we put into the avro files - absolute or relative path
String metaDataKey(Path metadataFile);

// generate the value we put into the avro files - absolute or relative path
String dataKey(Path dataFile);
```
Catching up on the comments from past couple of days. Thanks everyone for providing inputs.
@flyrain - Although it will increase the scope of this design to include bi-directional replication, we should consider covering that here.
Overall, I agree with @jackye1995's proposal. Thanks for the detailed explanation.
Then these 4 properties are used to get 2 absolute root paths:
root path to write data: the LocationProvider would read some fields in this list to determine the base root path, and generate absolute paths to write data files.
root path to write metadata: similarly, the TableOperations.metadataFileLocation reads some fields above to determine the base root path, and generates paths for metadata files.
In the initial design doc, my thinking was that the writes mentioned above would not need to change at all. However, reading the comments from @pvary above, we may indeed have to store a `root` location somewhere for cases where we want to read inaccessible paths. Whether we want to have an explicit method to set this root before accessing a table, or have a mechanism for replicated tables to register themselves as replicas and then use retry logic, can be discussed.
My takeaways from this are:

- Metadata writers need to know: a) the relative path boolean, b) the true root of the table, c) the metadata path if set. Additionally, they will also convert paths to relative in metadata files.
- Data writers need to know: a) the data path if set, b) the true root of the table.
- An update transaction will change all of the above except the relative path boolean.

Please correct me if my understanding is incorrect.
Why not expose these methods instead: `LocationProvider(Properties tableProperties)`; `Path metaDataLocation(String file)`; `Path dataLocation(String file)`; `String metaDataKey(Path metadataFile)`; `String dataKey(Path dataFile)`
@pvary - This may make the code change larger. Other than that, I don't see why this cannot be done unless I'm missing something.
One more use-case which also means that we have to manipulate paths in the snapshot/metadata files is when we would like to enable temporary access for Iceberg tables stored on blobstores. In this case we have to create presigned URLs to access the specific files, and we have to use these URLs in the metadata of the table.
The design doc has been updated to add initial support for relative paths and multiple root locations for a table. Please review the design.
can we use migrate_table stored procedure for this?
@asheeshgarg, no, the procedure is mainly for migrating existing Hive or Spark tables to Iceberg, https://iceberg.apache.org/docs/latest/spark-procedures/#table-migration.
I don't think it is a good idea in general to use relative paths. We recently had an issue where using a
hdfs
location without authority caused a user's data to be deleted by theRemoveOrphanFiles
action because the resolution of the table root changed. The main problem is that places in Iceberg would need to have some idea of "equivalent" paths and path resolution. Full URIs are much easier to work with and more reliable.But there is still a way to do both. Catalogs and tables can inject their own
FileIO
implementation, which is what is used to open files. That can do any resolution that you want based on environment. So you could use an implementation that allows you to override a portion of the file URI and read it from a different underlying location. I think that works better overall because there are no mistakes about equivalent URIs, but you can still read a table copy without rewriting the metadata.
@rdblue We tried injecting our own FileIO via Spark config, which replaces the table/metadata path prefix with the new location. This works for reading the table metadata, but while reading Parquet files it gets stuck in BatchDataReader.java in the following function. Please provide your thoughts on this if there is some other way of achieving this.
```java
protected CloseableIterator<ColumnarBatch> open(FileScanTask task) {
    String filePath = task.file().path().toString();
    // update the current file for Spark's filename() function
    InputFileBlockHolder.set(filePath, task.start(), task.length());

    Map<Integer, ?> idToConstant = constantsMap(task, expectedSchema());

    // The line below is causing the issue because it looks up the name given in the
    // metadata in a map, which will have been replaced by the custom FileIO. Changing
    // this line to `InputFile inputFile = table.io().newInputFile(filePath);` makes it
    // work, but that removes the encryption logic. In short, we couldn't make it work
    // with only a custom FileIO.
    InputFile inputFile = getInputFile(filePath);
    Preconditions.checkNotNull(inputFile, "Could not find InputFile associated with FileScanTask");

    SparkDeleteFilter deleteFilter =
        task.deletes().isEmpty()
            ? null
            : new SparkDeleteFilter(filePath, task.deletes(), counter());

    return newBatchIterable(
            inputFile,
            task.file().format(),
            task.start(),
            task.length(),
            task.residual(),
            idToConstant,
            deleteFilter)
        .iterator();
}
```
Hey, any news related to this subject? This is relevant for us, and I would guess for other users too, when trying to introduce the Alluxio FS.

We would like to have a load process that writes to object storage (table A pointing to fs://location) and then create a new table pointing to an Alluxio FS mounted on the object storage (table A1 pointing to alluxio://location) and read from the same files. The problem today is that even if we give the new table a location in Alluxio, the metadata files and manifest files will be pointing to the writing FS and not use the relative path from Alluxio to get the files via Alluxio and leverage the cache. Instead, the files will be read from the FS, ignoring Alluxio, which is only used to get the first metadata file.

Does this make a case for relative paths for manifests and metadata files? Or do you think there is another way to achieve such an architecture?
@jotarada yes, relative paths should help with this use case.

However, you don't really need relative paths if you use the same paths in Alluxio. You can just configure the S3 endpoint config to use the Alluxio endpoint and read from Alluxio; you don't even have to create table A1:

table A
fs: fs://location/db/tableA/data/metadata
alluxio: fs://location/db/tableA/data/metadata

Set the S3 endpoint to Alluxio for reads. In Spark, set 'spark.fs.s3a.endpoint' and that should read from Alluxio instead of S3, no need for a new table or relative paths.
@abmo-x Can I do that on Trino as well?
Any updates on this?
Hi @jotarada, unfortunately, I had to move on to other priorities and could not continue working on this issue. If you are interested, feel free to contribute, I will be happy to help review changes and provide any other support.
@jotarada We encountered a similar issue when we switched S3 vendors, necessitating the copying of all data to another bucket and resulting in unreadable Iceberg tables. Currently, we have two options: either attempt to rewrite metadata with custom code or inject a custom FileIO. We have experimented with both approaches, and I can provide you with our code for reference: https://github.com/ksmatharoo/IcebergCustomFileIO https://github.com/ksmatharoo/IcebergMetadataRewrite
I'm trying to use Cloudflare R2 Sippy in front of S3, and I also ran into this issue. Snowflake uses a full absolute path to S3 when generating the manifest, and when I try querying the Iceberg table on DuckDB via its own R2 integration, it still goes to S3.
I also tried downloading the files from S3 locally and query them in DuckDB and Clickhouse and they both try to pull the data from S3 while the copies exist locally. Unfortunately rewriting the metadata is not an option as we don't write the data but provide a read-only solution.
I also tried downloading the files from S3 locally and query them in DuckDB and Clickhouse and they both try to pull the data from S3 while the copies exist locally.
That's right. The metadata files have absolute paths pointing to s3 locations. We need to revisit the relative path approach for your use case.
We also face the same issue.
For anyone who is interested: https://github.com/lightmelodies/iceberg-relative-io
I'm primarily interested in this as a means to enable unit testing; pyiceberg seems to be able to read and write relative paths of the form:

- my-warehouse/...
- file:my-warehouse/...
- file://my-warehouse/...
- file://./my-warehouse/...
For anyone who is interested: https://github.com/lightmelodies/iceberg-relative-io
Refactored; now it should be very user-friendly with just a one-line change.
spark.sql.catalog.test.catalog-impl org.apache.iceberg.hadoop.HadoopRelativeCatalog
Background
Iceberg specification captures file references for the following:

- `location`, which determines the base location of the table.
- `manifests` and `manifest-list`, to determine the manifests that make up the snapshot.
- `manifest_path`, which identifies the location of a manifest file.
- `file_path`, which identifies a data file.
- `file_path`, which identifies a data file on which a position-based delete is to be applied.

All the file references are absolute paths right now.
Challenge
Copying a table to another location could arise in the following use cases:

Absolute file references require the references to be rewritten to reflect the new target location of the table before the table can be consumed.
Solution Option
We could support relative paths as follows:

- `location` shall always reference the absolute location of the table.
- All other file references shall be resolved relative to the `location` from the table metadata.

We might further consider splitting the Table metadata into two pieces:
- `format-version`
- `table-uuid`
- `location`
- `sequence-number`
- `current-snapshot-id`
- `schema`
This will ensure the following:

- A copied table becomes consumable by updating only the `location` attribute in the table metadata.
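To illustrate the idea, a purely hypothetical metadata.json fragment under this scheme might keep `location` absolute while other references are stored relative to it (the field values below are made up and the structure is simplified relative to the full spec):

```json
{
  "location": "hdfs://cluster1/warehouse/db/tbl",
  "snapshots": [
    {
      "snapshot-id": 1,
      "manifest-list": "metadata/snap-1.avro"
    }
  ]
}
```

After a copy, only `location` would change; the relative `manifest-list` reference would then resolve against the new root.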