DataFile External Identifier Field

mccheah commented 5 years ago

A URI locating a file may not be enough for file I/O implementations to construct InputFile and OutputFile instances, as proposed in https://github.com/apache/incubator-iceberg/issues/12. More specifically, consider a system where a file has some path, but that same path can be namespaced in different contexts. For example, the metadata for that same file can evolve over time, as we discussed in https://github.com/apache/incubator-iceberg/issues/16.

We propose adding another field called an ExternalIdentifier to the DataFile schema, which is an optional String tag allowing custom Iceberg consumers to look up the file in their system using their own unique identification mechanisms. This would allow such systems to look up the file directly by the identifier in addition to the path.

Alternative representations for the ExternalIdentifier that would allow for richer content could be a byte blob or a struct with some schema that's stored in the table properties. However, those representations can encourage more arbitrary and uncontrolled use of the field, which we probably want to avoid. String seems to be the safest option.
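To make the proposal concrete, here is a minimal sketch of a data file entry that carries an optional String identifier next to its path. Everything in it is an assumption for illustration: `DataFileEntry`, `externalIdentifier()`, and `lookupKey` are hypothetical names, not the actual Iceberg `DataFile` API, and the "prefer the identifier, fall back to the path" rule is just one way a consumer might use the tag.

```java
import java.util.Optional;

// Hypothetical sketch only; this is NOT the real Iceberg DataFile interface.
final class ExternalIdentifierSketch {

    /** Minimal stand-in for a data file entry: a path plus an optional opaque String tag. */
    static final class DataFileEntry {
        private final String path;
        private final String externalIdentifier; // null when the writer did not set one

        DataFileEntry(String path, String externalIdentifier) {
            this.path = path;
            this.externalIdentifier = externalIdentifier;
        }

        String path() {
            return path;
        }

        /** The tag is optional, so older entries without it remain valid. */
        Optional<String> externalIdentifier() {
            return Optional.ofNullable(externalIdentifier);
        }
    }

    /** Assumed consumer behavior: look up by identifier when present, otherwise by path. */
    static String lookupKey(DataFileEntry file) {
        return file.externalIdentifier().orElse(file.path());
    }

    public static void main(String[] args) {
        DataFileEntry tagged = new DataFileEntry("s3://bucket/tbl/a.parquet", "source-system-42");
        DataFileEntry untagged = new DataFileEntry("s3://bucket/tbl/b.parquet", null);
        System.out.println(lookupKey(tagged));
        System.out.println(lookupKey(untagged));
    }
}
```

Keeping the tag an opaque String means the table format never has to interpret it, which is the "safest option" argument above. A byte blob or struct would need extra schema handling in the same place.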

mccheah commented 5 years ago

@vinooganesh @yifeih

rdblue commented 5 years ago

> For example, the metadata for that same file can evolve over time, as we discussed in #16.

The examples on that issue were compression codec, which is stored in the Avro and Parquet formats (and I assume ORC) and CSV delimiter. I don't think either of those is a compelling reason to add custom metadata.

This adds the idea of a path "namespaced" differently in multiple contexts. I don't get that. What do you mean?

vinooganesh commented 5 years ago

Hey @rdblue - quickly jumping in here. I think the mentality is that a file path as the sole identifier of a file may not suffice for every use case. Having an additional file identifier (independent of the physical path itself) would allow consumers of the system to both group logically similar files and run operations on them. Specifically, let's say that I have something of a "source system" notion that I would want to persist on a per-file basis. Having this state as an attribute on the File object itself would support this type of use case. Does that make sense?

rdblue commented 5 years ago

@vinooganesh, I don't really understand the use case. How would you use the identifier?

vinooganesh commented 5 years ago

So I see 2 uses for this: (1) Identifier shared across files - let's say that I have a bunch of files that make up an RDD that come from different systems (for example, let's say we're a bank and we have a bunch of customers from M&A, wealth management, etc.) and they each give us a list of their customers that we union together to make up the RDD. Let's say one of them is corrupt / doesn't work, and thus our RDD is in a bad state. Having this identifier would allow us to link the file to the source system the file came from and allow us to talk to the data owners to remedy the issue.

(2) Unique identifier on a per file basis - In this situation, we simply want a way to retrieve some static information on a per file basis outside of the path itself. For example, I think of this as something like the Descriptor in the SSTable object in Cassandra (https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/io/sstable/Descriptor.java#L62). The object does include the directory (https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/io/sstable/Descriptor.java#L56), but also includes something like the FormatType (https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/io/sstable/Descriptor.java#L62 - an enum for different SSTable formats).

Taking a step back, it does sound a bit like metadata, but I do think having this type of information somewhere is important. Cassandra kind of hacks around it by encoding things like the SSTable version in the name itself (the path contains it), but we don't really have a similar way to retrieve this type of information without this identifier.

rdblue commented 5 years ago

Sorry, but I'm still not getting why this is necessary.

For example 1, why wouldn't you just add a "source" column to the data? That way you could do something like DELETE FROM table WHERE source = 'bad_source_id'

For example 2, why is a secondary identity required? Can the external system not identify files by path? What would happen to this metadata when data files are merged? I'm reluctant to add a user metadata field that prevents maintenance because we don't know how to fill it in.

rdblue commented 5 years ago

After talking with @vinooganesh, @yifeih, and @mccheah, we decided that the use case was to be able to hook into the logic that creates file paths. The partition information stored for each DataFile is the "logical" path for a file, so the need is just to set the physical path. This generalizes the object storage vs. folder storage feature, so it makes sense to allow TableOperations or FileIO to control data file paths.
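A rough sketch of what such a hook could look like, under stated assumptions: `DataLocationProvider`, `FolderLayout`, and `HashedLayout` are hypothetical names invented for this illustration, not existing TableOperations or FileIO methods. The idea is only that the partition path stays the logical location while a pluggable component picks the physical one.

```java
// Hypothetical sketch only; none of these types are claimed to exist in Iceberg.
final class DataPathHookSketch {

    /** Assumed hook: maps a file's logical (partition) path to a physical location. */
    interface DataLocationProvider {
        String newDataLocation(String partitionPath, String fileName);
    }

    /** Folder-style layout: the physical path mirrors the partition path. */
    static final class FolderLayout implements DataLocationProvider {
        private final String tableLocation;

        FolderLayout(String tableLocation) {
            this.tableLocation = tableLocation;
        }

        @Override
        public String newDataLocation(String partitionPath, String fileName) {
            return String.format("%s/data/%s/%s", tableLocation, partitionPath, fileName);
        }
    }

    /** Object-store-style layout: a hash prefix spreads keys instead of nesting them by partition. */
    static final class HashedLayout implements DataLocationProvider {
        private final String tableLocation;

        HashedLayout(String tableLocation) {
            this.tableLocation = tableLocation;
        }

        @Override
        public String newDataLocation(String partitionPath, String fileName) {
            String logical = partitionPath + "/" + fileName;
            // String.hashCode is specified by the JDK, so the prefix is stable across runs.
            String prefix = String.format("%08x", logical.hashCode());
            return String.format("%s/%s/%s", tableLocation, prefix, logical);
        }
    }

    public static void main(String[] args) {
        DataLocationProvider folder = new FolderLayout("s3://bucket/tbl");
        DataLocationProvider hashed = new HashedLayout("s3://bucket/tbl");
        System.out.println(folder.newDataLocation("dt=2019-01-01", "f.parquet"));
        System.out.println(hashed.newDataLocation("dt=2019-01-01", "f.parquet"));
    }
}
```

Both implementations receive the same logical inputs, so a custom consumer could supply its own provider to place files wherever its system expects them without any extra field on DataFile.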

Superseded by #55.

apache / iceberg

DataFile External Identifier Field #23