delta-io / delta

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs
https://delta.io
Apache License 2.0

Provide sequence datatype to delta tables #242

Open parisni opened 4 years ago

parisni commented 4 years ago

One major limitation of Spark/Hive compared with a traditional data warehouse is the lack of auto-increment fields.

I wonder if Delta could achieve this. One idea would be to store the current max_value in the Delta JSON file and apply the monotonically_increasing_id function, offset by that value, when using merge or appending to Delta files.
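As a rough sketch of that idea, with plain Python standing in for Spark and the Delta log (`SequenceState` and `assign_ids` are hypothetical names, not Delta APIs): the table keeps the highest ID handed out so far, and each append numbers its rows starting just above it.

```python
from dataclasses import dataclass


@dataclass
class SequenceState:
    """Hypothetical stand-in for a max_value kept in the table's metadata."""
    max_value: int = 0


def assign_ids(state, rows):
    """Give each appended row a new ID above the stored maximum."""
    start = state.max_value + 1
    numbered = [{"id": start + i, **row} for i, row in enumerate(rows)]
    # Persist the new high-water mark so the next append starts above it.
    state.max_value += len(rows)
    return numbered


state = SequenceState()
first = assign_ids(state, [{"name": "a"}, {"name": "b"}])
second = assign_ids(state, [{"name": "c"}])
```

In Spark the per-row offset would presumably come from `monotonically_increasing_id()` rather than `enumerate`, so the IDs would be distinct and increasing but not gap-free, and the new maximum would have to be read back from the written data.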

I guess monotonically increasing values would be enough to meet the need for sequences (as opposed to truly sequential numbers).

More generally, sequence objects could be created and referenced within tables, as PostgreSQL does. An MVCC design could theoretically be applied to the concurrent creation of values, given that the number of rows to write is known in advance.

Is there any previous work in this area?

tdas commented 4 years ago

I don't think we have thought about this. This is a very hard problem because distributed generation of sequence numbers is fundamentally hard. In Spark, monotonically_increasing_id() generates monotonically increasing numbers, but not contiguous ones. Furthermore, the generated numbers are not designed to be stable across runs; for example, changes in the number of tasks, shuffle partitions, etc. will change the numbers. This makes the idea unsuitable for on-the-fly generation of unique row numbers.
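To illustrate why the IDs are not contiguous and shift with partitioning: Spark documents that the current implementation puts the partition ID in the upper 31 bits and the record number within each partition in the lower 33 bits. The plain-Python model below only mimics that layout; `simulated_ids` is a hypothetical helper, not a Spark function.

```python
def simulated_ids(partition_sizes):
    """Model the documented layout: (partition ID << 33) + record number in the partition."""
    return [
        (partition_id << 33) + record
        for partition_id, size in enumerate(partition_sizes)
        for record in range(size)
    ]


# The same four rows, split across two partitions in two different ways.
two_by_two = simulated_ids([2, 2])
one_and_three = simulated_ids([1, 3])
```

The second row gets ID 1 under the first split and 8589934592 under the second, so re-running a job with a different partitioning assigns different IDs to the same rows.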

parisni commented 4 years ago

> In Spark, monotonically_increasing_id() generates monotonically increasing numbers, but not contiguous ones.

Not a problem: all a sequence needs is distinct IDs, possibly increasing or decreasing.

> Furthermore, the generated numbers are not designed to be stable across runs; for example, changes in the number of tasks, shuffle partitions, etc. will change the numbers.

No need for reproducible sequences. Primary keys are, by definition, set once and never changed. Also, writes to a table with a sequence could be performed sequentially rather than concurrently, which would simplify the problem a lot.
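A minimal sketch of that simplification, assuming a lock can stand in for whatever would serialize writers on a real table. `SequenceAllocator` is a hypothetical name, not a Delta or Spark API, and only the ID reservation is serialized here, not the whole write.

```python
import threading


class SequenceAllocator:
    """Hypothetical allocator: writers reserve ID blocks one at a time."""

    def __init__(self, start=1):
        self._next = start
        self._lock = threading.Lock()

    def reserve(self, count):
        """Atomically reserve `count` IDs and return them as a range."""
        with self._lock:
            first = self._next
            self._next += count
        return range(first, first + count)


allocator = SequenceAllocator()
blocks = []  # list.append is thread-safe in CPython


def writer(row_count):
    # Each writer knows its row count up front and reserves that many IDs.
    blocks.append(allocator.reserve(row_count))


threads = [threading.Thread(target=writer, args=(n,)) for n in (3, 5, 2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

all_ids = sorted(i for block in blocks for i in block)
```

Whichever writer reaches the lock first gets the lowest block, so the IDs are unique but their assignment to writers is not reproducible, which matches the point above that reproducibility is not required.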

Josh-BI-UK commented 6 days ago

Hi Delta Team,

Has there been any update on whether the sequence datatype feature for Delta tables will be progressed? This would be incredibly useful for ensuring unique, sequential IDs across distributed processes, similar to how sequences work in traditional SQL databases. Any insights or timelines would be greatly appreciated!

Mega luv Josh!