ClickHouse / ClickHouse

ClickHouse® is a real-time analytics DBMS
https://clickhouse.com
Apache License 2.0
37.55k stars 6.9k forks source link

RFC: UUID 2.0 data types #63179

Closed rschu1ze closed 6 months ago

rschu1ze commented 6 months ago

ClickHouse supports UUIDs through a UUID data type and various utility functions to generate and convert UUIDs.

It has been noted that UUIDs in ClickHouse have no intuitive sort order, instead they are sorted by their right half. This makes UUIDs unsuitable/dangerous as sorting or primary index keys or partition keys. The reason for this behavior is historical: They are internally represented as a UInt128 (2 x 64 bit) composite integer (code), with the halves in big endian order (code).

The current UUID type also has the disadvantage that it treats all UUID versions equal (v1-v5 are standardized, v6-v8 are being standardized). This was okay in the past when ClickHouse only supported UUID version 4 but it makes it makes things difficult when we support version 7. More specifically,

  1. UUID-version-specific functions like UUIDv7ToDateTime cannot assume that the input is really in version 7 format.
  2. Generic UUID functions like empty (docs) may have different semantics for version 4 and version 7 UUIDs but there is currently no way to support that.

These problems can only addressed with a new UUID implementation:

UnamedRus commented 6 months ago

There is also problem with "awful" hash function, for UUID in uniq aggregate function https://github.com/ClickHouse/ClickHouse/issues/34425#issuecomment-1039074117

den-crane commented 6 months ago

The only concern is that Clickhouse is a DWH database and users may want to store uuids from different sources in the same column and also users may not know what their uuids are. Of course in this case they can use the old UUID. But I would consider the implementation of one new UUID type and name it nUUID or xUUID or ...

UnamedRus commented 6 months ago

I think, for new ClickHouse installations we need to use "new" variant. For older, users need to explicitly enable this setting.

alexey-milovidov commented 6 months ago

Original proposal:

Often we store UUIDs from external systems, and they could be both v4 and v7. There is already a problem that sorting is not as expected. I think the only way will be to introduce data types UUID1, UUID2 (or with better names) and when a user writes UUID in the table definition or while casting, it will be persisted as either UUID1 or UUID2, depending on a setting, and then we will enable UUID2 by default.

@rschu1ze for some reason your proposal is entirely different from the original.

Closing this because it does not make sense to have different data types for UUIDv4 and UUIDv7. It should be a single data type for UUID, but without these abominations.