Closed k-bx closed 1 week ago
@k-bx Are you referring to the uuid function? https://docs.databend.com/sql/sql-functions/uuid-functions/gen-random-uuid
@BohuTANG this is UUIDv4, not UUIDv7. UUIDv7 is great because it beings with datetime and grows monotonically, potentially being suitable as a perfect cluster key and datetime filter.
@k-bx Thanks for sharing the requirements. We've updated the UUID generation to use version 7 (v7) instead of v4 and have also updated our documentation : https://docs.databend.com/sql/sql-functions/uuid-functions/gen-random-uuid
@BohuTANG thank you for a quick implementation. Since you're here, can you address two questions related to the feature:
Thank you!
@BohuTANG thank you for a quick implementation. Since you're here, can you address two questions related to the feature:
- what's the easiest way to update my docker container to try this feature?
docker pull datafuselabs/databend:nightly
- how do I make sure that my data is effectively queried and clustered by both, ID (UUIDv4) and datetime? Most of my queries will have a filter by created_at column (datetime), and many will be by ID (id=XXX). What's the current suggested way to ensure Databend effectively queries both scenarios without going through full table? In theory you could extend query functions to extract created_at from the ID, but since it's not implemented, what is the recommented way?
@zhyass any suggestions?
Also, is storing UUID as VARCHAR a recommended approach? I think Parquet supports UUID and would use the storage more optimally if there would be a separate type for it.
@zhyass would appreciate the answer to the datetime + UUIDv4 effective query question. Thank you!
@k-bx Hello,
Also, is storing UUID as VARCHAR a recommended approach? I think Parquet supports UUID and would use the storage more optimally if there would be a separate type for it.
Yes, uuid can be stored as VARCHAR.
@zhyass would appreciate the answer to the datetime + UUIDv4 effective query question. Thank you!
For UUIDv4, it is recommended to define the cluster key as CLUSTER BY (created_at, id)
. You can adjust the granularity of created_at
according to your specific requirements by using datetime functions. For instance, you can use CLUSTER BY (TO_YYYYMMDD(created_at), id)
to adjust the granularity of created_at
to a daily level. However, due to the randomness of UUIDv4, filtering by id
might not be very effective.
For UUIDv7, and there is a correlation between id
and created_at
, you can define the cluster key as CLUSTER BY (id)
. Note that Databend only uses the first 8 bytes of a VARCHAR
for clustering, which corresponds to a time granularity of up to the minute level for UUIDv7.
Here is the cluster key
documentation:
https://docs.databend.com/sql/sql-commands/ddl/clusterkey/
@zhyass Thank you for the reply. However, I'd still want to clarify: for UUIDv7, if I do CLUSTER BY (id)
:
WHERE id=XXX
be efficient? (knowing which Parquet file this ID belongs to)WHERE created_at > XXX
be efficient? (knowing which Parquet files this created_at condition belongs to)E.g. does Databend understand in this case that data is effectively split by created_at
too?
Thank you in advance.
Hi, the prerequisite for using cluster by(id)
is that there is a correlation between id
and create_at
, i.e. the timestamp of id
and create_at
can be matched.
After defining the cluster key, you need to execute alter table xx recluster final
to do recluster on the table (this process may be more time-consuming). You can also define task to do recluster on a regular basis.
- will my queries
WHERE id=XXX
be efficient? (knowing which Parquet file this ID belongs to)
In case the table is well clustering, the data (id) in each Parquet file will have very low overlap, for WHERE id=XXX
databend can filter out most of the files based on the MinMax Index.
https://docs.databend.com/guides/performance/cluster-key
- will my queries
WHERE created_at > XXX
be efficient? (knowing which Parquet files this created_at condition belongs to)
Because of the correlation between id
and create_at
, WHERE created_at > XXX
can also be efficient.
Summary
Would be great to have UUIDv7 supported out of the box as the one that's "best of both worlds" in terms of big data IDs.