databendlabs / databend

𝗗𝗮𝘁𝗮, 𝗔𝗻𝗮𝗹𝘆𝘁𝗶𝗰𝘀 & 𝗔𝗜. Modern alternative to Snowflake. Cost-effective and simple for massive-scale analytics. https://databend.com
https://docs.databend.com
7.88k stars · 751 forks

Feature: UUID V7 #16818

Closed k-bx closed 1 week ago

k-bx commented 1 week ago

Summary

Would be great to have UUIDv7 supported out of the box, as it's the "best of both worlds" for big-data IDs.

BohuTANG commented 1 week ago

@k-bx Are you referring to the uuid function? https://docs.databend.com/sql/sql-functions/uuid-functions/gen-random-uuid

k-bx commented 1 week ago

@BohuTANG this is UUIDv4, not UUIDv7. UUIDv7 is great because it begins with a datetime component and grows monotonically, potentially making it a perfect cluster key and datetime filter.
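To make the "begins with a datetime" point concrete, here is an illustrative Python sketch (not Databend code) that builds a v7 value by hand with the random bits zeroed: the unix-millisecond timestamp occupies the leading 48 bits, so both the integer and the string form sort by creation time.

```python
import time
import uuid

def uuid7(ms=None):
    """Sketch of a UUIDv7: 48-bit unix-ms timestamp up front, then the
    version and variant bits; the random bits are zeroed for clarity."""
    if ms is None:
        ms = time.time_ns() // 1_000_000
    value = ms << 80          # timestamp fills the top 48 bits
    value |= 0x7 << 76        # version nibble = 7
    value |= 0x2 << 62        # RFC 4122 variant bits
    return uuid.UUID(int=value)

def uuid7_timestamp_ms(u):
    return u.int >> 80        # recover the leading timestamp

u = uuid7(1_700_000_000_000)
print(u)                      # 018bcfe5-6800-7000-8000-000000000000
print(uuid7_timestamp_ms(u))  # 1700000000000
```

A real generator fills the low 74 bits with randomness; the ordering property comes entirely from the timestamp prefix shown here.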

BohuTANG commented 1 week ago

@k-bx Thanks for sharing the requirements. We've updated the UUID generation to use version 7 (v7) instead of v4 and have also updated our documentation : https://docs.databend.com/sql/sql-functions/uuid-functions/gen-random-uuid

k-bx commented 1 week ago

@BohuTANG thank you for a quick implementation. Since you're here, can you address two questions related to the feature:

  1. what's the easiest way to update my docker container to try this feature?
  2. how do I make sure that my data is effectively queried and clustered by both ID (UUIDv4) and datetime? Most of my queries will filter by the created_at column (datetime), and many by ID (id=XXX). What's the currently suggested way to ensure Databend handles both scenarios efficiently without a full table scan? In theory you could extend query functions to extract created_at from the ID, but since that's not implemented, what is the recommended way?

Thank you!

BohuTANG commented 1 week ago

@BohuTANG thank you for a quick implementation. Since you're here, can you address two questions related to the feature:

  1. what's the easiest way to update my docker container to try this feature?

docker pull datafuselabs/databend:nightly

  2. how do I make sure that my data is effectively queried and clustered by both ID (UUIDv4) and datetime? Most of my queries will filter by the created_at column (datetime), and many by ID (id=XXX). What's the currently suggested way to ensure Databend handles both scenarios efficiently without a full table scan? In theory you could extend query functions to extract created_at from the ID, but since that's not implemented, what is the recommended way?

@zhyass any suggestions?

k-bx commented 5 days ago

Also, is storing a UUID as VARCHAR the recommended approach? I think Parquet supports UUID, and storage would be more efficient if there were a dedicated type for it.

@zhyass would appreciate the answer to the datetime + UUIDv4 effective query question. Thank you!

zhyass commented 4 days ago

@k-bx Hello,

Also, is storing UUID as VARCHAR a recommended approach? I think Parquet supports UUID and would use the storage more optimally if there would be a separate type for it.

Yes, UUIDs can be stored as VARCHAR.

@zhyass would appreciate the answer to the datetime + UUIDv4 effective query question. Thank you!

For UUIDv4, it is recommended to define the cluster key as CLUSTER BY (created_at, id). You can adjust the granularity of created_at to your requirements using datetime functions. For instance, CLUSTER BY (TO_YYYYMMDD(created_at), id) coarsens created_at to daily granularity. However, due to the randomness of UUIDv4, filtering by id might not be very effective.
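As a sketch of what that compound key does (illustrative Python, not Databend internals): coarsen created_at to a day number, then order by id within each day.

```python
from datetime import datetime, timezone

# Illustrative sort key mirroring CLUSTER BY (TO_YYYYMMDD(created_at), id):
# coarsen created_at to a day number, then order by id within each day.
def cluster_key(created_at, uid):
    return (int(created_at.strftime("%Y%m%d")), uid)

rows = [
    (datetime(2024, 5, 2, 9, 0, tzinfo=timezone.utc), "b-uuid"),
    (datetime(2024, 5, 1, 23, 59, tzinfo=timezone.utc), "z-uuid"),
    (datetime(2024, 5, 1, 0, 1, tzinfo=timezone.utc), "a-uuid"),
]
rows.sort(key=lambda r: cluster_key(*r))
print([uid for _, uid in rows])   # ['a-uuid', 'z-uuid', 'b-uuid']
```

Rows from the same day end up adjacent regardless of their random v4 ids, which is why created_at filters prune well while id point lookups do not.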

For UUIDv7, since there is a correlation between id and created_at, you can define the cluster key as CLUSTER BY (id). Note that Databend only uses the first 8 bytes of a VARCHAR for clustering, which for UUIDv7 corresponds to a time granularity of roughly the minute level.
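A quick way to see the minute-level claim (illustrative Python; uuid7_str is a hypothetical helper that keeps only the timestamp part): the first 8 hex characters of a UUIDv7 string are the top 32 of the 48 timestamp bits, so ids within the same 2^16 ms ≈ 65.5 s window share a clustering prefix.

```python
# Databend clusters on the first 8 bytes of a VARCHAR. For a UUIDv7 string
# those are the top 32 of the 48 timestamp bits, i.e. the timestamp
# truncated to 2**16 ms (~65.5 s) -- roughly minute granularity.
def uuid7_str(ms):
    # hypothetical helper: timestamp part only, random bits zeroed
    t = f"{ms:012x}"
    return f"{t[:8]}-{t[8:]}-7000-8000-000000000000"

def cluster_prefix(s):
    return s[:8]              # what the cluster key actually sees

a = uuid7_str(1_700_000_000_000)
b = uuid7_str(1_700_000_001_000)   # 1 s later: same ~65.5 s window
c = uuid7_str(1_700_000_090_000)   # 90 s later: next window
print(cluster_prefix(a) == cluster_prefix(b))   # True
print(cluster_prefix(a) == cluster_prefix(c))   # False
```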

Here is the cluster key documentation: https://docs.databend.com/sql/sql-commands/ddl/clusterkey/

k-bx commented 4 days ago

@zhyass Thank you for the reply. However, I'd still like to clarify: for UUIDv7, if I do CLUSTER BY (id):

  • will my queries WHERE id=XXX be efficient? (knowing which Parquet file this ID belongs to)

  • will my queries WHERE created_at > XXX be efficient? (knowing which Parquet files this created_at condition belongs to)

E.g. does Databend understand in this case that data is effectively split by created_at too?

Thank you in advance.

zhyass commented 3 days ago

Hi, the prerequisite for using CLUSTER BY (id) is that there is a correlation between id and created_at, i.e. the timestamp embedded in id matches created_at.

After defining the cluster key, you need to execute ALTER TABLE xx RECLUSTER FINAL to recluster the table (this process may be time-consuming). You can also define a task to recluster on a regular basis.

  • will my queries WHERE id=XXX be efficient? (knowing which Parquet file this ID belongs to)

If the table is well clustered, the ids in each Parquet file will have very low overlap, so for WHERE id=XXX Databend can filter out most of the files based on the MinMax index. https://docs.databend.com/guides/performance/cluster-key
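A conceptual sketch of that pruning (illustrative Python, not Databend's implementation): each file keeps a min/max of its id column, and a point lookup only opens files whose range could contain the id. Time-ordered v7-style ids give files disjoint ranges, while random v4-style ids make nearly every file's range cover the needle.

```python
import random

# Conceptual sketch, not Databend's implementation: each Parquet file keeps
# a min/max for its id column; a point lookup only opens files whose range
# could contain the id.
def files_to_scan(files, needle):
    return [i for i, (lo, hi) in enumerate(files) if lo <= needle <= hi]

def chunks(xs):
    return [xs[i:i + 10] for i in range(0, len(xs), 10)]

random.seed(0)

# v7-style ids arrive in time order, so files hold disjoint id ranges
v7_ids = [f"{t:012x}" for t in range(100)]
v7_files = [(c[0], c[-1]) for c in chunks(v7_ids)]

# v4-style ids are random, so every file's min/max spans most of the space
v4_ids = [f"{random.getrandbits(48):012x}" for _ in range(100)]
v4_files = [(min(c), max(c)) for c in chunks(v4_ids)]

print(len(files_to_scan(v7_files, v7_ids[5])))   # 1: pruning works
print(len(files_to_scan(v4_files, v4_ids[5])))   # many files: ranges overlap
```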

  • will my queries WHERE created_at > XXX be efficient? (knowing which Parquet files this created_at condition belongs to)

Because of the correlation between id and created_at, WHERE created_at > XXX can also be efficient.