fboulnois / pg_uuidv7

A tiny Postgres extension to create version 7 UUIDs
Mozilla Public License 2.0

Use C code from pgsql-hackers #15

Open x4m opened 1 year ago

x4m commented 1 year ago

This code has several advantages:

  1. Implements a counter per the standard draft. This counter protects against collisions better than random numbers when UUIDs are generated at high speed.
  2. Buffers randomness. This greatly improves generation speed. Try `select uuid_generate_v7() from generate_series(1,1E6);`
  3. Avoids pg_hton64(), because it does several unnecessary manipulations.
  4. Avoids initializing the timestamp part with randomness. This helps save some entropy. Make the world greener :)
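For readers following along, the construction under discussion can be sketched in C. This is a minimal illustration of the draft's field layout (now RFC 9562): a 48-bit big-endian Unix millisecond timestamp, a 4-bit version, and random data with the variant bits forced. `uuidv7_pack` is a hypothetical helper, not this extension's code:

```c
#include <stdint.h>
#include <string.h>

/* Illustrative sketch of the UUIDv7 layout (RFC 9562):
 * 48-bit big-endian Unix ms timestamp, 4-bit version (7),
 * 12 bits rand_a, 2-bit variant (0b10), 62 bits rand_b. */
static void uuidv7_pack(uint8_t out[16], uint64_t unix_ms,
                        const uint8_t rnd[10])
{
    /* Timestamp: most significant byte first (big-endian). */
    out[0] = (uint8_t)(unix_ms >> 40);
    out[1] = (uint8_t)(unix_ms >> 32);
    out[2] = (uint8_t)(unix_ms >> 24);
    out[3] = (uint8_t)(unix_ms >> 16);
    out[4] = (uint8_t)(unix_ms >> 8);
    out[5] = (uint8_t)(unix_ms);
    /* The remaining 10 bytes come from the random source. */
    memcpy(out + 6, rnd, 10);
    /* Force version 7 in the high nibble of byte 6. */
    out[6] = (out[6] & 0x0F) | 0x70;
    /* Force the variant bits (0b10) in byte 8. */
    out[8] = (out[8] & 0x3F) | 0x80;
}
```

The counter variants discussed below differ only in how the 10 random bytes are produced, not in this layout.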

The code is based on the PG patch https://commitfest.postgresql.org/43/4388/ . I have my own repo https://github.com/x4m/pg_uuid_next , but I do not have the energy to support it like you do. So I decided to bring some code from there to your implementation.

Have a nice day :)

fboulnois commented 1 year ago

@x4m , thanks for your PR! I also read your thread in the pgsql-hackers list.

A few thoughts:

  1. As I outline in my comments on the RFC, I don't really like the alternate pseudorandom constructions for version 7 UUIDs. From a philosophical perspective, one of the goals of this extension is to keep code as simple as possible so it is easy to reason through, and I think the alternate constructions add too much surface area to get things wrong. From a technical perspective, the counter uses up bits in the random data but I am not sure the tradeoff is worthwhile. With 72 bits of randomness you can generate billions of UUIDs every ms without collisions.
  2. This is smart! However, I wonder if the cache should live in the pg_strong_random layer instead? This is what other high quality cryptography libraries seem to do. That way, this would benefit every Postgres instance and application in the world. Talk about making the world greener!
  3. pg_hton64 is ultimately an alias for bswap64, which will get compiled on supported architectures as a single instruction. Are there benchmarks that show that manual bit swapping is faster?
  4. The truth is that I didn't want to fill the UUID at an arbitrary location to avoid buffer overflow type issues, but this is fair to point out.
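On point 3, one way to check that the two approaches produce identical bytes is a quick equivalence test. `swap64` and `store48_manual` below are illustrative stand-ins, not Postgres internals; on a little-endian host `pg_hton64` reduces to such a swap, which compilers recognize and emit as a single bswap instruction:

```c
#include <stdint.h>

/* Portable byte swap; compilers turn this into a single bswap
 * instruction on supported architectures. */
static uint64_t swap64(uint64_t x)
{
    x = ((x & 0x00FF00FF00FF00FFULL) << 8)  | ((x >> 8)  & 0x00FF00FF00FF00FFULL);
    x = ((x & 0x0000FFFF0000FFFFULL) << 16) | ((x >> 16) & 0x0000FFFF0000FFFFULL);
    return (x << 32) | (x >> 32);
}

/* Manual placement: write a 48-bit value most significant byte first,
 * i.e. what the "avoid pg_hton64" variant does by hand. */
static void store48_manual(uint8_t dst[6], uint64_t ms)
{
    for (int i = 0; i < 6; i++)
        dst[i] = (uint8_t)(ms >> (40 - 8 * i));
}
```

Shifting the millisecond timestamp into the top 48 bits and swapping yields the same six bytes as the manual store, so the question really is only about benchmarks, not correctness.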

All in all, I'm unlikely to add 1 or 3. If I benchmark the cache in 2 and it provides a large performance improvement, I may implement something similar. 4 seems reasonable if I can verify and test there aren't any off-by-one errors that would lead to buffer overflows.

Cheers.

x4m commented 1 year ago
  1. The idea behind a counter is keeping serially generated UUIDs sorted. A randomly initialized counter still counts against global collisions, better prevents local collisions, and keeps data sorted for better ingestion speed. Indeed, 128 bits is a lot, enough to solve most uniqueness problems. But using some of these bits for a counter helps solve the uniqueness problem even better, while also providing better sortability.
  2. Postgres uses OpenSSL, though I did not benchmark the feature with this lib. When OpenSSL is absent, simply `f = open("/dev/urandom", O_RDONLY, 0);` consumes all the time...
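The buffering idea from point 2 can be sketched as a userspace cache over `/dev/urandom`: one large read amortizes the syscall cost across many small requests. This is a hypothetical sketch of the technique, not `pg_strong_random` or this extension's code:

```c
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical entropy cache: refill rarely, serve small slices often. */
#define RAND_CACHE_SIZE 4096

static uint8_t cache[RAND_CACHE_SIZE];
static size_t  cache_pos = RAND_CACHE_SIZE; /* start empty: forces first fill */
static long    refills;                     /* for observing the effect */

static int cached_random(uint8_t *out, size_t n)
{
    if (n > RAND_CACHE_SIZE)
        return -1;
    if (cache_pos + n > RAND_CACHE_SIZE) {
        int fd = open("/dev/urandom", O_RDONLY);
        if (fd < 0)
            return -1;
        ssize_t got = 0;
        while (got < RAND_CACHE_SIZE) {     /* handle short reads */
            ssize_t r = read(fd, cache + got, RAND_CACHE_SIZE - got);
            if (r <= 0) { close(fd); return -1; }
            got += r;
        }
        close(fd);
        cache_pos = 0;
        refills++;
    }
    memcpy(out, cache + cache_pos, n);
    cache_pos += n;
    return 0;
}
```

With 10-byte requests (the random tail of a UUIDv7), one 4 KiB refill serves around 400 generations, which is where the bulk of the speedup comes from.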
fboulnois commented 1 year ago

I'm still undecided on implementing something similar to 2. After several benchmarks the performance gain is about 10% on a Postgres install with OpenSSL. This is pretty good, but I still feel like the functionality probably shouldn't live in this extension. I'm going to continue considering it, however.

sergeyprokhorenko commented 5 months ago

@fboulnois Could you please publish Andrey Borodin's (@x4m) extension? Many people have been waiting for his patch, but unfortunately the patch was recently suspended because of the feature freeze for PostgreSQL 17, just before the final approval of the new RFC by its authors.

fboulnois commented 5 months ago

> @fboulnois Could you please publish Andrey Borodin's (@x4m) extension? Many people have been waiting for his patch, but unfortunately the patch was recently suspended because of the feature freeze for PostgreSQL 17, just before the final approval of the new RFC by its authors.

At this point, the only substantive difference between x4m's patch and my extension is the alternate pseudorandom construction, which I've already discussed in https://github.com/fboulnois/pg_uuidv7/pull/15#issuecomment-1722562079. Cheers!

x4m commented 5 months ago

I think @sergeyprokhorenko meant exactly this: let's add a counter or microseconds. The thing is, if you do `select i, gen_uuid_v7() id from generate_series(1,1000) i`, the order of i will differ from the order of id. This is not super critical, but it actually makes life better in many scenarios.

sergeyprokhorenko commented 5 months ago

I meant adding a counter 18 bits long, initialized every millisecond with a random number. The most significant bit of the counter is initialized to zero, to prevent the counter from overflowing in case the most significant bits of the random number are all ones. All advanced implementations use a counter.

This promotes sortability, additional monotonicity, and better database performance. It can also make it easier to find bugs, by making it possible to identify the latest version of a record. The name of the function with a counter could be the common uuidv7().
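A minimal sketch of the scheme described above, assuming an 18-bit counter field whose top bit starts at zero on every millisecond tick (`next_counter` is a hypothetical helper, not code from either implementation):

```c
#include <stdint.h>

/* Sketch of the proposed 18-bit counter. On a new millisecond the
 * counter restarts at a random 17-bit value (top bit zero), leaving
 * at least 2^17 = 131072 increments before the 18-bit field could
 * overflow within that millisecond. */
static uint64_t last_ms = 0;
static uint32_t counter = 0;  /* 18-bit value, stored in 32 bits */

static uint32_t next_counter(uint64_t now_ms, uint32_t random_bits)
{
    if (now_ms != last_ms) {
        last_ms = now_ms;
        /* Keep 17 random bits; the 18th (most significant) starts at 0. */
        counter = random_bits & 0x1FFFF;
    } else {
        counter = (counter + 1) & 0x3FFFF;  /* stay within 18 bits */
    }
    return counter;
}
```

The random restart each millisecond preserves unguessability across ticks, while increments within a tick keep same-millisecond UUIDs in generation order.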

fboulnois commented 3 months ago

Different projects use different counter lengths. Why use 18? I see other projects use 14, 16, 24, 40, 42, and 48 bits for the counter size.

x4m commented 3 months ago

Just a sane number: 125 MHz without overflow, a tradeoff between max frequency and predictability. The detailed explanation was in the pgsql-hackers thread, AFAIR.

sergeyprokhorenko commented 3 months ago

> Different projects use different counter lengths. Why use 18? I see other projects use 14, 16, 24, 40, 42, and 48 bits for the counter size.

Too short a counter makes it possible for the counter to overflow under peak loads. This can be mitigated by using the timestamp as the counter in such a situation, but that is intuitively undesirable. On the other hand, a short counter theoretically allows faster UUID comparison, if UUIDs are compared over just the timestamp + counter segment (this is not the case now) rather than over the entire UUID length.

With 18 bits, counter overflow is guaranteed to be impossible if the high-order bit of the counter is initialized to zero every millisecond, and if the remaining counter bits are initialized to a random number.

Too long a counter reduces the random segment of the UUID, and therefore worsens the unguessability of the UUID against brute-force guessing. But a long counter theoretically allows fewer resources to be spent on random number generation.

RFC 9562 recommends that the counter SHOULD be at least 12 bits but no longer than 42 bits.

sergeyprokhorenko commented 3 months ago

UUIDv7 generation function in ClickHouse DBMS

https://clickhouse.com/docs/en/sql-reference/functions/uuid-functions#generateUUIDv7

https://github.com/ClickHouse/ClickHouse/pull/62852

For the sake of monotonicity, when several microservices access the UUIDv7 generator in parallel, a single UUIDv7 generator is used per server.

sergeyprokhorenko commented 3 months ago

> Just a sane number: 125 MHz without overflow, a tradeoff between max frequency and predictability. The detailed explanation was in the pgsql-hackers thread, AFAIR.

See https://www.postgresql.org/message-id/84D20D0F-B5FF-41CD-9F48-E282CE9FEC1D%40yandex-team.ru

and https://www.postgresql.org/message-id/F91948DD-500A-4A22-ABB9-5F4C59C28851%40yandex-team.ru

> The standard does not specify how big the counter should be. PFA patch with a 16-bit counter. Maybe it is worth doing an 18-bit counter - it would save us one byte of PRNG data. Currently we only take 2 bits out of the whole random byte.