The above spec doesn't concern itself with deterministic IDs, that is, IDs generated by hashing some value. The UUIDv3 and UUIDv5 specs are for deterministic IDs.
Incorporating deterministic IDs can be useful for dealing with https://github.com/MatrixAI/js-db/issues/1. Because we are exposing values as keys, and they are not encrypted at rest, it can be a good idea to hash those values so that we end up with a deterministic ID. This means we get the same ID representation and base encoding even when the source of the ID is deterministic data.
Note that it also supports the usecase of namespaces. Namespaces sound like the machine ID that we have, and one would imagine generating a namespaced random identifier. Note that the UUID specs state that they use MD5 or SHA1. It sounds like UUIDv5 supersedes any usage of UUIDv3. See this for more info: https://stackoverflow.com/a/28776880/582917
According to https://www.sohamkamani.com/uuid-versions-explained/ it would make sense to prefer UUIDv5 over UUIDv3.
And again, the main reason not to use UUIDv1 is because we want the higher amounts of entropy, which does mean our IDs are a lot bigger, or that our compound IDs are used instead.
We are probably not going to follow the UUID versions strictly, but instead create our IDs for our 4 separate usecases above and additional usecases as we see fit.
One of the things that would be nice is if we can configure whatever ID is generated to be a composition of different properties. But some things will be mutually exclusive.
The uuid library supports writing to a given buffer. However the buffer must be a typed array or a Node buffer, not an ArrayBuffer. This makes sense, since ArrayBuffer cannot be written to directly, you must use a typed array. In this case, it's always a Uint8Array which will be more portable than using Node buffers.
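For example, a minimal sketch of writing into a caller-supplied buffer, wrapping an ArrayBuffer in a Uint8Array view since uuid cannot write into the ArrayBuffer directly:
import * as uuid from 'uuid';

// uuid cannot write into a bare ArrayBuffer, so wrap it in a Uint8Array
// view and pass that as the output buffer
const ab = new ArrayBuffer(16);
const view = new Uint8Array(ab);
uuid.v4({}, view); // fills all 16 bytes in place
console.log(uuid.stringify(view));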
Instead of expecting a CSPRNG, you can ask the user to supply a random generator:
import * as uuid from 'uuid';

uuid.v4({
  // 16 random bytes (it's actually 16 random numbers)
  // could change to 16 random bytes since we want to be buffer/binary native
  // and that's how most CSPRNGs would work
  rng: () => [ 0x10, ... ]
});
The API of uuid in some situations makes use of an "array of numbers" as a representation of bytes. This seems like a vestigial feature from the olden days of lacking proper binary types. Now with typed arrays, we shouldn't need to support this representation. Since we're going to be using this with LevelDB in js-db, and that always uses buffers, we can focus entirely on buffer types instead of arrays of numbers.
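As a sketch of a buffer-native rng (assuming Node's crypto here), a Buffer is already indexable like an array of 16 numbers, so it can be returned directly:
import * as uuid from 'uuid';
import { randomBytes } from 'crypto';

// Buffer is a Uint8Array, so it satisfies the "16 random numbers"
// expectation without converting to a plain array
const id = uuid.v4({ rng: () => randomBytes(16) });
console.log(id);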
The text representation of UUIDs has dashes; this seems intended to help human readability.
So it seems that even this strict adherence is not necessary; the following shows equivalent base encodings:
Ascii: 3F2504E0-4F89-11D3-9A0C-0305E82C3301
Base64: 7QDBkvCA1+B9K/U0vrQx1A
Ascii85: 5:$Hj:Pf\4RLB9%kU\Lj
Seems like we would also be able to support multibase representations in this library too. https://github.com/multiformats/js-multiformats
Our preference I believe is base58btc, which means the encoding is a bit longer, but easier to double click and copy.
Here's an example of using multibase:
import * as uuid from 'uuid';
import { bases } from 'multiformats/basics';
import { base58btc } from 'multiformats/bases/base58';
import { base64 } from 'multiformats/bases/base64';

const idBuf = new Uint8Array(16);
uuid.v5(
  'abc',
  'c1b00cc0-25a3-11ec-be9b-033334f5bcba',
  idBuf
);
const idEnc = base58btc.encode(idBuf);
const idEnc2 = base64.encode(idBuf);
const ds = Object.values(bases).map((c) => c.decoder).reduce(
  // @ts-ignore
  (d1, d2) => d1.or(d2)
);
const idBuf_ = ds.decode(idEnc);
const b1 = Buffer.from(idBuf);
const b2 = Buffer.from(idBuf_);
console.log(b1.equals(b2));
console.log(idEnc);
console.log(idEnc2);
Note you can see here that we are using UUIDv5 to write a 16 byte UUID. This is then encoded as base58btc using multibase. Then to decode it, we combine all the base codecs together and reduce it into a combined decoder. This has a type error due to: https://github.com/multiformats/js-multiformats/issues/121 but it works fine.
If the base encoded string is not recognised, the exception is:
throw Error(`Unable to decode multibase string ${JSON.stringify(text)}, ${this.name} decoder only supports inputs prefixed with ${this.prefix}`)
Or:
throw Error('Can only multibase decode strings')
So any exception thrown here can be considered a parse error.
The multibase encoded string looks like:
zQMqkjqqpU32Ke7GWFCkTcK
Which has a much more constrained alphabet compared to base64, and we always have the z prefix in front to indicate base58btc, as opposed to other base encodings like base64 which would use m.
I wonder whether to integrate multibase directly here in ID, or leave it out for the user to integrate outside in case they are likely to use it elsewhere. This would ensure that this library focuses on binary representations and leaves the string encoding to the user.
One of the issues with using a CSPRNG, is that acquiring data may be an asynchronous operation. This would mean that the ID generation would be asynchronous as well.
Note that sometimes there are synchronous or asynchronous APIs, both are possible: https://nodejs.org/api/crypto.html#crypto_crypto_randombytes_size_callback
However the webcrypto API specifies a synchronous CSPRNG:
crypto.getRandomValues(typedArray)
crypto.randomUUID()
And our usage of node-forge would mean that we have synchronous and asynchronous forms:
import { random } from 'node-forge';

async function getRandomBytes(size: number): Promise<Buffer> {
  return Buffer.from(await random.getBytes(size), 'binary');
}

function getRandomBytesSync(size: number): Buffer {
  return Buffer.from(random.getBytesSync(size), 'binary');
}
So our random IDs should then be both synchronous and asynchronous. That is we may return the raw data or return a promise on the data.
Note that CSPRNGs and PRNGs are both random number generators; CSPRNGs are better than PRNGs. But this is all pluggable, and for PK's purposes we are going to always package with a CSPRNG.
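As a rough sketch (the names here are hypothetical, not the final API), both forms can sit on top of a pluggable random source:
type RandomSource = (size: number) => Uint8Array;
type RandomSourceAsync = (size: number) => Promise<Uint8Array>;

// Hypothetical helpers: the sync form returns the bytes directly,
// the async form returns a promise on the bytes
function generateIdSync(randomSource: RandomSource): Uint8Array {
  return randomSource(16);
}

async function generateId(randomSource: RandomSourceAsync): Promise<Uint8Array> {
  return await randomSource(16);
}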
I've been investigating the uuidv7 functionality. It is claimed to be in lexicographic order. However I'm not sure if this is the case when it is used as bytes without encoding.
When I used lexicographic-integer https://github.com/substack/lexicographic-integer, its byte representation was also in lexicographic order, so I didn't have to hex encode to get the lexicographic order, and leveldb supported buffer keys.
// lexi.pack(number) => Array<number> - this converts a number into an array of number byte format
// Buffer.from(Array<number>) => Buffer -> this converts the array of numbers into a binary buffer
Buffer.from(lexi.pack(100))
// the result is usable as a leveldb key
This was proven by several tests that I did in js-db. So I'm going to port over uuidv7 and test if its byte representation is still lexically ordered.
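A quick sanity check of that ordering property could look like this (assuming lexi.pack returns an array of byte values as in the snippet above, and esModuleInterop for the default import):
import lexi from 'lexicographic-integer';

// the packed byte representations compare in the same order as the numbers
const a = Buffer.from(lexi.pack(99));
const b = Buffer.from(lexi.pack(100));
const c = Buffer.from(lexi.pack(1000));
console.log(Buffer.compare(a, b) < 0); // true
console.log(Buffer.compare(b, c) < 0); // true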
Ok I've reviewed the UUIDv7 and UUIDv8 specs. It looks like the UUIDv8 spec would be the most suitable to handle all of our usecases, and it would allow us to flexibly define and change the ID structure.
The structure is:
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                          timestamp_32                         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|         timestamp_48          |  ver  |      time_or_seq      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|var|  seq_or_node  |                    node                   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                              node                             |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
And it is possible to potentially truncate the public key id to fit entirely in the uuidv8.
What IDs actually are in a shared context? In most cases you're always going to look up the node ID before you look up the individual IDs like the vault ID, unless the vault ID were to be preserved across keynodes. It doesn't seem like you need to preserve it entirely if the vault remotes are used.
So vault IDs may not need to be "shared" anymore.
I reckon claim IDs may still be, if we want to be able to refer to them.
Anyway, we can make it optional to fill in the node part of the UUID, or it is just CSPRNG-filled random data.
There's no reference implementation of UUIDv8 atm, so we will just have to start with the UUIDv7 reference implementation and port it over.
So here are some interesting issues with timestamps.
JavaScript has a performance.now() and performance.timeOrigin API.
This means:
import { performance } from 'perf_hooks';
// returns the number of milliseconds and microseconds in floating point
console.log(performance.now()); // 1206298.1301128864 => 1206298 milliseconds and 0.1301128864 milliseconds
// to convert to microseconds multiply by 1000
// 1206298130.1128864 microseconds
// 1206298130112.8864 nanoseconds
// 1206298130112886.4 picoseconds
// will be equivalent to Date.now() (but with extra precision of microseconds)
// there's not enough bit space to carry the nano seconds
console.log(performance.timeOrigin + performance.now());
However some browsers will limit this to 1 millisecond resolution. It seems Chrome and Node.js don't limit it. Also it seems there's way more than microseconds here; it seems to go all the way to picoseconds. I don't see any documentation describing why this is the case. I think we may have to stick with milliseconds to be portable across browsers for the library, or stick with microseconds if we stick with Node.js. But milliseconds might be a better idea since we can share the library later.
It may actually be because JS floating point numbers have limited bit space, and thus can only represent about 17 significant digits: 1633483847712.9163 and 1.3249872394872892 are both 17 digits. And thus when adding performance.timeOrigin + performance.now() we only get down to microseconds, not nanoseconds. JS also now has BigInt, but it's not part of this standard.
The usage of microseconds in a 64 bit timestamp would be able to reach 584542 years according to https://en.wikipedia.org/wiki/Microsecond: (2**64)/(1e6*60*60*24*365.25). Which is quite a lot of space, and unix timestamps start from 1970 anyway.
The performance.now() is apparently always monotonic. Let's see what the spec says:
The time values returned when calling the now() method on Performance objects with the same time origin MUST use the same monotonic clock that is monotonically increasing and not subject to system clock adjustments or system clock skew. The difference between any two chronologically recorded time values returned from the Performance.now() method MUST never be negative if the two time values have the same time origin.
You see that it says "MUST never be negative"; it doesn't say it's not allowed to be 0. Which means that it is monotonic, but not strictly monotonic, which would disallow 0. Meaning it is possible to have the same time returned. This is confirmed by running console.log(performance.now(), performance.now()); on Firefox, which returns the same time. But this gives us at least half of the strictly monotonic properties.
However by itself this is not enough; we want this to be aligned with "wall clock time" as well, so we have some rough idea of when these events themselves have happened. And if we want to align it, we need to add performance.timeOrigin onto it.
The time values returned when getting performance.timeOrigin MUST use the same global monotonic clock that is shared by time origins, is monotonically increasing and not subject to system clock adjustments or system clock skew, and whose reference point is the [ECMA-262] time definition - see § 7. Privacy and Security.
The spec says it is monotonic, but this doesn't seem to be the case; how would browsers or the Node runtime get a monotonic clock? Oh yeah, it is still bounded by when the process is restarted.
So I can imagine we save the performance.timeOrigin, or whatever the last time is when we created a counter. Then when we start again and read performance.timeOrigin, if we see that it is lower than the saved time, we use the saved time as the reference point instead. If it is the same time, then we may increment by plus 1 at the lowest resolution. That means the reference point is effectively the maximum of the saved time and performance.timeOrigin.
The usage of the sequence is interesting in that it makes the ID monotonic, but a slight tweak to the above algorithm would allow one to create a strictly monotonic clock all the time, and one could just use that.
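A minimal sketch of that idea (hypothetical names, not the library's actual implementation), where the caller persists the last timestamp across restarts and the reference point is clamped so time never goes backwards:
import { performance } from 'perf_hooks';

// lastTs would be persisted by the caller across process restarts;
// the "+1 at the lowest resolution when equal" tweak is left out here
function makeTimeSource(lastTs?: number): () => number {
  // if the saved time is ahead of our current origin, use it as the
  // reference point instead so we never go backwards
  const origin = Math.max(performance.timeOrigin, lastTs ?? 0);
  return () => origin + performance.now();
}

const now = makeTimeSource();
console.log(now(), now()); // milliseconds, monotonically non-decreasing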
See also: https://github.com/uuid6/uuid6-ietf-draft/issues/41#issuecomment-907517063
Another issue is the year 2038 problem: if we only use 32 bits and only for seconds (https://en.wikipedia.org/wiki/Year_2038_problem), then our timestamp would not work. Given that we intend to use milliseconds or even microseconds we are likely to run out even more quickly, so we would need a lot more bits. The UUIDv8 scheme allows 32 bits, then 16 bits, then 12 bits for a total of 60 bits. Then the seq_or_node provides another 8 bits. I reckon 60 bits should be enough to store milliseconds:
60 bits and 1e3 for milliseconds
(2**60)/(1e3*60*60*24*365.25)
= 36533877.8807 years
I reckon even 48 bits would be fine here.
48 bits and 1e3 for milliseconds
(2**48)/(1e3*60*60*24*365.25)
= 8919.40377946 years
If we use microseconds, then 48 bits is not enough, 60 bits is better.
60 bits with 1e6 microseconds
(2**60)/(1e6*60*60*24*365.25)
= 36533.8778807 years
If we use 64 bits for the timestamp, we have 4 bits left for the sequence; 4 bits gives 0 to 15, that is 16 numbers. If we use 60 bits, we have 8 bits left, that's 256 numbers. Anyway, milliseconds at 48 bits would mean we have 12 bits for the sequence, and the later 8 bits are just random or part of the node identifier.
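For instance, a 48 bit millisecond timestamp fits into the first 6 bytes of the ID; here's a minimal sketch of that byte packing with BigInt (illustrative only, not the spec's exact field layout):
// packs Date.now() (milliseconds) into 6 big-endian bytes, so the byte
// order sorts the same way as the numeric timestamp
function packTimestamp48(ms: number): Uint8Array {
  const bytes = new Uint8Array(6);
  let ts = BigInt(ms);
  for (let i = 5; i >= 0; i--) {
    bytes[i] = Number(ts & 0xffn);
    ts >>= 8n;
  }
  return bytes;
}

console.log(packTimestamp48(Date.now()));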
So for now I reckon we use:
I might have js-id wrap the uuid library so we can make use of UUIDv4 for fully random IDs and UUIDv5 for deterministic IDs, and end up implementing UUIDv8 here, or a variant of UUIDv7 from the uuidv7 reference impl https://github.com/kripod/uuidv7. Then abstract their APIs to suit PK usage, such as focusing on binary forms, and then applying multibase on top or expecting the user to do this.
Note that webcrypto is only available on Node v15 and above. The LTS version of Node.js is still v14 and doesn't support it, and our @types/node is on v14 as well. Otherwise we would be able to:
import { webcrypto } from 'crypto';
const data = new Uint8Array(8);
webcrypto.getRandomValues(data);
In browsers, this is just crypto.getRandomValues(data); because crypto already just exists as the webcrypto object.
A simple feature test would be to test if crypto exists as an object using globalThis, which is standard across all platforms.
// this is true for Node and Browsers, Node has its crypto object it's not webcrypto though
'crypto' in globalThis
// on node v15 and greater
crypto.webcrypto.getRandomValues
So best test for whether we can just use the webcrypto API would be:
// globalThis is expected to exist, but if not then use globalThis?.crypto
if (typeof globalThis.crypto?.getRandomValues === 'function') {
  // use crypto.getRandomValues();
} else {
  // make sure to bring in some default or something else to provide random values
}
Proposed API for @matrixai/id. We will use the generator and async generator interface. This conveys the idea that generating an ID is part of a generator context, which reduces the likelihood of confusing side effects, and ensures that one can pass the CSPRNG or clock source in from the beginning.
import id from '@matrixai/id';
const idGen = id({/* ...options */});
const id1 = idGen.next();
const id2 = idGen.next();
const id3 = idGen.next();
// if we pass in asynchronous CSPRNG or source of information
const idGenAsync = id({async: true, /* ...options */});
await idGenAsync.next();
await idGenAsync.next();
Alternatively we can use ES6 classes with new Id() to return something that supports both synchronous and asynchronous iteration like in https://stackoverflow.com/questions/28739745/how-to-make-an-iterator-out-of-an-es6-class/28744141
import Id from '@matrixai/id';
const idGen = new Id();
idGen.next();
await idGen.next();
// i'm not sure if it's possible to have a class with both synchronous and asynchronous iteration?
Then we can support 3 kinds of ids:
The generators will be independent from each other, so properties like monotonicity are not preserved across generators; one has to use the same generator instance, except during process restart where we would put in the last timestamp created. Since the timestamp is internal data, one might instead ask for the last ID generated, and if you pass that in, it will ensure that the next IDs are always ahead of the last ID.
Playing around with the protocols of iterable, async iterable, iterator and async iterator, I can see that it's not possible to have an object be both an async iterator and an iterator. The async iterator interface demands that next returns a promise, while the iterator interface demands that next returns a non-promise. It is however possible to implement both the iterable and async iterable interfaces, since these are 2 different symbols: Symbol.iterator and Symbol.asyncIterator.
It seems like our Id could be generic. That is:
class Id<T = Iterator<IDBuffer> | AsyncIterator<IDBuffer>> {
  next(): T {
  }
}
But it seems easier to just create 2 different classes to do this; plus the generator syntax simplifies a lot of this, so a function construction might be better.
Example usage:
function * id () {
  while (true) {
    const idData = new ArrayBuffer(16);
    uuid.v4({}, new Uint8Array(idData));
    yield idData;
  }
}

function toUUID(idData: ArrayBuffer): string {
  return uuid.stringify(new Uint8Array(idData));
}

function main () {
  const idGen = id();
  console.log(idGen.next());
  console.log(idGen.next());
  console.log(idGen.next());
  const u = idGen.next().value as ArrayBuffer;
  console.log(u);
  console.log(toUUID(u));
}

main();
The ArrayBuffer is just raw memory. This does mean that idGen.next() will give back { value, done }, and one has to extract the value all the time, which might be a bit annoying instead of always just using the value. That is, in our case, our generator will never return void; it will always keep generating, so it's more infinite. In that case, it might be better to have a function instead of a generator interface. The generators do however make it easier to get multiple IDs, like with for..of and for await...of. Maybe even spread, but limited, like [...take(idGen, 5)].
It seems like we would want something like idGen.get() to give us an ID if it exists, and if done, it should just fail by throwing an exception. That way you preserve the generator interface, but also add a simpler method to get the ID value directly, since we know it will never end.
function *take<T>(g: Iterator<T>, l: number): Generator<T> {
  for (let i = 0; i < l; i++) {
    const item = g.next();
    if (item.done) return;
    yield item.value;
  }
}

class IdSync implements IterableIterator<ArrayBuffer> {
  get(): ArrayBuffer {
    return this.next().value as ArrayBuffer;
  }
  next(): IteratorResult<ArrayBuffer, void> {
    const idData = new ArrayBuffer(16);
    uuid.v4({}, new Uint8Array(idData));
    return {
      value: idData,
      done: false
    };
  }
  [Symbol.iterator](): IterableIterator<ArrayBuffer> {
    return this;
  }
}

const idGen = new IdSync();
console.log([...take(idGen, 5)]);
The above creates a class that acts like a generator. But the generator type has return and throw, which we won't need here, so we have simplified the above. And that gives us the ability to create different kinds of "id generators" that fit the above properties.
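An asynchronous counterpart can be sketched the same way (hypothetical, assuming an async random source is injected), implementing AsyncIterableIterator instead:
type RandomSourceAsync = (size: number) => Promise<Uint8Array>;

class IdAsync implements AsyncIterableIterator<ArrayBuffer> {
  constructor(protected randomSource: RandomSourceAsync) {}
  async get(): Promise<ArrayBuffer> {
    return (await this.next()).value as ArrayBuffer;
  }
  async next(): Promise<IteratorResult<ArrayBuffer, void>> {
    const random = await this.randomSource(16);
    const idData = new ArrayBuffer(16);
    new Uint8Array(idData).set(random);
    return {
      value: idData,
      done: false
    };
  }
  [Symbol.asyncIterator](): AsyncIterableIterator<ArrayBuffer> {
    return this;
  }
}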
It turns out trying to use top level await to do something like
let randomSource: (size: number) => Uint8Array;
if (typeof globalThis.crypto?.getRandomValues === 'function') {
  randomSource = (size) => globalThis.crypto.getRandomValues(new Uint8Array(size));
} else {
  const crypto = await import('crypto');
  randomSource = (size) => crypto.randomBytes(size);
}

let timeSource;
if (typeof globalThis.performance?.now === 'function' && typeof globalThis.performance.timeOrigin === 'number') {
  // use globalThis.performance.now() and globalThis.performance.timeOrigin
} else {
  const { performance } = await import('perf_hooks');
  // use performance.now() and performance.timeOrigin
}
Is quite difficult.
It requires some big changes:
- type: "module" in package.json
- module: "es2022" and target: "es2020"
- changes to moduleResolution and also resolveJsonModule
- import x from './x.ts'; needing the file extension, which I thought we don't use because the extension may change
It's just not ready yet. But soon one day it can work.
So for now, the default randomSource is going to come from Node.js crypto instead.
Another thing is that libraries generally should leave browser bundling to the platform user. That keeps the library simple, so it doesn't have to include polyfills or assume too much, and one can always use different libraries for different platforms.
Also, exploring the webcrypto API, it appears acquiring random bytes is always synchronous. So in this case we will stick with a synchronous API here.
We can do this now:
import { IdRandom, IdDeterministic, IdSortable, utils } from '@matrixai/id';
The IdSortable is the last one I'm implementing. The IdRandom and IdDeterministic are both derived from the uuid library.
Right now we cannot really make this library isomorphic, so it is Node specific for now (until we figure out how to write isomorphic libraries, or browserify when we need it).
The APIs all return ArrayBuffer. But utils has utils.toUUID() which converts to the hex encoded UUID format.
Additionally we have the multibase library to allow encoding the ArrayBuffer subsequently in any multibase format, thus enabling multiple equivalent ID encodings, and perhaps a way to parse them back to the ArrayBuffer. Users are expected to use Uint8Array when referring to it.
When we use it in the PK DB, it will need to be wrapped as Buffer keys with Buffer.from(ab).
BTW apparently ArrayBuffer actually works for leveldb, it's just not documented: https://github.com/Level/level-js/issues/34#issuecomment-397811192. Might make type changes in js-db to test this, and then it will be easier to use js-id without converting.
After reviewing everything, I'm changing IdRandom to use UUIDv7. It's actually sufficient for what we want to do.
It uses 36 bits for unixts with seconds. This is (2**36)/(1*60*60*24*365.25) = 2177.58881334, which is the number of years beyond 1970. That means roughly the year 4147.
This is the structure then:
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                            unixts                             |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|unixts |         msec          |  ver  |          seq          |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|var|                           rand                            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                             rand                              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
I'm hitting a problem with how best to write 36 bits into the array buffer. Some notes here: https://github.com/uuid6/uuid6-ietf-draft/issues/11#issuecomment-937423331
It seems like it should be possible to left shift by 12 bits, assuming that we have 64 bit integers... but maybe we don't actually have 64 bit integers in JS.
> n = BigInt(1633579325)
1633579325n
> n
1633579325n
> n << 12
Uncaught TypeError: Cannot mix BigInt and other types, use explicit conversions
> n << 12n
6691140915200n
> n << 12n >> 12n
1633579325n
Ok I have a python script that I can use as reference. It seems my bigint idea above isn't working.
Ok so the first point of difference is the use of fixed point numbers instead of floating point numbers. This PDF explains how to convert a floating point number to a fixed point number:
http://ee.sharif.edu/~asic/Tutorials/Fixed-Point.pdf
This is used in the part of the script that converts f, the fractional part of a second, into subsec, which is done by:
f = get_fractional_part(t, subsec_decimal_digits)
subsec = round(f * (2 ** subsec_bits))
Where subsec_bits is 10 for ms resolution and subsec_decimal_digits is 3.
So f is something like 0.347, meaning 0.347 seconds; this was rounded to 3 digits from the nanosecond timestamp. This was then converted to 355 by round(0.347 * (2 ** 10)).
Why is subsec_bits 10 for ms? It's the number of binary digits after the "binary point" that is needed to represent ms.
The upstream issue (https://github.com/uuid6/uuid6-ietf-draft/issues/11#issue-824103978) says:
Multiply the fraction of sub-seconds by 2^n, where n is the number of bits that fits to the system time precision. If the precision is millisecond, n is equal to 10 (2^10 = 1,024). If the precision is microsecond, n is equal to 20 (2^20 = 1.048.576) and so on;
And also:
unixtime: a 90-bit field reserved for the Unix timestamp. This field is encoded as a fixed-point number in the form fixed<38,n>, where 38 is the number of bits before the binary point, and n is the number of bits after the binary point. The size n is arbitrary so that it can be defined by the system that generates the UUID. The bits not used in this field can be filled with a sequence counter and/or some random bits, for example. The timestamp can represent dates up to the year 10680 A.D. (2^38/60/60/24/365.25 + 1970).
So here we see that this is earlier than the latest draft of the spec. The idea is to use a fixed point number of the form <38, n>, where n is the number of bits reserved for the fractional portion.
The current state of the spec uses <36, n> instead, where it is <36, 12>, all together being 48 bits. Meaning we should use 12 as the subsec bit size.
I wonder what the advantage is of using a fixed point number here instead of just using the truncated subseconds? Maybe it's more accurate to do it this way?
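For reference, a sketch of that fractional conversion in TypeScript, mirroring the Python above with 12 subsec bits as in the current draft (the function names are just illustrative):
// converts the fractional part of a second into a 12 bit integer
function toSubsec(fraction: number, subsecBits: number = 12): number {
  // note: rounding can overflow past subsecBits when the fraction is very
  // close to 1, so a real implementation needs to clamp or carry
  return Math.round(fraction * 2 ** subsecBits);
}

// converts the 12 bit integer back into a fraction of a second
function fromSubsec(subsec: number, subsecBits: number = 12): number {
  return subsec / 2 ** subsecBits;
}

console.log(toSubsec(0.347)); // 1421
console.log(fromSubsec(toSubsec(0.347))); // ~0.3469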
And now I find the up to date reference implementation in python: https://github.com/uuid6/prototypes/blob/main/python/new_uuid.py.
And yes, they are going ahead now with 36 bits there, and with the fixed point number system for the subsecond sections.
The Python mechanism supports nanoseconds. Due to JS cross-platform limitations we are stuck with milliseconds for now, even though other parts of this system only work in Node.js for now (Node.js technically supports nanoseconds via hrtime).
Ok so in the new_uuid implementation, they rely on Python's ability to easily create bitstrings.
That is:
sec = 1633608789
unixts = f'{sec:032b}'
print(type(unixts)) # <class 'str'>
print(len(unixts)) # 32
The unixts is literally a string:
'01100001010111101110010001010101'
It is 32 characters long, and it is 0-padded from the left. The :032b basically creates a 32 character long string with 0 padding (without the 0 it is just empty spaces).
Python can use bit strings and turn them back into numbers with int(unixts, 2), which gives us back sec at 1633608789.
Therefore when they finally put together all the bits at:
UUIDv7_bin = unixts + subsec_a + uuidVersion + subsec_b + uuidVariant + subsec_seq_node
They are actually just concatenating strings together, and at the end it can be turned into an integer OR hex encoded:
UUIDv7_int = int(UUIDv7_bin, 2)
UUIDv7_hex = hex(int(UUIDv7_bin, 2))[2:]
In JS, we can achieve a similar thing with parseInt:
> parseInt('01100001010111101110010001010101', 2)
1633608789
So that seems a lot easier to do than having to fiddle with typed arrays.
However this reminded me of something: the bitset data structure that I've used before in js-resource-counter. The library is still maintained today as https://github.com/infusion/bitset.js and the concept of a data structure for bit arrays is described here: https://en.wikipedia.org/wiki/Bit_array. There are even faster libraries these days like https://github.com/SalvatorePreviti/roaring-node which use compressed bitmaps, but that's overkill for this library.
In the wiki article you can see that bit level manipulations can be done with bitwise ops against numbers; in this case the number acts like the "word" structure. It's a bit more esoteric as a proper understanding of the bitwise ops is needed.
I think it's sufficient to actually use bitstrings, we just need to create the equivalent of f'{sec:032b}' in JS. And the answer is here: https://stackoverflow.com/a/16155417/582917
(sec >>> 0).toString(2).padStart(32, '0')
The >>> 0 is only needed to properly deal with negative numbers; in our case we don't expect any negative numbers, but it's harmless.
We can do typed array as an optimisation later when we have this working correctly and develop the proper test cases.
To convert a bitstring back to an array of numbers that can be put into a typed array, we may need to chunk it up to 8 bits each and convert to a number.
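A sketch of that chunking (illustrative helpers, not necessarily identical to the library's own utilities):
// splits a string into fixed-size chunks, e.g. 8 characters each
function strChunks(str: string, size: number): Array<string> {
  const chunks: Array<string> = [];
  for (let i = 0; i < str.length; i += size) {
    chunks.push(str.slice(i, i + size));
  }
  return chunks;
}

// converts a bitstring (length a multiple of 8) into big-endian bytes
function bin2bytes(bin: string): Uint8Array {
  return new Uint8Array(strChunks(bin, 8).map((byte) => parseInt(byte, 2)));
}

const unixts = (1633608789 >>> 0).toString(2).padStart(32, '0');
console.log(bin2bytes(unixts)); // Uint8Array(4) holding the seconds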
Also the act of converting a number to a base 2 string outputs in big endian form. This makes sense as the 0 padding occurs on the left.
So we can go with bitstrings first and then later in the future optimise to using typedarrays directly acting as a bitmap (without bringing in newer libraries).
The spec has been updated here: https://github.com/uuid6/uuid6-ietf-draft/blob/master/draft-peabody-dispatch-new-uuid-format-02.txt
It's not published yet, but that is what should be followed.
There's a bug in the python reference implementation: https://github.com/uuid6/prototypes/issues/8
I'm now fixing that up in our JS port.
I've got a working test script now:
I've created a number of utility functions to help with the creation of IdSortable:
toUUID
bytes2hex
hex2bytes
bytes2bin
bin2bytes
dec2bin
dec2hex
strChunks
roundPrecise
toFixedPoint
fromFixedPoint
Additional utility functions may be used to encode and decode to multibase, but I'll leave that for later.
Tests have been added for all of the above.
It's working so far. Some more work is needed to get it ready for prod.
Regarding the last ID tracking, I reckon if the last ID is earlier than the current time origin, then we should only use it as the origin plus 1 if performance.now() returns 0.
One of the cool things I just realised is that because the timeSource is customisable, it's possible to create a logical clock (https://en.wikipedia.org/wiki/Logical_clock) as well, to create sortable IDs that don't correspond to real time. It's worthwhile to create a test to demonstrate this.
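For example, assuming the time source is just a function returning a timestamp number, a trivial logical clock could be plugged in like this (purely illustrative, not the actual IdSortable API):
// a Lamport-style counter: each call "ticks" forward by 1, so IDs sort
// by generation order rather than by wall clock time
function logicalClock(start: number = 0): () => number {
  let counter = start;
  return () => ++counter;
}

const tick = logicalClock();
console.log(tick(), tick(), tick()); // 1 2 3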
The IdSortable is implemented now; I made changes to the timeSource so that it only increments by 1 if performance.now() returns 0 when the lastId is greater than or equal to the current performance.timeOrigin.
TS accepts Buffer where functions require ArrayBuffer. However this actually causes problems. We need to throw an exception if the data passed in is not actually an ArrayBuffer. It's a bit of a silent error and difficult to debug, because new Uint8Array(...) when given a Node Buffer doesn't properly create a truncated view.
For now I'll leave this as documented.
Multibase encoding and decoding has been added in.
However due to a problem involving jsdom: https://github.com/facebook/jest/pull/8022 I've had to upgrade the jest testing facilities.
This may impact downstream PK, so we have to update TypeScript-Demo-Lib and PK related testing.
- "@types/jest": "^26.0.20",
+ "@types/jest": "^27.0.2",
"@types/node": "^14.14.35",
"@types/node-forge": "^0.10.4",
"@typescript-eslint/eslint-plugin": "^4.12.0",
@@ -31,9 +31,9 @@
"eslint": "^7.17.0",
"eslint-config-prettier": "^7.1.0",
"eslint-plugin-prettier": "^3.3.1",
- "jest": "^26.6.3",
+ "jest": "^27.2.5",
"prettier": "^2.2.1",
- "ts-jest": "^26.4.4",
+ "ts-jest": "^27.0.5",
@joshuakarp
This is published now: https://www.npmjs.com/package/@matrixai/id
All done and ready for integration into PK.
Specification
ID generation is used in many places in PK. But the IDs must have different properties depending on the usecase.
The properties we care about are:
IDs, compared to petnames, give us the secure and decentralized properties, but are not human-meaningful. Human-meaningful names can be generated by mapping to a mnemonic, but that is outside the scope of this for now.
There are roughly 4 usecases in PK:
To resolve decentralised vs centralised, rather than assuming that the machine ID should be embedded in the identifier, we would instead always expect the machine ID to be appended as a suffix. Appending is superior to prepending to ensure that sorting still works.
We can default to using 128 bit sizes, but allow the user to specify higher or smaller sizes.
We can use a default CSPRNG, but also allow users to submit a custom CSPRNG for random number generation.
To ensure monotonicity, we want to allow the external system to save a clock state and give it to us, so we can ensure that ids are always monotonic.
We may expect our IDs to be later encoded with multibase, so we should allow this library to be composed with multibase later.
Note that ID generation is different when it's meant to be backed by a public key. That is outside the scope of this library. These IDs are not public keys!
There are places in PK where we use https://github.com/substack/lexicographic-integer; in those cases we may keep using that instead of this library. However those are cases where we are truly trying to store a number, like the inode indexes in EFS. In the sigchain, what we really want is IdSortable.
Additional context
Tasks
- performance.now() and performance.timeOrigin APIs (make it browser possible by testing with a dynamic import?)
- IdRandom
- IdDeterministic
- IdSortable
- ~IdSortable~ - created our own tests
- ~ArrayBuffer~ - can't do this because it doesn't work in jest