The above spec doesn't concern itself with deterministic IDs, that is, IDs generated by hashing some value. The UUIDv3 and UUIDv5 specs are for deterministic IDs.
Incorporating deterministic IDs can be useful for dealing with https://github.com/MatrixAI/js-db/issues/1. Because we are exposing values as keys, and they are not encrypted at rest, it can be a good idea to hash those values so that we end up with a deterministic ID. This means we get the same ID representation and base encoding even when the source of the ID is deterministic data.
Note that it also supports the usecase of namespaces. Namespaces sound like the machine ID that we have, and one would imagine generating a namespaced random identifier. Note that the UUID specs state that they use MD5 or SHA1. It sounds like UUIDv5 supersedes any usage of UUIDv3. See this for more info: https://stackoverflow.com/a/28776880/582917
According to https://www.sohamkamani.com/uuid-versions-explained/ it would make sense to prefer UUIDv5 over UUIDv3.
And again, the main reason not to use UUIDv1 is because we want the higher amounts of entropy, which does mean our IDs are a lot bigger, or that our compound IDs are used instead.
We are probably not going to follow the UUID versions strictly, but instead create our IDs for our 4 separate usecases above and additional usecases as we see fit.
One of the things that would be nice is if we can configure whatever ID is generated to be a composition of different properties. But some things will be mutually exclusive.
The uuid library supports writing to a given buffer. However the buffer must be a typed array or a Node buffer, not an ArrayBuffer. This makes sense, since ArrayBuffer cannot be written to directly, you must use a typed array. In this case, it's always a Uint8Array which will be more portable than using Node buffers.
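For example, a minimal sketch of writing into a caller-supplied buffer, wrapping an ArrayBuffer in a Uint8Array view since uuid cannot write into the ArrayBuffer directly:
import * as uuid from 'uuid';

// uuid cannot write into a bare ArrayBuffer, so wrap it in a Uint8Array
// view and pass that as the output buffer
const ab = new ArrayBuffer(16);
const view = new Uint8Array(ab);
uuid.v4({}, view); // fills all 16 bytes in place
console.log(uuid.stringify(view));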
Instead of expecting a CSPRNG, you can ask the user to supply a random generator:
import * as uuid from 'uuid';

uuid.v4({
  // 16 random bytes (it's actually 16 random numbers)
  // could change to 16 random bytes since we want to be buffer/binary native
  // and that's how most CSPRNGs would work
  rng: () => [ 0x10, ... ]
});
The API of uuid in some situations makes use of an "array of numbers" as a representation of bytes. This seems like a vestigial feature from the olden days of lacking proper binary types. Now with typed arrays, we shouldn't need to support this representation. Since we're going to be using this with LevelDB in js-db, and that always uses buffers, we can focus entirely on buffer types instead of arrays of numbers.
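As a sketch of a buffer-native rng (assuming Node's crypto here), a Buffer is already indexable like an array of 16 numbers, so it can be returned directly:
import * as uuid from 'uuid';
import { randomBytes } from 'crypto';

// Buffer is a Uint8Array, so it satisfies the "16 random numbers"
// expectation without converting to a plain array
const id = uuid.v4({ rng: () => randomBytes(16) });
console.log(id);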
The text representation of UUIDs has dashes; this seems intended to help human readability.
So it seems that even this strict adherence is not necessary; the following shows equivalent base encodings:
Ascii: 3F2504E0-4F89-11D3-9A0C-0305E82C3301
Base64: 7QDBkvCA1+B9K/U0vrQx1A
Ascii85: 5:$Hj:Pf\4RLB9%kU\Lj
Seems like we would also be able to support multibase representations in this library too. https://github.com/multiformats/js-multiformats
Our preference I believe is base58btc, which means the encoding is a bit longer, but easier to double click and copy.
Here's an example of using multibase:
import * as uuid from 'uuid';
import { bases } from 'multiformats/basics';
import { base58btc } from 'multiformats/bases/base58';
import { base64 } from 'multiformats/bases/base64';

const idBuf = new Uint8Array(16);
uuid.v5(
  'abc',
  'c1b00cc0-25a3-11ec-be9b-033334f5bcba',
  idBuf
);
const idEnc = base58btc.encode(idBuf);
const idEnc2 = base64.encode(idBuf);
const ds = Object.values(bases).map((c) => c.decoder).reduce(
  // @ts-ignore
  (d1, d2) => d1.or(d2)
);
const idBuf_ = ds.decode(idEnc);
const b1 = Buffer.from(idBuf);
const b2 = Buffer.from(idBuf_);
console.log(b1.equals(b2));
console.log(idEnc);
console.log(idEnc2);
Note you can see here that we are using UUIDv5 to write a 16 byte UUID. This is then encoded as base58btc using multibase. Then to decode it, we combine all the base codecs together and reduce it into a combined decoder. This has a type error due to: https://github.com/multiformats/js-multiformats/issues/121 but it works fine.
If the base encoded string is not recognised, the exception is:
throw Error(`Unable to decode multibase string ${JSON.stringify(text)}, ${this.name} decoder only supports inputs prefixed with ${this.prefix}`)
Or:
throw Error('Can only multibase decode strings')
So any exception thrown here can be considered a parse error.
The multibase encoded string looks like:
zQMqkjqqpU32Ke7GWFCkTcK
Which has a much more constrained alphabet compared to base64, and we always have the z prefix in front to indicate base58btc, as opposed to other base encodings like base64 which would use m.
I wonder whether to integrate multibase directly here in ID, or leave it out for the user to integrate outside in case they are likely to use it elsewhere. This would ensure that this library focuses on binary representations and leaves the string encoding to the user.
One of the issues with using a CSPRNG, is that acquiring data may be an asynchronous operation. This would mean that the ID generation would be asynchronous as well.
Note that sometimes there are synchronous or asynchronous APIs, both are possible: https://nodejs.org/api/crypto.html#crypto_crypto_randombytes_size_callback
However the webcrypto API specifies a synchronous CSPRNG:
crypto.getRandomValues(typedArray)
crypto.randomUUID()
And our usage of node-forge would mean that we have synchronous and asynchronous forms:
import { random } from 'node-forge';

async function getRandomBytes(size: number): Promise<Buffer> {
  return Buffer.from(await random.getBytes(size), 'binary');
}

function getRandomBytesSync(size: number): Buffer {
  return Buffer.from(random.getBytesSync(size), 'binary');
}
So our random IDs should then be both synchronous and asynchronous. That is we may return the raw data or return a promise on the data.
Note that CSPRNGs and PRNGs are both random number generators; CSPRNGs are better than PRNGs. But this is all pluggable, and for PK's purposes we are going to always package with a CSPRNG.
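As a rough sketch (the names here are hypothetical, not the final API), both forms can sit on top of a pluggable random source:
type RandomSource = (size: number) => Uint8Array;
type RandomSourceAsync = (size: number) => Promise<Uint8Array>;

// Hypothetical helpers: the sync form returns the bytes directly,
// the async form returns a promise on the bytes
function generateIdSync(randomSource: RandomSource): Uint8Array {
  return randomSource(16);
}

async function generateId(randomSource: RandomSourceAsync): Promise<Uint8Array> {
  return await randomSource(16);
}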
I've been investigating the uuidv7 functionality. It is claimed to be in lexicographic order. However I'm not sure if this is the case when it is used as bytes without encoding.
When I used lexicographic-integer https://github.com/substack/lexicographic-integer, its byte representation was also in lexicographic order, so I didn't have to hex encode to get the lexicographic order, and leveldb supported buffer keys.
// lexi.pack(number) => Array<number> - this converts a number into an array of number byte format
// Buffer.from(Array<number>) => Buffer -> this converts the array of numbers into a binary buffer
Buffer.from(lexi.pack(100))
// the result is usable as a leveldb key
This was proven by several tests that I did in js-db. So I'm going to port over uuidv7 and test if its byte representation is still lexically ordered.
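A quick sanity check of that ordering property could look like this (assuming lexi.pack returns an array of byte values as in the snippet above, and esModuleInterop for the default import):
import lexi from 'lexicographic-integer';

// the packed byte representations compare in the same order as the numbers
const a = Buffer.from(lexi.pack(99));
const b = Buffer.from(lexi.pack(100));
const c = Buffer.from(lexi.pack(1000));
console.log(Buffer.compare(a, b) < 0); // true
console.log(Buffer.compare(b, c) < 0); // true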
Ok I've reviewed the UUIDv7 and UUIDv8 specs. It looks like the UUIDv8 spec would be the most suitable to handle all of our usecases, and it would allow us to flexibly define and change the ID structure.
The structure is:
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                          timestamp_32                         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|         timestamp_48          |  ver  |      time_or_seq      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|var|  seq_or_node  |                    node                   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                              node                             |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
And it is possible to potentially truncate the public key id to fit entirely in the uuidv8.
What IDs actually are in a shared context? In most cases you're always going to look up the node ID before you look up the individual IDs like the vault ID, unless the vault ID were to be preserved across keynodes. It doesn't seem like you need to preserve it entirely if the vault remotes are used.
So vault IDs may not need to be "shared" anymore.
I reckon claim IDs may still be, if we want to be able to refer to them.
Anyway, we can make it optional to fill in the node part of the UUID, or it is just CSPRNG-filled random data.
There's no reference implementation of UUIDv8 atm, so we will just have to start with the UUIDv7 reference implementation and port it over.
So here are some interesting issues with timestamps.
JavaScript has a performance.now() and performance.timeOrigin API.
This means:
import { performance } from 'perf_hooks';
// returns the number of milliseconds and microseconds in floating point
console.log(performance.now()); // 1206298.1301128864 => 1206298 milliseconds and 0.1301128864 milliseconds
// to convert to microseconds multiply by 1000
// 1206298130.1128864 microseconds
// 1206298130112.8864 nanoseconds
// 1206298130112886.4 picoseconds
// will be equivalent to Date.now() (but with extra precision of microseconds)
// there's not enough bit space to carry the nano seconds
console.log(performance.timeOrigin + performance.now());
However some browsers will limit this to 1 millisecond resolution. It seems Chrome and Node.js don't limit it. Also it seems there's way more than microseconds here; it seems to go all the way to picoseconds. I don't see any documentation describing why this is the case. I think we may have to stick with milliseconds to be portable across browsers for the library, or stick with microseconds if we stick with Node.js. But milliseconds might be a better idea since we can share the library later.
It may actually be because JS floating point numbers have limited bit space, and thus can only represent about 17 significant digits: 1633483847712.9163 and 1.3249872394872892 are both 17 digits. And thus when adding performance.timeOrigin + performance.now() we only get down to microseconds, not nanoseconds. JS also now has BigInt, but it's not part of this standard.
The usage of microseconds in a 64 bit timestamp would be able to reach 584542 years according to https://en.wikipedia.org/wiki/Microsecond: (2**64)/(1e6*60*60*24*365.25). Which is quite a lot of space, and unix timestamps start from 1970 anyway.
The performance.now() is apparently always monotonic. Let's see what the spec says:
The time values returned when calling the now() method on Performance objects with the same time origin MUST use the same monotonic clock that is monotonically increasing and not subject to system clock adjustments or system clock skew. The difference between any two chronologically recorded time values returned from the Performance.now() method MUST never be negative if the two time values have the same time origin.
You see that it says "MUST never be negative"; it doesn't say it's not allowed to be 0. Which means that it is monotonic, but not strictly monotonic, which would disallow 0. Meaning it is possible to have the same time returned. This is confirmed by running console.log(performance.now(), performance.now()); on Firefox, which returns the same time. But this gives us at least half of the strictly monotonic properties.
However by itself this is not enough; we want this to be aligned with "wall clock time" as well, so we have some rough idea of when these events themselves have happened. And if we want to align it, we need to add performance.timeOrigin onto it.
The time values returned when getting performance.timeOrigin MUST use the same global monotonic clock that is shared by time origins, is monotonically increasing and not subject to system clock adjustments or system clock skew, and whose reference point is the [ECMA-262] time definition - see § 7. Privacy and Security.
The spec says it is monotonic, but this doesn't seem to be the case; how would browsers or the Node runtime get a monotonic clock? Oh yeah, it is still bounded by when the process is restarted.
So I can imagine we save the performance.timeOrigin, or whatever the last time is when we created a counter. Then when we start again and read performance.timeOrigin, if we see that it is lower than the saved time, we use the saved time as the reference point instead. If it is the same time, then we may increment by plus 1 at the lowest resolution. That means the reference point is effectively the maximum of the saved time and performance.timeOrigin.
The usage of the sequence is interesting in that it makes the ID monotonic, but a slight tweak to the above algorithm would allow one to create a strictly monotonic clock all the time, and one could just use that.
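A minimal sketch of that idea (hypothetical names, not the library's actual implementation), where the caller persists the last timestamp across restarts and the reference point is clamped so time never goes backwards:
import { performance } from 'perf_hooks';

// lastTs would be persisted by the caller across process restarts;
// the "+1 at the lowest resolution when equal" tweak is left out here
function makeTimeSource(lastTs?: number): () => number {
  // if the saved time is ahead of our current origin, use it as the
  // reference point instead so we never go backwards
  const origin = Math.max(performance.timeOrigin, lastTs ?? 0);
  return () => origin + performance.now();
}

const now = makeTimeSource();
console.log(now(), now()); // milliseconds, monotonically non-decreasing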
See also: https://github.com/uuid6/uuid6-ietf-draft/issues/41#issuecomment-907517063
Another issue is the year 2038 problem: if we only use 32 bits and only for seconds (https://en.wikipedia.org/wiki/Year_2038_problem), then our timestamp would not work. Given that we intend to use milliseconds or even microseconds we are likely to run out even more quickly, so we would need a lot more bits. The UUIDv8 scheme allows 32 bits, then 16 bits, then 12 bits for a total of 60 bits. Then the seq_or_node provides another 8 bits. I reckon 60 bits should be enough to store milliseconds:
60 bits and 1e3 for milliseconds
(2**60)/(1e3*60*60*24*365.25)
= 36533877.8807 years
I reckon even 48 bits would be fine here.
48 bits and 1e3 for milliseconds
(2**48)/(1e3*60*60*24*365.25)
= 8919.40377946 years
If we use microseconds, then 48 bits is not enough, 60 bits is better.
60 bits with 1e6 microseconds
(2**60)/(1e6*60*60*24*365.25)
= 36533.8778807 years
If we use 64 bits for the timestamp, we have 4 bits left for the sequence; 4 bits gives 0 to 15, that is 16 numbers. If we use 60 bits, we have 8 bits left, that's 256 numbers. Anyway, milliseconds at 48 bits would mean we have 12 bits for the sequence, and the later 8 bits are just random or part of the node identifier.
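For instance, a 48 bit millisecond timestamp fits into the first 6 bytes of the ID; here's a minimal sketch of that byte packing with BigInt (illustrative only, not the spec's exact field layout):
// packs Date.now() (milliseconds) into 6 big-endian bytes, so the byte
// order sorts the same way as the numeric timestamp
function packTimestamp48(ms: number): Uint8Array {
  const bytes = new Uint8Array(6);
  let ts = BigInt(ms);
  for (let i = 5; i >= 0; i--) {
    bytes[i] = Number(ts & 0xffn);
    ts >>= 8n;
  }
  return bytes;
}

console.log(packTimestamp48(Date.now()));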
So for now I reckon we use:
I might have js-id wrap the uuid library so we can make use of UUIDv4 for fully random IDs and UUIDv5 for deterministic IDs, and end up implementing UUIDv8 here, or a variant of UUIDv7 from the uuidv7 reference impl https://github.com/kripod/uuidv7. Then abstract their APIs to suit PK usage, such as focusing on binary forms, and then applying multibase on top or expecting the user to do this.
Note that webcrypto is only available on Node v15 and above. The LTS version of Node.js is still v14 and doesn't support it, and our @types/node is on v14 as well. Otherwise we would be able to:
import { webcrypto } from 'crypto';
const data = new Uint8Array(8);
webcrypto.getRandomValues(data);
In browsers, this is just crypto.getRandomValues(data); because crypto already just exists as the webcrypto object.
A simple feature test would be to test if crypto exists as an object using globalThis, which is standard across all platforms.
// this is true for Node and Browsers, Node has its crypto object it's not webcrypto though
'crypto' in globalThis
// on node v15 and greater
crypto.webcrypto.getRandomValues
So best test for whether we can just use the webcrypto API would be:
// globalThis is expected to exist, but if not then use globalThis?.crypto
if (typeof globalThis.crypto?.getRandomValues === 'function') {
  // use crypto.getRandomValues();
} else {
  // make sure to bring in some default or something else to provide random values
}
Proposed API for @matrixai/id. We will use the generator and async generator interface. This conveys the idea that generating an ID is part of a generator context, which reduces the likelihood of confusing side effects, and ensures that one can pass the CSPRNG or clock source in from the beginning.
import id from '@matrixai/id';
const idGen = id({/* ...options */});
const id1 = idGen.next();
const id2 = idGen.next();
const id3 = idGen.next();
// if we pass in asynchronous CSPRNG or source of information
const idGenAsync = id({async: true, /* ...options */});
await idGenAsync.next();
await idGenAsync.next();
Alternatively we can use ES6 classes with new Id() to return something that supports both synchronous and asynchronous iteration like in https://stackoverflow.com/questions/28739745/how-to-make-an-iterator-out-of-an-es6-class/28744141
import Id from '@matrixai/id';
const idGen = new Id();
idGen.next();
await idGen.next();
// i'm not sure if it's possible to have a class with both synchronous and asynchronous iteration?
Then we can support 3 kinds of ids:
The generators will be independent from each other, so properties like monotonicity are not preserved across generators; one has to use the same generator instance, except during process restart where we would put in the last timestamp created. Since the timestamp is internal data, one might instead ask for the last ID generated, and if you pass that in, it will ensure that the next IDs are always ahead of the last ID.
Playing around with the protocols of iterable, async iterable, iterator and async iterator, I can see that it's not possible to have an object be both an async iterator and an iterator. The async iterator interface demands that next returns a promise, while the iterator interface demands that next returns a non-promise. It is however possible to implement both the iterable and async iterable interfaces, since these are 2 different symbols: Symbol.iterator and Symbol.asyncIterator.
It seems like our Id could be generic. That is:
class Id<T = Iterator<IDBuffer> | AsyncIterator<IDBuffer>> {
  next(): T {
  }
}
But it seems easier to just create 2 different classes to do this; plus the generator syntax simplifies a lot of this, so a function construction might be better.
Example usage:
function * id () {
  while (true) {
    const idData = new ArrayBuffer(16);
    uuid.v4({}, new Uint8Array(idData));
    yield idData;
  }
}

function toUUID(idData: ArrayBuffer): string {
  return uuid.stringify(new Uint8Array(idData));
}

function main () {
  const idGen = id();
  console.log(idGen.next());
  console.log(idGen.next());
  console.log(idGen.next());
  const u = idGen.next().value as ArrayBuffer;
  console.log(u);
  console.log(toUUID(u));
}

main();
The ArrayBuffer is just raw memory. This does mean that idGen.next() will give back { value, done }, and one has to extract the value all the time, which might be a bit annoying instead of always just using the value. That is, in our case, our generator will never return void; it will always keep generating, so it's more infinite. In that case, it might be better to have a function instead of a generator interface. The generators do however make it easier to get multiple IDs, like with for..of and for await...of. Maybe even spread, but limited, like [...take(idGen, 5)].
It seems like we would want something like idGen.get() to give us an ID if it exists, and if done, it should just fail by throwing an exception. That way you preserve the generator interface, but also add a simpler method to get the ID value directly, since we know it will never end.
function *take<T>(g: Iterator<T>, l: number): Generator<T> {
  for (let i = 0; i < l; i++) {
    const item = g.next();
    if (item.done) return;
    yield item.value;
  }
}

class IdSync implements IterableIterator<ArrayBuffer> {
  get(): ArrayBuffer {
    return this.next().value as ArrayBuffer;
  }
  next(): IteratorResult<ArrayBuffer, void> {
    const idData = new ArrayBuffer(16);
    uuid.v4({}, new Uint8Array(idData));
    return {
      value: idData,
      done: false
    };
  }
  [Symbol.iterator](): IterableIterator<ArrayBuffer> {
    return this;
  }
}

const idGen = new IdSync();
console.log([...take(idGen, 5)]);
The above creates a class that acts like a generator. But the generator type has return and throw, which we won't need here, so we have simplified the above. And that gives us the ability to create different kinds of "id generators" that fit the above properties.
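An asynchronous counterpart can be sketched the same way (hypothetical, assuming an async random source is injected), implementing AsyncIterableIterator instead:
type RandomSourceAsync = (size: number) => Promise<Uint8Array>;

class IdAsync implements AsyncIterableIterator<ArrayBuffer> {
  constructor(protected randomSource: RandomSourceAsync) {}
  async get(): Promise<ArrayBuffer> {
    return (await this.next()).value as ArrayBuffer;
  }
  async next(): Promise<IteratorResult<ArrayBuffer, void>> {
    const random = await this.randomSource(16);
    const idData = new ArrayBuffer(16);
    new Uint8Array(idData).set(random);
    return {
      value: idData,
      done: false
    };
  }
  [Symbol.asyncIterator](): AsyncIterableIterator<ArrayBuffer> {
    return this;
  }
}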
It turns out trying to use top level await to do something like
let randomSource: (size: number) => Uint8Array;
if (typeof globalThis.crypto?.getRandomValues === 'function') {
  randomSource = (size) => globalThis.crypto.getRandomValues(new Uint8Array(size));
} else {
  const crypto = await import('crypto');
  randomSource = (size) => crypto.randomBytes(size);
}

let timeSource;
if (typeof globalThis.performance?.now === 'function' && typeof globalThis.performance.timeOrigin === 'number') {
  // use globalThis.performance.now() and globalThis.performance.timeOrigin
} else {
  const { performance } = await import('perf_hooks');
  // use performance.now() and performance.timeOrigin
}
Is quite difficult.
It requires some big changes:
- type: "module" in package.json
- module: "es2022" and target: "es2020"
- changes to moduleResolution and also resolveJsonModule
- import x from './x.ts'; needing the file extension, which I thought we don't use because the extension may change
It's just not ready yet. But soon one day it can work.
So for now, the default randomSource is going to come from Node.js crypto instead.
Another thing is that libraries generally should leave browser bundling to the platform user. That keeps the library simple, so it doesn't have to include polyfills or assume too much, and one can always use different libraries for different platforms.
Also, exploring the webcrypto API, it appears acquiring random bytes is always synchronous. So in this case we will stick with a synchronous API here.
We can do this now:
import { IdRandom, IdDeterministic, IdSortable, utils } from '@matrixai/id';
The IdSortable is the last one I'm implementing. The IdRandom and IdDeterministic are both derived from the uuid library.
Right now we cannot really make this library isomorphic, so it is Node specific for now (until we figure out how to write isomorphic libraries, or browserify when we need it).
The APIs all return ArrayBuffer. But utils has utils.toUUID() which converts to the hex encoded UUID format.
Additionally we have the multibase library to allow encoding the ArrayBuffer subsequently in any multibase format, thus enabling multiple equivalent ID encodings, and perhaps a way to parse them back to the ArrayBuffer. Users are expected to use Uint8Array when referring to it.
When we use it in the PK DB, it will need to be wrapped as Buffer keys with Buffer.from(ab).
BTW apparently ArrayBuffer actually works for leveldb, it's just not documented: https://github.com/Level/level-js/issues/34#issuecomment-397811192. Might make type changes in js-db to test this, and then it will be easier to use js-id without converting.
After reviewing everything, I'm changing IdRandom to use UUIDv7. It's actually sufficient for what we want to do.
It uses 36 bits for unixts with seconds. This is (2**36)/(1*60*60*24*365.25) = 2177.58881334, which is the number of years beyond 1970. That means roughly the year 4147.
This is the structure then:
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                            unixts                             |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|unixts |         msec          |  ver  |          seq          |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|var|                           rand                            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                             rand                              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
I'm hitting a problem with how best to write 36 bits into the array buffer. Some notes here: https://github.com/uuid6/uuid6-ietf-draft/issues/11#issuecomment-937423331
It seems like it should be possible to left shift by 12 bits, assuming that we have 64 bit integers... but maybe we don't actually have 64 bit integers in JS.
> n = BigInt(1633579325)
1633579325n
> n
1633579325n
> n << 12
Uncaught TypeError: Cannot mix BigInt and other types, use explicit conversions
> n << 12n
6691140915200n
> n << 12n >> 12n
1633579325n
Ok I have a python script that I can use as reference. It seems my bigint idea above isn't working.
Ok so the first point of difference is the use of fixed point numbers instead of floating point numbers. This PDF explains how to convert a floating point number to a fixed point number:
http://ee.sharif.edu/~asic/Tutorials/Fixed-Point.pdf
This is used in the part of the script that converts f, the fractional part of a second, into subsec, which is done by:
f = get_fractional_part(t, subsec_decimal_digits)
subsec = round(f * (2 ** subsec_bits))
Where subsec_bits is 10 for ms resolution and subsec_decimal_digits is 3.
So f is something like 0.347, meaning 0.347 seconds; this was rounded to 3 digits from the nanosecond timestamp. This was then converted to 355 by round(0.347 * (2 ** 10)).
Why is subsec_bits 10 for ms? It's the number of binary digits after the "binary point" that is needed to represent ms.
The upstream issue (https://github.com/uuid6/uuid6-ietf-draft/issues/11#issue-824103978) says:
Multiply the fraction of sub-seconds by 2^n, where n is the number of bits that fits to the system time precision. If the precision is millisecond, n is equal to 10 (2^10 = 1,024). If the precision is microsecond, n is equal to 20 (2^20 = 1.048.576) and so on;
And also:
unixtime: a 90-bit field reserved for the Unix timestamp. This field is encoded as a fixed-point number in the form fixed<38,n>, where 38 is the number of bits before the binary point, and n is the number of bits after the binary point. The size n is arbitrary so that it can be defined by the system that generates the UUID. The bits not used in this field can be filled with a sequence counter and/or some random bits, for example. The timestamp can represent dates up to the year 10680 A.D. (2^38/60/60/24/365.25 + 1970).
So here we see that this is earlier than the latest draft of the spec. The idea is to use a fixed point number of the form <38, n>, where n is the number of bits reserved for the fractional portion.
The current state of the spec uses <36, n> instead, where it is <36, 12>, all together being 48 bits. Meaning we should use 12 as the subsec bit size.
I wonder what the advantage is of using a fixed point number here instead of just using the truncated subseconds? Maybe it's more accurate to do it this way?
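For reference, a sketch of that fractional conversion in TypeScript, mirroring the Python above with 12 subsec bits as in the current draft (the function names are just illustrative):
// converts the fractional part of a second into a 12 bit integer
function toSubsec(fraction: number, subsecBits: number = 12): number {
  // note: rounding can overflow past subsecBits when the fraction is very
  // close to 1, so a real implementation needs to clamp or carry
  return Math.round(fraction * 2 ** subsecBits);
}

// converts the 12 bit integer back into a fraction of a second
function fromSubsec(subsec: number, subsecBits: number = 12): number {
  return subsec / 2 ** subsecBits;
}

console.log(toSubsec(0.347)); // 1421
console.log(fromSubsec(toSubsec(0.347))); // ~0.3469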
And now I find the up to date reference implementation in python: https://github.com/uuid6/prototypes/blob/main/python/new_uuid.py.
And yes, they are going ahead now with 36 bits there, and with the fixed point number system for the subsecond sections.
The Python mechanism supports nanoseconds. Due to JS cross-platform limitations we are stuck with milliseconds for now, even though other parts of this system only work in Node.js for now (Node.js technically supports nanoseconds via hrtime).
Ok so in the new_uuid implementation, they rely on Python's ability to easily create bitstrings.
That is:
sec = 1633608789
unixts = f'{sec:032b}'
print(type(unixts)) # <class 'str'>
print(len(unixts)) # 32
The unixts is literally a string:
'01100001010111101110010001010101'
It is 32 characters long, and it is 0-padded from the left. The :032b basically creates a 32 character long string with 0 padding (without the 0 it is just empty spaces).
Python can use bit strings and turn them back into numbers with int(unixts, 2), which gives us back sec at 1633608789.
Therefore when they finally put together all the bits at:
UUIDv7_bin = unixts + subsec_a + uuidVersion + subsec_b + uuidVariant + subsec_seq_node
They are actually just concatenating strings together, and at the end it can be turned into an integer OR hex encoded:
UUIDv7_int = int(UUIDv7_bin, 2)
UUIDv7_hex = hex(int(UUIDv7_bin, 2))[2:]
In JS, we can achieve a similar thing with parseInt:
> parseInt('01100001010111101110010001010101', 2)
1633608789
So that seems a lot easier to do than having to fiddle with typed arrays.
However this reminded me of something: the bitset data structure that I've used before in js-resource-counter. The library is still maintained today as https://github.com/infusion/bitset.js and the concept of a data structure for bit arrays is described here: https://en.wikipedia.org/wiki/Bit_array. There are even faster libraries these days like https://github.com/SalvatorePreviti/roaring-node which use compressed bitmaps, but that's overkill for this library.
In the wiki article you can see that bit level manipulations can be done with bitwise ops against numbers; in this case the number acts like the "word" structure. It's a bit more esoteric as a proper understanding of the bitwise ops is needed.
I think it's sufficient to actually use bitstrings, we just need to create the equivalent of f'{sec:032b}' in JS. And the answer is here: https://stackoverflow.com/a/16155417/582917
(sec >>> 0).toString(2).padStart(32, '0')
The >>> 0 is only needed to properly deal with negative numbers; in our case we don't expect any negative numbers, but it's harmless.
We can do typed array as an optimisation later when we have this working correctly and develop the proper test cases.
To convert a bitstring back to an array of numbers that can be put into a typed array, we may need to chunk it up to 8 bits each and convert to a number.
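A sketch of that chunking (illustrative helpers, not necessarily identical to the library's own utilities):
// splits a string into fixed-size chunks, e.g. 8 characters each
function strChunks(str: string, size: number): Array<string> {
  const chunks: Array<string> = [];
  for (let i = 0; i < str.length; i += size) {
    chunks.push(str.slice(i, i + size));
  }
  return chunks;
}

// converts a bitstring (length a multiple of 8) into big-endian bytes
function bin2bytes(bin: string): Uint8Array {
  return new Uint8Array(strChunks(bin, 8).map((byte) => parseInt(byte, 2)));
}

const unixts = (1633608789 >>> 0).toString(2).padStart(32, '0');
console.log(bin2bytes(unixts)); // Uint8Array(4) holding the seconds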
Also the act of converting a number to a base 2 string outputs in big endian form. This makes sense as the 0 padding occurs on the left.
So we can go with bitstrings first and then later in the future optimise to using typedarrays directly acting as a bitmap (without bringing in newer libraries).
The spec has been updated here: https://github.com/uuid6/uuid6-ietf-draft/blob/master/draft-peabody-dispatch-new-uuid-format-02.txt
It's not published yet, but that is what should be followed.
There's a bug in the python reference implementation: https://github.com/uuid6/prototypes/issues/8
I'm now fixing that up in our JS port.
I've got a working test script now:
I've created a number of utility functions to help with the creation of IdSortable:
toUUID
bytes2hex
hex2bytes
bytes2bin
bin2bytes
dec2bin
dec2hex
strChunks
roundPrecise
toFixedPoint
fromFixedPoint
Additional utility functions may be used to encode and decode to multibase, but I'll leave that for later.
Tests have been added for all of the above.
It's working so far. Some more work is needed to get it ready for prod.
Regarding the last ID tracking, I reckon if the last ID is earlier than the current time origin, then we should only use it as the origin plus 1 if performance.now() returns 0.
One of the cool things I just realised is that because the timeSource is customisable, it's possible to create a logical clock (https://en.wikipedia.org/wiki/Logical_clock) as well, to create sortable IDs that don't correspond to real time. It's worthwhile to create a test to demonstrate this.
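For example, assuming the time source is just a function returning a timestamp number, a trivial logical clock could be plugged in like this (purely illustrative, not the actual IdSortable API):
// a Lamport-style counter: each call "ticks" forward by 1, so IDs sort
// by generation order rather than by wall clock time
function logicalClock(start: number = 0): () => number {
  let counter = start;
  return () => ++counter;
}

const tick = logicalClock();
console.log(tick(), tick(), tick()); // 1 2 3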
The IdSortable is implemented now; I made changes to the timeSource so that it only increments by 1 if performance.now() returns 0 when the lastId is greater than or equal to the current performance.timeOrigin.
TS accepts Buffer where functions require ArrayBuffer. However this actually causes problems. We need to throw an exception if the data passed in is not actually an ArrayBuffer. It's a bit of a silent error and difficult to debug, because new Uint8Array(...) when given a Node Buffer doesn't properly create a truncated view.
For now I'll leave this as documented.
Multibase encoding and decoding has been added in.
However due to a problem involving jsdom: https://github.com/facebook/jest/pull/8022 I've had to upgrade the jest testing facilities.
This may impact downstream PK, so we have to update TypeScript-Demo-Lib and PK related testing.
- "@types/jest": "^26.0.20",
+ "@types/jest": "^27.0.2",
"@types/node": "^14.14.35",
"@types/node-forge": "^0.10.4",
"@typescript-eslint/eslint-plugin": "^4.12.0",
@@ -31,9 +31,9 @@
"eslint": "^7.17.0",
"eslint-config-prettier": "^7.1.0",
"eslint-plugin-prettier": "^3.3.1",
- "jest": "^26.6.3",
+ "jest": "^27.2.5",
"prettier": "^2.2.1",
- "ts-jest": "^26.4.4",
+ "ts-jest": "^27.0.5",
@joshuakarp
This is published now: https://www.npmjs.com/package/@matrixai/id
All done and ready for integration into PK.
Specification
ID generation is used in many places in PK. But the IDs must have different properties depending on the usecase.
The properties we care about are:
IDs, compared to petnames, give us the secure and decentralized properties, but are not human-meaningful. Human-meaningful names can be generated by mapping to a mnemonic, but that is outside the scope of this for now.
There are roughly 4 usecases in PK:
To resolve decentralised vs centralised, rather than assuming that the machine ID should be embedded in the identifier, we would instead always expect the machine ID to be appended as a suffix. Appending is superior to prepending to ensure that sorting still works.
We can default to using 128 bit sizes, but allow the user to specify higher or smaller sizes.
We can use a default CSPRNG, but also allow users to submit a custom CSPRNG for random number generation.
To ensure monotonicity, we want to allow the external system to save a clock state and give it to us, so we can ensure that ids are always monotonic.
We may expect our IDs to be later encoded with multibase, so we should allow this library to be composed with multibase later.
Note that ID generation is different when it's meant to be backed by a public key. That is outside the scope of this library. These IDs are not public keys!
There are places in PK where we use https://github.com/substack/lexicographic-integer; in those cases we may keep using that instead of this library. However those are cases where we are truly trying to store a number, like the inode indexes in EFS. In the sigchain, what we really want is IdSortable.
Additional context
Tasks
- performance.now() and performance.timeOrigin APIs (make it browser possible by testing with a dynamic import?)
- IdRandom
- IdDeterministic
- IdSortable
- ~IdSortable~ - created our own tests
- ~ArrayBuffer~ - can't do this because it doesn't work in jest