MatrixAI / js-id

ID generation for JavaScript & TypeScript Applications
https://polykey.com
Apache License 2.0
10 stars 1 forks source link

Not all base encodings maintain lexicographic-order #7

Closed CMCDragonkai closed 2 years ago

CMCDragonkai commented 2 years ago

Specification

It turns out that not all base encodings maintain lexicographic-order. For example base64 and base58 have alphabets that do not preserve order.

After some random testing, I found that the existing test case fails very rarely: encoded multibase strings are lexically sortable.

I believe this applies to the base encoding but not to the binary string encoding.

Therefore that test doesn't make sense, it makes more sense to identify which base encodings are lexicographic order, and create separate tests for those.

In our README.md it will be important to state that some base encodings will lose the lexicographic order, and provide a selection of the encodings that will maintain order.

Additional context

Tasks

  1. [x] - Devise a good test to test lexicographic order
  2. [x] - Identify the base encodings that maintain order
  3. [x] - Fix README.md
  4. [x] - Ensure all PK uses of base encoded IdSortable should be using lexicographic order preserving base encodings if need be
CMCDragonkai commented 2 years ago

Here's an output of 1 run of the tests that used base58btc that didn't work.

```ts const before = [ 'zkcpMuSTH5Qyms6UNeUzEy', 'zkcpMuSTH5RhYRusw9Aos2', 'zkcpMuSTH5SVXZ7hEnTWYo', 'zkcpMuSTH5TFYUrYfaW8d5', 'zkcpMuSTH5U5amB67CJuEn', 'zkcpMuSTH5UoX55jgxLYJ6', 'zkcpMuSTH5VToqnRqZ5TDm', 'zkcpMuSTH5WDBUrciCYNne', 'zkcpMuSTH5WtGFmdCRb3tU', 'zkcpMuSTH5XhX2HQhaiV1C', 'zkcpMuSTH5YWRR29zmGbbB', 'zkcpMuSUGcAbYFfKACqKSf', 'zkcpMuSUGcBDZQLbRxVPLi', 'zkcpMuSUGcBz7LGXvvE62f', 'zkcpMuSUGcCpebDz2r93qg', 'zkcpMuSUGcDUExCi1WSxyD', 'zkcpMuSUGcEHoyquHbbMQh', 'zkcpMuSUGcEx9r2hU6rxgk', 'zkcpMuSUGcFggSo43U5AYu', 'zkcpMuSUGcGWvB4eoF47q6', 'zkcpMuSUGcHATR4Rb4Qm6D', 'zkcpMuSUGcHsLmRrJHpxvr', 'zkcpMuSUGcJboerZkgG2cW', 'zkcpMuSUGcKMfPt4PPn55G', 'zkcpMuSUGcLBcMxrsUa4A4', 'zkcpMuSUGcLxsyJUWqdDJv', 'zkcpMuSUGcMZAYyeXaDwqq', 'zkcpMuSUGcNPvRMEVmnQwV', 'zkcpMuSUGcP817AAoypmvr', 'zkcpMuSUGcPtGe231pX5yG', 'zkcpMuSUGcQbTcZGuH86hk', 'zkcpMuSUGcRKNDXS6jhvCc', 'zkcpMuSUGcSAN5wPhTYJVA', 'zkcpMuSUGcSpud3wKYfMq5', 'zkcpMuSUGcTXp6EZSSquSb', 'zkcpMuSUGcUKQ4wni8wTY1', 'zkcpMuSUGcV5VDut2nSCcZ', 'zkcpMuSUGcViPPW3knDtGc', 'zkcpMuSUGcWWpbQG3yHbEm', 'zkcpMuSUGcXErpiTPJZ3Ww', 'zkcpMuSUGcY1emU9UbMtzZ', 'zkcpMuSUGcYkaLTM7oTTiF', 'zkcpMuSUGcZYScihhBjBHP', 'zkcpMuSUGcaE1UnuPFAUQV', 'zkcpMu8yZEW76wabV3b3hp', 'zkcpMu8yZEWowitnvSHsT7', 'zkcpMu8yZEXdGUftrimrdn', 'zkcpMu8yZEYEfEBq21arQ4', 'zkcpMu8yZEZ2rZzHGdenWt', 'zkcpMu8yZEZtYbmLQ8rdDe', 'zkcpMu8yZEacbXxgzL9TkZ', 'zkcpMu8yZEbLbVt7qUiDAh', 'zkcpMu8yZEc34AvFsqmiPN', 'zkcpMu8yZEcg2rpRkkZNK7', 'zkcpMu8yZEdRsKBNpcKNLd', 'zkcpMu8yZEeDNG86DD9twf', 'zkcpMu8yZEf3oMtHXjj8q3', 'zkcpMu8yZEfhKkh272YNGB', 'zkcpMu8yZEgX1wAD1mAxwt', 'zkcpMu8yZEhC9EoSSKvzFy', 'zkcpMu8yZEhuHZ9zYsFnVv', 'zkcpMu8yZEiekjCnaQhtf8', 'zkcpMu8yZEjRxY1jyeBvEX', 'zkcpMu8yZEk8Km9v87osnH', 'zkcpMuSVG8v5HZtJMvBqNX', 'zkcpMuSVG8vqZ5HZ2oQ2oA', 'zkcpMuSVG8wdi2NFvMUKBx', 'zkcpMuSVG8xGK3kPdbKRed', 'zkcpMuSVG8y6BqbjKEHT5r', 'zkcpMuSVG8yoUT2qFwSy8F', 'zkcpMuSVG8zWzgwo83AP9P', 'zkcpMuSVG91DkfC6tDagfy', 'zkcpMuSVG925GvnqC1fkdr', 'zkcpMuSVG92fMrnAT72UJb', 'zkcpMuSVG93VYjec6Zu7pH', 'zkcpMuSVG94AnXqMsdgQKU', 'zkcpMuSVG94vScfTeFbkeG', 'zkcpMuSVG95ibzkYjhdsrH', 'zkcpMuSVG96PA4v88n6VRn', 'zkcpMuSVG97FNxp3Jx6bpX', 'zkcpMuSVG97wfXQgXmvCcV', 'zkcpMuSVG98exdYsFqKEbJ', 'zkcpMuSVG99KSPXmR2KJ7w', 'zkcpMuSWFffe3jRbzWLCBx', 'zkcpMuSWFfgGrnkzyryHgZ', 'zkcpMuSWFfh8uHbZNhKggy', 'zkcpMuSWFfhqzdV4Mre24h', 'zkcpMuSWFfiY8Ysy6Fk88N', 'zkcpMuSWFfjMy3qN6uzsoz', 'zkcpMuSWFfjz7yUobzjEJ3', 'zkcpMuSWFfksLia9Rcfzup', 'zkcpMuSWFfmZXuDi6YMSit', 'zkcpMuSWFfnEyHAHESWEWD', 'zkcpMuSWFfnwRkdJY8uDvK', 'zkcpMuSWFfofz5CUsb2v7D', 'zkcpMuSWFfpUUPUXYboULf', 'zkcpMuSWFfq8TvJyuSnzqy', 'zkcpMuSWFfqzQt7U5PTUYW', 'zkcpMuSWFfritht74m4uBv', 'zkcpMuSWFfsMjc1zgLsfoc' ]; const sorted = [ 'zkcpMu8yZEW76wabV3b3hp', 'zkcpMu8yZEWowitnvSHsT7', 'zkcpMu8yZEXdGUftrimrdn', 'zkcpMu8yZEYEfEBq21arQ4', 'zkcpMu8yZEZ2rZzHGdenWt', 'zkcpMu8yZEZtYbmLQ8rdDe', 'zkcpMu8yZEacbXxgzL9TkZ', 'zkcpMu8yZEbLbVt7qUiDAh', 'zkcpMu8yZEc34AvFsqmiPN', 'zkcpMu8yZEcg2rpRkkZNK7', 'zkcpMu8yZEdRsKBNpcKNLd', 'zkcpMu8yZEeDNG86DD9twf', 'zkcpMu8yZEf3oMtHXjj8q3', 'zkcpMu8yZEfhKkh272YNGB', 'zkcpMu8yZEgX1wAD1mAxwt', 'zkcpMu8yZEhC9EoSSKvzFy', 'zkcpMu8yZEhuHZ9zYsFnVv', 'zkcpMu8yZEiekjCnaQhtf8', 'zkcpMu8yZEjRxY1jyeBvEX', 'zkcpMu8yZEk8Km9v87osnH', 'zkcpMuSTH5Qyms6UNeUzEy', 'zkcpMuSTH5RhYRusw9Aos2', 'zkcpMuSTH5SVXZ7hEnTWYo', 'zkcpMuSTH5TFYUrYfaW8d5', 'zkcpMuSTH5U5amB67CJuEn', 'zkcpMuSTH5UoX55jgxLYJ6', 'zkcpMuSTH5VToqnRqZ5TDm', 'zkcpMuSTH5WDBUrciCYNne', 'zkcpMuSTH5WtGFmdCRb3tU', 'zkcpMuSTH5XhX2HQhaiV1C', 'zkcpMuSTH5YWRR29zmGbbB', 'zkcpMuSUGcAbYFfKACqKSf', 'zkcpMuSUGcBDZQLbRxVPLi', 'zkcpMuSUGcBz7LGXvvE62f', 'zkcpMuSUGcCpebDz2r93qg', 'zkcpMuSUGcDUExCi1WSxyD', 'zkcpMuSUGcEHoyquHbbMQh', 'zkcpMuSUGcEx9r2hU6rxgk', 'zkcpMuSUGcFggSo43U5AYu', 'zkcpMuSUGcGWvB4eoF47q6', 'zkcpMuSUGcHATR4Rb4Qm6D', 'zkcpMuSUGcHsLmRrJHpxvr', 'zkcpMuSUGcJboerZkgG2cW', 'zkcpMuSUGcKMfPt4PPn55G', 'zkcpMuSUGcLBcMxrsUa4A4', 'zkcpMuSUGcLxsyJUWqdDJv', 'zkcpMuSUGcMZAYyeXaDwqq', 'zkcpMuSUGcNPvRMEVmnQwV', 'zkcpMuSUGcP817AAoypmvr', 'zkcpMuSUGcPtGe231pX5yG', 'zkcpMuSUGcQbTcZGuH86hk', 'zkcpMuSUGcRKNDXS6jhvCc', 'zkcpMuSUGcSAN5wPhTYJVA', 'zkcpMuSUGcSpud3wKYfMq5', 'zkcpMuSUGcTXp6EZSSquSb', 'zkcpMuSUGcUKQ4wni8wTY1', 'zkcpMuSUGcV5VDut2nSCcZ', 'zkcpMuSUGcViPPW3knDtGc', 'zkcpMuSUGcWWpbQG3yHbEm', 'zkcpMuSUGcXErpiTPJZ3Ww', 'zkcpMuSUGcY1emU9UbMtzZ', 'zkcpMuSUGcYkaLTM7oTTiF', 'zkcpMuSUGcZYScihhBjBHP', 'zkcpMuSUGcaE1UnuPFAUQV', 'zkcpMuSVG8v5HZtJMvBqNX', 'zkcpMuSVG8vqZ5HZ2oQ2oA', 'zkcpMuSVG8wdi2NFvMUKBx', 'zkcpMuSVG8xGK3kPdbKRed', 'zkcpMuSVG8y6BqbjKEHT5r', 'zkcpMuSVG8yoUT2qFwSy8F', 'zkcpMuSVG8zWzgwo83AP9P', 'zkcpMuSVG91DkfC6tDagfy', 'zkcpMuSVG925GvnqC1fkdr', 'zkcpMuSVG92fMrnAT72UJb', 'zkcpMuSVG93VYjec6Zu7pH', 'zkcpMuSVG94AnXqMsdgQKU', 'zkcpMuSVG94vScfTeFbkeG', 'zkcpMuSVG95ibzkYjhdsrH', 'zkcpMuSVG96PA4v88n6VRn', 'zkcpMuSVG97FNxp3Jx6bpX', 'zkcpMuSVG97wfXQgXmvCcV', 'zkcpMuSVG98exdYsFqKEbJ', 'zkcpMuSVG99KSPXmR2KJ7w', 'zkcpMuSWFffe3jRbzWLCBx', 'zkcpMuSWFfgGrnkzyryHgZ', 'zkcpMuSWFfh8uHbZNhKggy', 'zkcpMuSWFfhqzdV4Mre24h', 'zkcpMuSWFfiY8Ysy6Fk88N', 'zkcpMuSWFfjMy3qN6uzsoz', 'zkcpMuSWFfjz7yUobzjEJ3', 'zkcpMuSWFfksLia9Rcfzup', 'zkcpMuSWFfmZXuDi6YMSit', 'zkcpMuSWFfnEyHAHESWEWD', 'zkcpMuSWFfnwRkdJY8uDvK', 'zkcpMuSWFfofz5CUsb2v7D', 'zkcpMuSWFfpUUPUXYboULf', 'zkcpMuSWFfq8TvJyuSnzqy', 'zkcpMuSWFfqzQt7U5PTUYW', 'zkcpMuSWFfritht74m4uBv', 'zkcpMuSWFfsMjc1zgLsfoc' ]; ```

The problem here is that zkcpMu8... is sorted before zkcpMuS.... The number 8 here is before letter S.

The base58btc alphabet indicates that it is in lexicographic order though...

https://github.com/multiformats/js-multiformats/blob/master/src/bases/base58.js#L6

  alphabet: '123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz'

There might be something I'm forgetting here.

CMCDragonkai commented 2 years ago

I'm a bit confused here, the base58btc alphabet is in lexicographic order, but the above is where the error occurred. It seems there's more to preserving lexicographic order than just having the right alphabet order. The link https://www.codeproject.com/Articles/5165340/Sortable-Base64-Encoding says that there's more to it. It could also be due to JS's sorting algorithm is doing something strange.

I'm going to try to decode those back to ids, to confirm that they are in fact the right order with buffer comparison and see what the difference is between base58btc and doing hex encoding which should preserve order.

CMCDragonkai commented 2 years ago

So after some tests, I've found that:

  1. base58btc seems already lexicographically sorted
  2. the ids being generated are the ones with wrong order... it starts as 1634787496.998 and then reverses at 1634787496 then goes to 1634787497.
  3. the sorted version is actually correctly sorted according to both buffer compare and binary string compare (which sort of means that base58btc is actually preserving order)
  4. why are the ids generated wrong though? Could there be an error extractTs?
CMCDragonkai commented 2 years ago

Ok there's a bug inside IdSortable that is not even about the base58 encoding atm.

Here's a script that after running some time, will eventually find an error:

import { extractTs, extractSeq } from './src/IdSortable';
import { IdSortable, utils } from './src';

const idGen = new IdSortable();

const findTheProblem = () => {
  let count = 100;
  while (count > 0) {
    const ids = [...utils.take(idGen, 100)];
    const ids_ = ids.slice();
    ids_.sort(Buffer.compare);
    for (let i = 0; i < 100; i++) {
      if (ids[i].toString() !== ids_[i].toString()) {
        console.log('FOUND AN INEQUALITY');
        console.log('COUNT', count);
        console.log('INDEX', i);
        console.log('ORIGINAL', ids[i]);
        console.log('SORTED', ids_[i]);
        console.log('ORIGINAL', extractTs(ids[i]), extractSeq(ids[i]));
        console.log('SORTED', extractTs(ids_[i]), extractSeq(ids_[i]));
        return [ids, ids_];
      }
    }
    count--;
  }
  return [[], []];
};

const [ids, ids_] = findTheProblem();

console.log(ids.map((id) => utils.toMultibase(id, 'base58btc')));
console.log(ids_.map((id) => utils.toMultibase(id, 'base58btc')));

In my case I found it after generating more than 900 ids.

The problem looks like this:

FOUND AN INEQUALITY
COUNT 91
INDEX 0
ORIGINAL IdInternal(16) [Uint8Array] [
    6, 23,  15,  31, 143, 248,
  112,  8, 181, 148, 192, 253,
  237, 19, 100, 103
]
SORTED IdInternal(16) [Uint8Array] [
    6, 23,  15, 31, 128,   0,
  112,  0, 173, 24,  21, 244,
  176, 94, 231, 60
]
UNIXTSBITS 000001100001011100001111000111111000
MSECBITS 111111111000
ORIGINAL 1634791928.998 8
UNIXTSBITS 000001100001011100001111000111111000
MSECBITS 000000000000
SORTED 1634791928 0

This shows that at index 0, the id provided by the IdSortable generator is providing a timestamp and millisecond output that is greater than the sorted version. The sorted Id comes from a different position, but it would have to be later than index 0.

As you can see the extracted TS in seconds and subseconds is 1634791928.998 for original, and then for the sorted is 1634791928. Which means the sorted one is actually more correct.

When extracting the TS, we are getting 0 bits. This means the decoding/extraction is correct. Somewhere during the encoding or generation of the ID results in generating an Id that is actually earlier according to its TS.

CMCDragonkai commented 2 years ago

Neither decoding nor encoding is the problem.

The problem is the toFixedPoint call.

See here are 2 ids generated one after another, and see how the encoded msecBits become all zeros.

SNEAKY {
  ts: 1634793748.9987211,
  unixts: 1634793748,
  msec: 4092,
  unixtsBits: '000001100001011100001111100100010100',
  msecBits: '111111111100',
  seqBits: '000000000110'
} {
  ts: 1634793748.9995031,
  unixts: 1634793748,
  msec: 4096,
  unixtsBits: '000001100001011100001111100100010100',
  msecBits: '000000000000',
  seqBits: '000000000000'
}
CMCDragonkai commented 2 years ago

An msec of 4096 results in 12 bits of 0. An msec of 4095 would be 12 bits of 1.

The toFixedPoint is returning the largest possible msec value which is 4096 derived from a TS of 1634793748.9995031. And this results in the msec setting back to all zeros.

To fix this, we have to fix the rounding occurring in toFixedPoint, so that 4096 cannot be returned. Or if 4096 is returned, then we would have to adjust the unixts accordingly.

The maximum msec that toFixedPoint should be returning is 4095. Beyond that, the unixts has to increment.

CMCDragonkai commented 2 years ago

This is now fixed. My tests show that base58btc does preserve the lexicographic sort order. So since we are using this in PK, this should be safe @tegefaulkes. Just make sure we don't use base64 encoding when we are using IdSortable in PK.

@joshuakarp this should be reminded to go into our reference documentation.