Closed CMCDragonkai closed 3 years ago
Here's an output of 1 run of the tests that used base58btc
that didn't work.
The problem here is that zkcpMu8...
is sorted before zkcpMuS...
. The number 8
here is before letter S
.
The base58btc alphabet indicates that it is in lexicographic order though...
https://github.com/multiformats/js-multiformats/blob/master/src/bases/base58.js#L6
alphabet: '123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz'
There might be something I'm forgetting here.
I'm a bit confused here, the base58btc alphabet is in lexicographic order, but the above is where the error occurred. It seems there's more to preserving lexicographic order than just having the right alphabet order. The link https://www.codeproject.com/Articles/5165340/Sortable-Base64-Encoding says that there's more to it. It could also be due to JS's sorting algorithm is doing something strange.
I'm going to try to decode those back to ids, to confirm that they are in fact the right order with buffer comparison and see what the difference is between base58btc and doing hex encoding which should preserve order.
So after some tests, I've found that:
base58btc
seems already lexicographically sorted1634787496.998
and then reverses at 1634787496
then goes to 1634787497
.extractTs
?Ok there's a bug inside IdSortable
that is not even about the base58 encoding atm.
Here's a script that after running some time, will eventually find an error:
import { extractTs, extractSeq } from './src/IdSortable';
import { IdSortable, utils } from './src';
const idGen = new IdSortable();
const findTheProblem = () => {
let count = 100;
while (count > 0) {
const ids = [...utils.take(idGen, 100)];
const ids_ = ids.slice();
ids_.sort(Buffer.compare);
for (let i = 0; i < 100; i++) {
if (ids[i].toString() !== ids_[i].toString()) {
console.log('FOUND AN INEQUALITY');
console.log('COUNT', count);
console.log('INDEX', i);
console.log('ORIGINAL', ids[i]);
console.log('SORTED', ids_[i]);
console.log('ORIGINAL', extractTs(ids[i]), extractSeq(ids[i]));
console.log('SORTED', extractTs(ids_[i]), extractSeq(ids_[i]));
return [ids, ids_];
}
}
count--;
}
return [[], []];
};
const [ids, ids_] = findTheProblem();
console.log(ids.map((id) => utils.toMultibase(id, 'base58btc')));
console.log(ids_.map((id) => utils.toMultibase(id, 'base58btc')));
In my case I found it after generating more than 900 ids.
The problem looks like this:
FOUND AN INEQUALITY
COUNT 91
INDEX 0
ORIGINAL IdInternal(16) [Uint8Array] [
6, 23, 15, 31, 143, 248,
112, 8, 181, 148, 192, 253,
237, 19, 100, 103
]
SORTED IdInternal(16) [Uint8Array] [
6, 23, 15, 31, 128, 0,
112, 0, 173, 24, 21, 244,
176, 94, 231, 60
]
UNIXTSBITS 000001100001011100001111000111111000
MSECBITS 111111111000
ORIGINAL 1634791928.998 8
UNIXTSBITS 000001100001011100001111000111111000
MSECBITS 000000000000
SORTED 1634791928 0
This shows that at index 0, the id provided by the IdSortable
generator is providing a timestamp and millisecond output that is greater than the sorted version. The sorted Id comes from a different position, but it would have to be later than index 0
.
As you can see the extracted TS in seconds and subseconds is 1634791928.998
for original, and then for the sorted is 1634791928
. Which means the sorted one is actually more correct.
When extracting the TS, we are getting 0
bits. This means the decoding/extraction is correct. Somewhere during the encoding or generation of the ID results in generating an Id that is actually earlier according to its TS.
Neither decoding nor encoding is the problem.
The problem is the toFixedPoint
call.
See here are 2 ids generated one after another, and see how the encoded msecBits
become all zeros.
SNEAKY {
ts: 1634793748.9987211,
unixts: 1634793748,
msec: 4092,
unixtsBits: '000001100001011100001111100100010100',
msecBits: '111111111100',
seqBits: '000000000110'
} {
ts: 1634793748.9995031,
unixts: 1634793748,
msec: 4096,
unixtsBits: '000001100001011100001111100100010100',
msecBits: '000000000000',
seqBits: '000000000000'
}
An msec
of 4096
results in 12 bits of 0. An msec
of 4095
would be 12 bits of 1.
The toFixedPoint
is returning the largest possible msec value which is 4096
derived from a TS of 1634793748.9995031
. And this results in the msec setting back to all zeros.
To fix this, we have to fix the rounding occurring in toFixedPoint
, so that 4096
cannot be returned. Or if 4096
is returned, then we would have to adjust the unixts accordingly.
The maximum msec that toFixedPoint
should be returning is 4095
. Beyond that, the unixts
has to increment.
This is now fixed. My tests show that base58btc
does preserve the lexicographic sort order. So since we are using this in PK, this should be safe @tegefaulkes. Just make sure we don't use base64 encoding when we are using IdSortable
in PK.
@joshuakarp this should be reminded to go into our reference documentation.
Specification
It turns out that not all base encodings maintain lexicographic-order. For example base64 and base58 have alphabets that do not preserve order.
After some random testing, I found that the existing test case fails very rarely:
encoded multibase strings are lexically sortable
.I believe this applies to the base encoding but not to the binary string encoding.
Therefore that test doesn't make sense, it makes more sense to identify which base encodings are lexicographic order, and create separate tests for those.
In our README.md it will be important to state that some base encodings will lose the lexicographic order, and provide a selection of the encodings that will maintain order.
Additional context
Tasks
IdSortable
should be using lexicographic order preserving base encodings if need be