Primitive conversions - Githubissues

CMCDragonkai commented 2 years ago

Specification

Sometimes the ID is used as a string, for example as a key in POJOs or ES6 Maps. In these cases, an ArrayBuffer is not easily turned into a string.

What we can do is make use of the ideas from https://javascript.info/object-toprimitive. This would let us convert the ID to primitives.

There are 2 main primitives: strings and numbers. I don't believe there is a proper numeric representation of Ids. This is because the Ids are 128 bits, and won't fit into a JS number. A BigInt can hold 128 bits (it is BigInt64Array elements that are limited to 64 bits), but a BigInt is not a number and cannot be mixed with numbers in arithmetic. Fitting an Id into a number would only be possible by truncating the 128 bits. All 128 bits could be stored in a new Float64Array(2), but that wouldn't really mean much, except perhaps by reading the first 64 bits as a floating point number (whose sign bit may make the number negative).

So for now, the number conversion can simply produce NaN. This matches ArrayBuffer, where +ab is NaN.

More useful is the string representation. The 2 hints that can lead to a string primitive are the string hint and the default hint, along with a direct toString() call. For example:

// output
alert(obj);

// using object as a property key
anotherObj[obj] = 123;
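
As a minimal sketch of the conversion described above, assuming a hypothetical IdSketch wrapper around a Uint8Array (the issue has not settled on a base class, and this name is not part of js-id), the string and default hints can return a binary string while the number hint returns NaN:

```ts
// Hypothetical wrapper for illustration only; not the js-id implementation.
class IdSketch {
  public readonly bytes: Uint8Array;

  constructor(bytes: Uint8Array) {
    this.bytes = bytes;
  }

  // Encode each byte as one character with the same char code (a "binary string").
  public toString(): string {
    let s = '';
    for (let i = 0; i < this.bytes.length; i++) {
      s += String.fromCharCode(this.bytes[i]);
    }
    return s;
  }

  public [Symbol.toPrimitive](hint: 'string' | 'number' | 'default'): string | number {
    // No meaningful numeric form is assumed, so mirror ArrayBuffer and yield NaN.
    if (hint === 'number') return NaN;
    // Both 'string' and 'default' fall through to the binary string.
    return this.toString();
  }
}

const idA = new IdSketch(new Uint8Array([0x00, 0x41, 0xff]));

// String(...) triggers the 'string' hint, the same hint used for property keys.
const keyA = String(idA);
const record: Record<string, number> = {};
record[keyA] = 123;

// Map keys are compared by identity, so the string form has to be used explicitly.
const byId = new Map<string, number>();
byId.set(keyA, 456);

// Number(...) triggers the 'number' hint.
const asNumber = Number(idA);
```

Note that a Map never consults Symbol.toPrimitive for its keys, so two distinct Id objects with the same bytes would be different keys unless they are converted to strings first.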

If the binary string version of the 128 bit identifier sorts in the same order that Buffer.compare gives for the underlying bytes, then the string form can be used for both of these cases.

It would be ideal if we could do id1 > id2 too, but this uses the number hint.

So basically we can try:

map the string and default hints to a binary string
ensure that binary string comparison gives the same order as Buffer.compare
deal with the lack of an appropriate number representation; id1 > id2 might still be achievable
extending ArrayBuffer (class Id extends ArrayBuffer) could be used, but some of these operations require direct access to the bytes, so Uint8Array would be preferred
alternatively, if we expect that end-users may use feross/buffer as a devDependency when bundling, then it's also possible to use import { Buffer } from 'buffer'; which would simplify our comparisons and integration into the rest of PK

Additional context

https://javascript.info/object-toprimitive
https://gitlab.com/MatrixAI/Engineering/Polykey/js-polykey/-/merge_requests/205#note_702195122 - discussion leading to this change
Note that a Buffer is a Uint8Array, and a Uint8Array is a view over an ArrayBuffer, so Buffer is the most flexible. However there is the issue of detached array buffers. Node buffers aren't able to be detached: https://github.com/MatrixAI/js-polykey/issues/220
2 - if we use Buffer, that impacts the goal to make ES compliant, but those are separate concerns...
https://developer.mozilla.org/en-US/docs/Web/API/DOMString/Binary

Tasks

[x] - Experiment with Symbol.toPrimitive, toString and valueOf
[x] - Experiment with extending ArrayBuffer, or Uint8Array or Buffer if it makes it easier...
[x] - Update tests for new types
[x] - Add new tests for primitive usage like using in POJOs, Maps and comparisons.

CMCDragonkai commented 2 years ago

class Left {
  public [Symbol.toPrimitive](hint: 'string' | 'number' | 'default') {
    return 'a';
  }
}

class Right {
  public [Symbol.toPrimitive](hint: 'string' | 'number' | 'default') {
    return 'b';
  }
}

const left = new Left;
const right = new Right;

console.log(left < right);
console.log(left <= right);
console.log(left > right);
console.log(left >= right);

The above shows that the hint will be number for these comparisons, but toPrimitive can return a string instead. The hint is just a hint; you don't have to abide by it. The operands are then "cast" so that the comparison becomes 'a' < 'b', which is the correct result for a string comparison.

The MDN documentation for Array.prototype.sort says:

If compareFunction is not supplied, all non-undefined array elements are sorted by converting them to strings and comparing strings in UTF-16 code units order. For example, "banana" comes before "cherry".

Note: In UTF-16, Unicode characters above \uFFFF are encoded as two surrogate code units, of the range \uD800-\uDFFF. The value of each code unit is taken separately into account for the comparison. Thus the character formed by the surrogate pair \uD855\uDE51 will be sorted before the character \uFF3A.

So it's the value of each "code unit" that is compared. Each code unit in UTF-16 is 2 bytes. The question is what happens if we convert our IDs to binary strings.

However when using Buffer.from(...).toString('binary') this is an alias for the latin1 encoding. The node docs say:

'latin1': Latin-1 stands for ISO-8859-1. This character encoding only supports the Unicode characters from U+0000 to U+00FF. Each character is encoded using a single byte. Characters that do not fit into that range are truncated and will be mapped to characters in that range.

This is basically an 8-bit superset of ASCII, or more appropriately https://en.wikipedia.org/wiki/ISO/IEC_8859-1.

In terms of encoding the buffer, the buffer is already single bytes.

I'm not sure what it means to encode into a latin1 string and then compare that string during a sort, when the sort is documented as using UTF-16 code units.

Reading this: https://kevin.burke.dev/kevin/node-js-string-encoding/ means that JS strings are always encoded with UTF-16. However the runtime appears to do a lot of automatic conversions. So for most inputs into a JS program, it's expected that inputs will be in UTF-8. However internally I believe it is UTF-16. When you do Buffer.from(s, 'utf8') or Buffer.from(s, 'utf16le') they both work because JS knows that the string is UTF-16 encoded, and will translate it to utf8 or utf16le on the fly.

How does this impact us? Well, when we return a binary string form of an ID, whatever encoding we choose, we should check that the string length is ultimately 16, meaning 16 bytes. I think this will work because the latin1 or binary encoding is an 8-bit superset of ASCII, and that will cover the full byte range. I wonder though if that means the string will be translated to UTF-16 and stored as UTF-16.

During a sort, if it compares the strings by UTF-16 code units, then my idea that it would compare the individual byte values isn't how it works. The concern would be whether a code unit could end up out of order relative to the byte ordering in the ID.
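
The ordering concern can be checked empirically. The sketch below is an assumption-laden test rather than a proof: compareBytes is a hand-written stand-in for Buffer.compare (so the example runs without Node's Buffer), toBinaryString maps each byte to the character with the same code, and sampleIds is a hypothetical helper producing deterministic 16-byte samples. Because every character produced this way is a single code unit in the range 0x00 to 0xFF, comparing code units is the same as comparing bytes.

```ts
// Byte-wise lexicographic comparison; a stand-in for Buffer.compare.
function compareBytes(a: Uint8Array, b: Uint8Array): number {
  const n = Math.min(a.length, b.length);
  for (let i = 0; i < n; i++) {
    if (a[i] !== b[i]) return a[i] < b[i] ? -1 : 1;
  }
  if (a.length === b.length) return 0;
  return a.length < b.length ? -1 : 1;
}

// One character per byte, with char code equal to the byte value.
function toBinaryString(bytes: Uint8Array): string {
  let s = '';
  for (let i = 0; i < bytes.length; i++) {
    s += String.fromCharCode(bytes[i]);
  }
  return s;
}

// The built-in relational operators compare strings by UTF-16 code units.
function compareStrings(a: string, b: string): number {
  return a < b ? -1 : a > b ? 1 : 0;
}

// Deterministic pseudo-random 16-byte samples (hypothetical test helper).
function sampleIds(count: number): Uint8Array[] {
  let state = 12345;
  const out: Uint8Array[] = [];
  for (let i = 0; i < count; i++) {
    const id = new Uint8Array(16);
    for (let j = 0; j < 16; j++) {
      state = (state * 48271) % 2147483647;
      id[j] = state & 0xff;
    }
    out.push(id);
  }
  return out;
}

const samples = sampleIds(200);
let orderingsAgree = true;
for (const x of samples) {
  for (const y of samples) {
    if (compareBytes(x, y) !== compareStrings(toBinaryString(x), toBinaryString(y))) {
      orderingsAgree = false;
    }
  }
}
```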

CMCDragonkai commented 2 years ago

Regarding operator overloading, TS has some problems:

This means we get type errors when we try to use them as indexes:

class Left {
  public [Symbol.toPrimitive](hint: 'string' | 'number' | 'default') {
    return 'a';
  }
}

const left = new Left();
const obj = {};

// TS rejects a Left instance as an index type, hence the suppression
// @ts-ignore
obj[left] = 1;

Funnily enough the comparison operators work.

It seems the only way is with explicit typecasts like left as unknown as string.

CMCDragonkai commented 2 years ago

One way to work around this is by making an intersection type:

type Id = IdInternal & string;

Then the idea is that we force it with a smart constructor:

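// Forward whatever arguments the IdInternal constructor accepts
function makeId(...args: ConstructorParameters<typeof IdInternal>): Id {
  return new IdInternal(...args) as Id;
}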

This means that from the outside, the id will appear to users like a string.

However type inference will think it also has all the string methods, which it won't. It would only have them if it were cast appropriately to a real string first.
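
The trade-off of the intersection type can be seen in a small sketch. The names IdInternalSketch, IdBranded and makeIdSketch are hypothetical stand-ins for the IdInternal, Id and makeId above, and the constructor taking a plain string is an assumption made only to keep the example short:

```ts
// Hypothetical stand-in for IdInternal; it simply wraps a string.
class IdInternalSketch {
  private readonly text: string;

  constructor(text: string) {
    this.text = text;
  }

  public [Symbol.toPrimitive](): string {
    return this.text;
  }
}

// To the type checker, this is both the class instance and a string.
type IdBranded = IdInternalSketch & string;

// Smart constructor that forces the intersection type.
function makeIdSketch(text: string): IdBranded {
  return new IdInternalSketch(text) as IdBranded;
}

const branded = makeIdSketch('abc');

// Indexing now type-checks without @ts-ignore; at runtime the key is the
// primitive returned by Symbol.toPrimitive.
const table: Record<string, number> = {};
table[branded] = 1;

// The value is still an object at runtime, so string methods that the type
// system believes exist are actually missing.
const isRealString = typeof (branded as unknown) === 'string';
const hasStringMethods = typeof branded.toUpperCase === 'function';

// Converting explicitly yields a real string with all its methods.
const real = String(branded);
```

Calling something like branded.toUpperCase() would compile but fail at runtime, which is the cost of this workaround.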

CMCDragonkai commented 2 years ago

Sticking with Uint8Array for now, since we only need to figure out how to encode Uint8Array to binary string https://developer.mozilla.org/en-US/docs/Web/API/DOMString/Binary.
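
A minimal sketch of that encoding, assuming the simplest mapping of one character per byte; the function names bytesToBinaryString and binaryStringToBytes are illustrative and not taken from js-id:

```ts
// Encode: each byte becomes the character whose code equals the byte value.
function bytesToBinaryString(bytes: Uint8Array): string {
  let s = '';
  for (let i = 0; i < bytes.length; i++) {
    s += String.fromCharCode(bytes[i]);
  }
  return s;
}

// Decode: each character's code becomes a byte; anything above 0xFF is rejected.
function binaryStringToBytes(s: string): Uint8Array {
  const bytes = new Uint8Array(s.length);
  for (let i = 0; i < s.length; i++) {
    const code = s.charCodeAt(i);
    if (code > 0xff) {
      throw new RangeError('not a binary string: code unit above 0xFF at index ' + i);
    }
    bytes[i] = code;
  }
  return bytes;
}

// Round trip over every possible byte value.
const allBytes = new Uint8Array(256);
for (let i = 0; i < 256; i++) {
  allBytes[i] = i;
}
const encodedAll = bytesToBinaryString(allBytes);
const decodedAll = binaryStringToBytes(encodedAll);

// A 16-byte (128-bit) id encodes to a string of length 16.
const encodedId = bytesToBinaryString(new Uint8Array(16));
```

The loop is used instead of spreading the array into String.fromCharCode so that the sketch does not depend on iteration support for typed arrays in the compile target.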

MatrixAI / js-id

Primitive conversions #5
