Open jeetiss opened 1 year ago
All of the non-trivial components of spec.json (encoded into include-ens.js) are required for compliant normalization.
// ~7KB
ens_tokenize() can be tree-shaken, but not much else. I made include-nf.js separate and optional because you could technically substitute String.prototype.normalize() when you control the JavaScript environment.
But in general, this should be avoided because a mismatched Unicode version is one of the issues normalization identifies. In my tests from 2021-22, modern browsers neither implemented NFC correctly nor agreed with each other, although this has improved since.
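If the native implementation is substituted anyway, a cheap spot check can at least catch an obviously broken runtime. The sketch below is only that: it compares native NFC against three compositions whose results are fixed by Unicode, and it cannot detect a Unicode version mismatch in general. The name `nativeNfcLooksSane` is hypothetical and not part of the library.

```javascript
// Spot check, not a conformance test: each pair is [input, expected NFC form].
const NFC_SAMPLES = [
  ['A\u030A', '\u00C5'],      // A + COMBINING RING ABOVE composes to Å
  ['\u212B', '\u00C5'],       // ANGSTROM SIGN normalizes to Å
  ['\u1100\u1161', '\uAC00'], // Hangul leading + vowel jamo compose to 가
];

// Hypothetical helper: true when the runtime's NFC agrees on every sample.
function nativeNfcLooksSane() {
  return NFC_SAMPLES.every(
    ([input, expected]) => input.normalize('NFC') === expected
  );
}

console.log(nativeNfcLooksSane());
```

A `false` result would be a strong reason to ship include-nf.js instead; a `true` result proves very little.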
I agree there is an 8/6 blowup factor from using base64 to embed a byte vector in a JavaScript file; however, I'm not sure there are better options. Once you go above 0x7F, the file will be UTF-8 (which defeats any gains) unless the MIME encoding is explicitly 8-bit (which seems unlikely given how JavaScript is deployed).
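To make the UTF-8 point concrete, here is a small sketch that measures how many bytes a single character costs once the file is UTF-8. The helper name `utf8Length` is made up for this illustration.

```javascript
// Code points up to 0x7F take one UTF-8 byte; 0x80 through 0x7FF take two.
const utf8 = new TextEncoder();

// Hypothetical helper: UTF-8 byte length of a single code point.
function utf8Length(codePoint) {
  return utf8.encode(String.fromCodePoint(codePoint)).length;
}

console.log(utf8Length(0x7f)); // 1
console.log(utf8Length(0x80)); // 2
console.log(utf8Length(0xff)); // 2
```

So a character in the 0x80 to 0xFF range carries 8 bits of payload in 16 bits of file, which is worse than the 6 bits in 8 that base64 achieves.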
make.js says:
include-ens [14322 bytes, 68 symbols, 19096 base64]
include-nf [5588 bytes, 65 symbols, 7451 base64]
Before: (14322 + 5588) * (8/6) ≅ 19096 + 7451 = 26547 bytes
Using base11X (128 - illegal ASCII) instead of base64 → ~6.5 bits
After: (14322 + 5588) * (8/6.5) = 24505 bytes
→ 2042 bytes can be saved (minus the base11X decoder code size)
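The estimate above can be reproduced with a few lines. This is only the arithmetic, using the byte counts printed by make.js and the assumed ~6.5 bits per character; `encodedLength` is a hypothetical helper, not something from the library.

```javascript
// Payload sizes in bytes, as reported by make.js.
const sizes = { 'include-ens': 14322, 'include-nf': 5588 };
const total = Object.values(sizes).reduce((a, b) => a + b, 0);

// Characters needed to carry `bytes` bytes at `bitsPerChar` bits per character.
function encodedLength(bytes, bitsPerChar) {
  return Math.ceil((bytes * 8) / bitsPerChar);
}

const before = encodedLength(total, 6);  // unpadded base64
const after = encodedLength(total, 6.5); // assumed ~6.5 bits per character
console.log(before, after, before - after); // 26547 24505 2042
```

The saving is about 7.7% of the encoded size before the decoder is counted, so the decoder needs to stay well under 2 KB for the change to pay off.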
I already include a polyfill for base64-decoding so this might be worth it.
The encoder is actually just a leftover from a previous version of the library, written when I didn't know what normalization needed. Now that the spec is stable, it can probably be greatly simplified. At the moment, my encoding process creates a vector of unsigned integers (attached below) that has the following distribution (where I've bucketed all values above 64).
// from src/make.js
let enc = new Encoder();
enc.write_member(...);
enc.write_mapped(...);
...
console.log(enc.values); // unsigned integer vector
console.log(enc.compressed()); // compressed values byte vector
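For anyone who wants to inspect such a vector themselves, a bucketed distribution like the one described could be computed roughly as follows. This is a sketch under assumptions: `bucketedHistogram` is a hypothetical helper, and the sample array is made-up data, not the real `enc.values`.

```javascript
// Count occurrences of each value, folding everything above `cap` into one bucket.
function bucketedHistogram(values, cap = 64) {
  const counts = new Map();
  for (const v of values) {
    const key = v > cap ? `>${cap}` : v;
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return counts;
}

// Made-up sample data for illustration only.
const sample = [0, 1, 1, 2, 64, 65, 300, 1];
console.log(bucketedHistogram(sample));
```

With the real vector, a distribution heavily skewed toward small values would suggest that a simple variable-length integer scheme could replace most of the encoder.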
The goal would be:
hello @adraffy,
I have a couple of ideas on how to make this library a little bit smaller:
- make it compression friendly: base64 output compresses poorly, so it should be better to use yenc and remove the encode_arithmetic/decode_arithmetic logic
- make the library tree-shaking friendly: the current implementation packs all used values into one large instance, and whatever you import, you still need the whole instance. Maybe it would be better to pack each list individually so that the library would be tree-shakable
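To make the first idea concrete, here is a minimal sketch of the yEnc byte mapping only (offset by 42, with `=` escapes for NUL, LF, CR and `=`). It leaves out yEnc's line wrapping, headers and checksums, and the function names are made up for this illustration, not taken from the library.

```javascript
const ESCAPE = 0x3d; // '='
const CRITICAL = new Set([0x00, 0x0a, 0x0d, ESCAPE]);

// Shift each byte by 42; escape the four critical results as '=' + (value + 64).
function yencEncode(bytes) {
  const out = [];
  for (const b of bytes) {
    const c = (b + 42) & 0xff;
    if (CRITICAL.has(c)) out.push(ESCAPE, (c + 64) & 0xff);
    else out.push(c);
  }
  return Uint8Array.from(out);
}

// Reverse the mapping: undo the escape first, then the offset.
function yencDecode(encoded) {
  const out = [];
  for (let i = 0; i < encoded.length; i++) {
    let c = encoded[i];
    if (c === ESCAPE) c = (encoded[++i] - 64) & 0xff;
    out.push((c - 42) & 0xff);
  }
  return Uint8Array.from(out);
}

const allBytes = Uint8Array.from({ length: 256 }, (_, i) => i);
console.log(yencEncode(allBytes).length); // 260
```

The overhead is only about 1.6% on uniformly distributed bytes, but roughly half of the output bytes are above 0x7F, so the gain depends on the file being delivered in an 8-bit encoding rather than UTF-8.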
I'm not sure whether these ideas will yield results, but they are interesting for me to explore. Are you interested in these changes too? Or maybe you have already tried something in this direction?