Open undefined-moe opened 1 year ago
It happens because of �má
:
Test code:
"abc�má".split('').map(_ => _.charCodeAt(0))
65533 is a so called Unicode replacement character: https://stackoverflow.com/questions/41462598/recieving-65533-as-char-value-for-characters-as-%C3%A0-%C3%98-%C3%A6-%C3%A6-etc
So it seems like something fishy happened to your encoding... anyway, I think Shorty should either warn about it or handle it gracefully. Right now it just rewrites the character:
import {Shorty} from './src/Shorty.js';
/**
* @param {string} original
* @param {number} [tokensize]
*/
function test(original, tokensize) {
const deflated = new Shorty(tokensize).deflate(original);
const inflated = new Shorty(tokensize).inflate(deflated);
console.log("original", original);
console.log("deflated", deflated);
console.log("inflated", inflated);
console.log("should be true", original === inflated);
console.log('#####');
}
test("First test string, hello world! Hello world? Hello world!");
test(`<span class=\"icon icon-refresh\"></span> �má ÍK </span>`);
Output:
Really thanks for pointing that out. Thought processing input and output string as ArrayBuffer to skipping those weird unicodes might help.
Will dig into the code and try to fix this later :)
Ah I found the actual issue, and the impact is much more than I thought.
shorty assumes that there's always 8 bit data in a single char, while for many non-english characters, it's not, and shorty will ignore all high bits.
After encoding the input and output string, it finally worked.
And here's modified code:
class Shorty {
data: string;
arrayBuffer?: Uint8Array;
curpos = 0;
bitCount = 7;
bitChar = 0;
nodes: any[] = [];
nyt = 0;
nodecount = 0;
constructor(public tokensize = 10) {
this.reset(true);
}
reset(full = false) {
if (full === true) {
this.nodes = [{ up: 0, weight: 0 }];
this.nyt = 0;
this.nodecount = 0;
}
this.data = '';
this.curpos = 0;
this.bitCount = 7;
this.bitChar = 0;
}
findNode(x) {
for (let i = this.nodes.length - 1; i > 0; i--) {
if (typeof this.nodes[i].symbol !== 'undefined' && this.nodes[i].symbol === x) return i;
}
return 0;
}
addNode(token) {
if (this.nodecount >= 2046) return 0;
this.nodes[++this.nodecount] = { up: this.nyt, symbol: token, weight: 1 };
this.nodes[++this.nodecount] = { up: this.nyt, weight: 0 };
this.nodes[this.nyt].weight += 1;
this.nyt = this.nodecount;
if (this.nodes[this.nodecount - 2].up !== this.nodecount - 2) this.balanceNode(this.nodes[this.nodecount - 2].up);
return this.nodecount - 2;
}
swapNode(a, b) {
const t = this.nodes[a].symbol;
const u = this.nodes[b].symbol;
const v = this.nodes[a].weight;
this.nodes[a].symbol = u;
this.nodes[b].symbol = t;
this.nodes[a].weight = this.nodes[b].weight;
this.nodes[b].weight = v;
for (let n = this.nodes.length - 1; n > 0; n--) {
if (this.nodes[n].up === a) this.nodes[n].up = b;
else if (this.nodes[n].up === b) this.nodes[n].up = a;
}
}
balanceNode(node) {
while (true) {
let minnr = node;
const weight = this.nodes[node].weight;
while (minnr > 1 && this.nodes[minnr - 1].weight === weight) minnr--;
if (minnr !== node && minnr !== this.nodes[node].up) {
this.swapNode(minnr, node);
node = minnr;
}
this.nodes[node].weight++;
if (this.nodes[node].up === node) return;
node = this.nodes[node].up;
}
}
emitNode(node) {
const emit = [];
while (node !== 0) {
emit.unshift(node % 2);
node = this.nodes[node].up;
}
for (let e = 0; e < emit.length; e++) this.emitBit(emit[e]);
}
emitNyt(token) {
this.emitNode(this.nyt);
const ll = token.length - 1;
if (this.tokensize > 8) this.emitBit(ll & 8);
if (this.tokensize > 4) this.emitBit(ll & 4);
if (this.tokensize > 2) this.emitBit(ll & 2);
if (this.tokensize > 1) this.emitBit(ll & 1);
for (let cc = 0; cc < token.length; cc++) { this.emitByte(token.charCodeAt(cc)); }
return this.nyt;
}
readNode(): number[] {
if (this.nyt === 0) {
let len = ((this.tokensize > 8) ? this.readBit() * 8 : 0) + ((this.tokensize > 4) ? this.readBit() * 4 : 0) + ((this.tokensize > 2) ? this.readBit() * 2 : 0) + ((this.tokensize > 1) ? this.readBit() : 0) + 1;
const stream: number[] = [];
while (len--) stream.push(this.readByte());
return stream;
}
let node = 0;
while (true) {
const bit = this.readBit();
if (this.nodes[node].symbol === undefined) {
for (let m = 0; ; m++) {
if (this.nodes[m].up === node && m !== node && ((m % 2) === bit)) { node = m; break; }
}
}
if (this.nodes[node].symbol !== undefined || this.nodes[node].weight === 0) {
if (this.nodes[node].weight) return this.nodes[node].symbol;
let len = ((this.tokensize > 8) ? this.readBit() * 8 : 0) + ((this.tokensize > 4) ? this.readBit() * 4 : 0) + ((this.tokensize > 2) ? this.readBit() * 2 : 0) + ((this.tokensize > 1) ? this.readBit() : 0) + 1;
const stream: number[] = [];
while (len--) stream.push(this.readByte());
return stream;
}
}
}
emitBit(bit) {
if (bit) this.bitChar += 1 << this.bitCount;
if (--this.bitCount < 0) {
this.data += String.fromCharCode(this.bitChar);
this.bitCount = 7;
this.bitChar = 0;
}
}
emitByte(byte) {
for (let i = 7; i >= 0; i--) this.emitBit(byte >> i & 1);
}
readBit() {
if (this.curpos === this.data.length * 8) throw ('done');
const bit = this.data.charCodeAt(this.curpos >> 3) >> (7 - this.curpos & 7) & 1;
this.curpos++;
return bit;
}
readByte() {
let res = 0;
for (let i = 0; i < 8; i++) res += (128 >> i) * this.readBit();
return res;
}
deflate(data: string) {
let token;
this.arrayBuffer = new TextEncoder().encode(data);
const l = this.arrayBuffer.length;
let i;
let x;
this.reset();
for (i = 0; i < l; i++) {
token = String.fromCharCode(this.arrayBuffer[i]);
if (this.tokensize > 1) {
if (/[a-zA-Z]/.test(token)) {
while ((i + 1) < l && token.length < this.tokensize && /[a-zA-Z]/.test(String.fromCharCode(this.arrayBuffer[i + 1]))) {
token += String.fromCharCode(this.arrayBuffer[++i]);
}
} else if (/[=[\],.:"'{}]/.test(token)) {
while ((i + 1) < l && token.length < this.tokensize && /[=[\],.:"'{}]/.test(String.fromCharCode(this.arrayBuffer[i + 1]))) {
i++;
token += String.fromCharCode(this.arrayBuffer[i]);
} // joe hl patch "
}
}
x = this.findNode(token);
if (!x) {
this.emitNyt(token);
x = this.addNode(token);
} else {
this.emitNode(x);
this.balanceNode(x);
}
}
if (this.bitCount !== 7) {
const oldlength = this.arrayBuffer.length;
this.emitNode(this.nyt);
if (oldlength === this.arrayBuffer.length) this.emitByte(0);
}
return this.data;
}
inflate(data: string) {
this.reset();
this.data = data;
const output: number[] = [];
try {
while (true) {
const token = this.readNode();
output.push(...token);
const node = this.findNode(token);
if (!node) this.addNode(token);
else this.balanceNode(node);
}
} catch (e) { }
return new TextDecoder().decode(new Uint8Array(output));
}
}
/**
* @param {string} original
* @param {number} [tokensize]
*/
function test(original, tokensize) {
const deflated = new Shorty(tokensize).deflate(original);
const inflated = new Shorty(tokensize).inflate(deflated);
console.log('original', original);
console.log('deflated', deflated);
console.log('inflated', inflated);
console.log('should be true', original === inflated);
console.log('#####');
}
test('First test string, hello world! Hello world? Hello world!');
test('<span class=\"icon icon-refresh\"></span> �má ÍK 测试 123</span>');
Thank you for fixing and new test case :+1:
Btw you can have types without needed a build step for quicker testing by simply using JSDoc comments and a jsconfig.json
file. I prepared it here, no error in strict mode:
https://github.com/kungfooman/shorty.js/blob/master/src/Shorty.js
editor won't report errors for jsdoc type mismatch so I would still prefer typescript.
I'm currently using node -r @hydrooj/utils/lib/register code.ts
so there's no need to compile the code using tsc (loader will use esbuild which would be much faster) and don't need to worry about loading esm modules in commonjs code :)
BTW I've published a npm package for this: https://www.npmjs.com/shorty.js
editor won't report errors for jsdoc type mismatch so I would still prefer typescript.
Of course it will:
Are you using VSCode? The only thing you need is a jsconfig.json
, I added it in strict mode:
https://github.com/kungfooman/shorty.js/blob/master/jsconfig.json
BTW I've published a npm package for this: https://www.npmjs.com/shorty.js
It's bad practise to not ship ESM in 2024 and .ts should be ignored in package.
But nice work fixing the error :+1:
Ah don't know about jsconfig before, thanks for that information :D
I've published a new version with esm support, the ts file is for typings and sourcemaps (I am really annoyed seeing stack traces mapped to ts files and it's not ctrl+clickable) so I'd prefer publish source codes along with compiled ones.
I hate esm because it made me impossible to implement things like hot module replacement (on backend), redirecting require.resolve, etc, so it's impossible for me to fully switch to esm, also can't imagine you have to write code like
const foo = require('foo');
const bar = await import('bar');
while top-level-await wasn't supported by cjs (all sindresorhus's package is now esm only, WHY DO THIS???), it's kind of enforcing that forces you use esm, and that's why I created that helper loader to handle all this kind of stuff.
Payloads: