enkimute / shorty.js

:fast_forward: Efficient compression of streams of short JSON strings
MIT License
25 stars 4 forks source link

This code seems not working properly. #2

Open undefined-moe opened 1 year ago

undefined-moe commented 1 year ago

image

Payloads:

{"html":"<tr data-rid=\"6374a0646ad2a099060d0d4a\">\n  <td class=\"col--status record-status--border  pass\">\n  <div class=\"col--status__text\">\n    <span class=\"icon record-status--icon pass\"></span>\n    <a \n      href=\"/d/system/record/6374a0646ad2a099060d0d4a\"\n      class=\"record-status--text pass\"\n    >\n      <span style=\"color: #25ad40\">100</span>\n      Accepted\n    </a>\n  </div>\n  </td>\n\n  <td class=\"col--problem col--problem-name\">\n      <form class=\"form--inline\" method=\"post\" action=\"/d/system/record/6374a0646ad2a099060d0d4a\">\n      <input type=\"hidden\" name=\"operation\" value=\"rejudge\">\n      <input type=\"hidden\" name=\"csrfToken\" value=\"\">\n      <button type=\"submit\" class=\"link text-maroon lighter\">\n        <span class=\"icon icon-refresh\"></span>\n        ÍK\n      </button> |\n    </form>\n              <a href=\"/p/H1000\"><b>H1000</b>&nbsp;&nbsp;A + B Problem</a>\n        </td>\n  <td class=\"col--submit-by\">\n    <a href=\"/user/13196\">\n              má\n          </a>\n  </td>\n  <td class=\"col--time\">20ms</td>\n  <td class=\"col--memory\">384 KiB</td>\n  <td class=\"col--lang\">C++11</td>\n  <td class=\"col--submit-at\"><span class=\"time relative\" data-timestamp=\"1668587620\">2022-11-16 16:33:40</span></td>\n</tr>\n"}

{"html":"<tr data-rid=\"6374aba54d63a498cb187c6a\">\n  <td class=\"col--status record-status--border  fail\">\n  <div class=\"col--status__text\">\n    <span class=\"icon record-status--icon fail\"></span>\n    <a \n      href=\"/d/system/record/6374aba54d63a498cb187c6a\"\n      class=\"record-status--text fail\"\n    >\n      <span style=\"color: #ff4f4f\">0</span>\n      Compile Error\n    </a>\n  </div>\n  </td>\n\n  <td class=\"col--problem col--problem-name\">\n      <form class=\"form--inline\" method=\"post\" action=\"/d/system/record/6374aba54d63a498cb187c6a\">\n      <input type=\"hidden\" name=\"operation\" value=\"rejudge\">\n      <input type=\"hidden\" name=\"csrfToken\" value=\"\">\n      <button type=\"submit\" class=\"link text-maroon lighter\">\n        <span class=\"icon icon-refresh\"></span>\n        重测\n      </button> |\n    </form>\n              <a href=\"/p/H1000\"><b>H1000</b>&nbsp;&nbsp;A + B Problem</a>\n      <span class=\"text-orange\">(自测)</span>  </td>\n  <td class=\"col--submit-by\">\n    <a href=\"/user/8265\">\n              \n          </a>\n  </td>\n  <td class=\"col--time\">0ms</td>\n  <td class=\"col--memory\">0 Bytes</td>\n  <td class=\"col--lang\">C++20(O2)</td>\n  <td class=\"col--submit-at\"><span class=\"time relative\" data-timestamp=\"1668590501\">2022-11-16 17:21:41</span></td>\n</tr>\n"}
kungfooman commented 9 months ago

It happens because of �má:

image

Test code:

"abc�má".split('').map(_ => _.charCodeAt(0))

65533 is a so called Unicode replacement character: https://stackoverflow.com/questions/41462598/recieving-65533-as-char-value-for-characters-as-%C3%A0-%C3%98-%C3%A6-%C3%A6-etc

So it seems like something fishy happened to your encoding... anyway, I think Shorty should either warn about it or handle it gracefully. Right now it just rewrites the character:

import {Shorty} from './src/Shorty.js';
/**
 * @param {string} original 
 * @param {number} [tokensize]
 */
function test(original, tokensize) {
  const deflated = new Shorty(tokensize).deflate(original);
  const inflated = new Shorty(tokensize).inflate(deflated);
  console.log("original", original);
  console.log("deflated", deflated);
  console.log("inflated", inflated);
  console.log("should be true", original === inflated);
  console.log('#####');
}
test("First test string, hello world! Hello world? Hello world!");
test(`<span class=\"icon icon-refresh\"></span> �má ÍK </span>`);

Output:

image

undefined-moe commented 9 months ago

Really thanks for pointing that out. Thought processing input and output string as ArrayBuffer to skipping those weird unicodes might help.
Will dig into the code and try to fix this later :)

undefined-moe commented 9 months ago

Ah I found the actual issue, and the impact is much more than I thought.

shorty assumes that there's always 8 bit data in a single char, while for many non-english characters, it's not, and shorty will ignore all high bits.

image

image

After encoding the input and output string, it finally worked.

And here's modified code:

class Shorty {
    data: string;
    arrayBuffer?: Uint8Array;
    curpos = 0;
    bitCount = 7;
    bitChar = 0;
    nodes: any[] = [];
    nyt = 0;
    nodecount = 0;

    constructor(public tokensize = 10) {
        this.reset(true);
    }

    reset(full = false) {
        if (full === true) {
            this.nodes = [{ up: 0, weight: 0 }];
            this.nyt = 0;
            this.nodecount = 0;
        }
        this.data = '';
        this.curpos = 0;
        this.bitCount = 7;
        this.bitChar = 0;
    }

    findNode(x) {
        for (let i = this.nodes.length - 1; i > 0; i--) {
            if (typeof this.nodes[i].symbol !== 'undefined' && this.nodes[i].symbol === x) return i;
        }
        return 0;
    }

    addNode(token) {
        if (this.nodecount >= 2046) return 0;
        this.nodes[++this.nodecount] = { up: this.nyt, symbol: token, weight: 1 };
        this.nodes[++this.nodecount] = { up: this.nyt, weight: 0 };
        this.nodes[this.nyt].weight += 1;
        this.nyt = this.nodecount;
        if (this.nodes[this.nodecount - 2].up !== this.nodecount - 2) this.balanceNode(this.nodes[this.nodecount - 2].up);
        return this.nodecount - 2;
    }

    swapNode(a, b) {
        const t = this.nodes[a].symbol;
        const u = this.nodes[b].symbol;
        const v = this.nodes[a].weight;
        this.nodes[a].symbol = u;
        this.nodes[b].symbol = t;
        this.nodes[a].weight = this.nodes[b].weight;
        this.nodes[b].weight = v;
        for (let n = this.nodes.length - 1; n > 0; n--) {
            if (this.nodes[n].up === a) this.nodes[n].up = b;
            else if (this.nodes[n].up === b) this.nodes[n].up = a;
        }
    }

    balanceNode(node) {
        while (true) {
            let minnr = node;
            const weight = this.nodes[node].weight;
            while (minnr > 1 && this.nodes[minnr - 1].weight === weight) minnr--;
            if (minnr !== node && minnr !== this.nodes[node].up) {
                this.swapNode(minnr, node);
                node = minnr;
            }
            this.nodes[node].weight++;
            if (this.nodes[node].up === node) return;
            node = this.nodes[node].up;
        }
    }

    emitNode(node) {
        const emit = [];
        while (node !== 0) {
            emit.unshift(node % 2);
            node = this.nodes[node].up;
        }
        for (let e = 0; e < emit.length; e++) this.emitBit(emit[e]);
    }

    emitNyt(token) {
        this.emitNode(this.nyt);
        const ll = token.length - 1;
        if (this.tokensize > 8) this.emitBit(ll & 8);
        if (this.tokensize > 4) this.emitBit(ll & 4);
        if (this.tokensize > 2) this.emitBit(ll & 2);
        if (this.tokensize > 1) this.emitBit(ll & 1);
        for (let cc = 0; cc < token.length; cc++) { this.emitByte(token.charCodeAt(cc)); }
        return this.nyt;
    }

    readNode(): number[] {
        if (this.nyt === 0) {
            let len = ((this.tokensize > 8) ? this.readBit() * 8 : 0) + ((this.tokensize > 4) ? this.readBit() * 4 : 0) + ((this.tokensize > 2) ? this.readBit() * 2 : 0) + ((this.tokensize > 1) ? this.readBit() : 0) + 1;
            const stream: number[] = [];
            while (len--) stream.push(this.readByte());
            return stream;
        }
        let node = 0;
        while (true) {
            const bit = this.readBit();
            if (this.nodes[node].symbol === undefined) {
                for (let m = 0; ; m++) {
                    if (this.nodes[m].up === node && m !== node && ((m % 2) === bit)) { node = m; break; }
                }
            }
            if (this.nodes[node].symbol !== undefined || this.nodes[node].weight === 0) {
                if (this.nodes[node].weight) return this.nodes[node].symbol;
                let len = ((this.tokensize > 8) ? this.readBit() * 8 : 0) + ((this.tokensize > 4) ? this.readBit() * 4 : 0) + ((this.tokensize > 2) ? this.readBit() * 2 : 0) + ((this.tokensize > 1) ? this.readBit() : 0) + 1;
                const stream: number[] = [];
                while (len--) stream.push(this.readByte());
                return stream;
            }
        }
    }

    emitBit(bit) {
        if (bit) this.bitChar += 1 << this.bitCount;
        if (--this.bitCount < 0) {
            this.data += String.fromCharCode(this.bitChar);
            this.bitCount = 7;
            this.bitChar = 0;
        }
    }

    emitByte(byte) {
        for (let i = 7; i >= 0; i--) this.emitBit(byte >> i & 1);
    }

    readBit() {
        if (this.curpos === this.data.length * 8) throw ('done');
        const bit = this.data.charCodeAt(this.curpos >> 3) >> (7 - this.curpos & 7) & 1;
        this.curpos++;
        return bit;
    }

    readByte() {
        let res = 0;
        for (let i = 0; i < 8; i++) res += (128 >> i) * this.readBit();
        return res;
    }

    deflate(data: string) {
        let token;
        this.arrayBuffer = new TextEncoder().encode(data);
        const l = this.arrayBuffer.length;
        let i;
        let x;
        this.reset();
        for (i = 0; i < l; i++) {
            token = String.fromCharCode(this.arrayBuffer[i]);
            if (this.tokensize > 1) {
                if (/[a-zA-Z]/.test(token)) {
                    while ((i + 1) < l && token.length < this.tokensize && /[a-zA-Z]/.test(String.fromCharCode(this.arrayBuffer[i + 1]))) {
                        token += String.fromCharCode(this.arrayBuffer[++i]);
                    }
                } else if (/[=[\],.:"'{}]/.test(token)) {
                    while ((i + 1) < l && token.length < this.tokensize && /[=[\],.:"'{}]/.test(String.fromCharCode(this.arrayBuffer[i + 1]))) {
                        i++;
                        token += String.fromCharCode(this.arrayBuffer[i]);
                    } // joe hl patch "
                }
            }
            x = this.findNode(token);
            if (!x) {
                this.emitNyt(token);
                x = this.addNode(token);
            } else {
                this.emitNode(x);
                this.balanceNode(x);
            }
        }
        if (this.bitCount !== 7) {
            const oldlength = this.arrayBuffer.length;
            this.emitNode(this.nyt);
            if (oldlength === this.arrayBuffer.length) this.emitByte(0);
        }
        return this.data;
    }

    inflate(data: string) {
        this.reset();
        this.data = data;

        const output: number[] = [];
        try {
            while (true) {
                const token = this.readNode();
                output.push(...token);
                const node = this.findNode(token);
                if (!node) this.addNode(token);
                else this.balanceNode(node);
            }
        } catch (e) { }
        return new TextDecoder().decode(new Uint8Array(output));
    }
}

/**
 * @param {string} original
 * @param {number} [tokensize]
 */
function test(original, tokensize) {
    const deflated = new Shorty(tokensize).deflate(original);
    const inflated = new Shorty(tokensize).inflate(deflated);
    console.log('original', original);
    console.log('deflated', deflated);
    console.log('inflated', inflated);
    console.log('should be true', original === inflated);
    console.log('#####');
}
test('First test string, hello world! Hello world? Hello world!');
test('<span class=\"icon icon-refresh\"></span> �má ÍK 测试 123</span>');
kungfooman commented 9 months ago

Thank you for fixing and new test case :+1:

Btw you can have types without needed a build step for quicker testing by simply using JSDoc comments and a jsconfig.json file. I prepared it here, no error in strict mode:

https://github.com/kungfooman/shorty.js/blob/master/src/Shorty.js

undefined-moe commented 9 months ago

editor won't report errors for jsdoc type mismatch so I would still prefer typescript.

I'm currently using node -r @hydrooj/utils/lib/register code.ts so there's no need to compile the code using tsc (loader will use esbuild which would be much faster) and don't need to worry about loading esm modules in commonjs code :)

BTW I've published a npm package for this: https://www.npmjs.com/shorty.js

kungfooman commented 9 months ago

editor won't report errors for jsdoc type mismatch so I would still prefer typescript.

Of course it will:

image

Are you using VSCode? The only thing you need is a jsconfig.json, I added it in strict mode:

https://github.com/kungfooman/shorty.js/blob/master/jsconfig.json

BTW I've published a npm package for this: https://www.npmjs.com/shorty.js

It's bad practise to not ship ESM in 2024 and .ts should be ignored in package.

But nice work fixing the error :+1:

undefined-moe commented 9 months ago

Ah don't know about jsconfig before, thanks for that information :D

I've published a new version with esm support, the ts file is for typings and sourcemaps (I am really annoyed seeing stack traces mapped to ts files and it's not ctrl+clickable) so I'd prefer publish source codes along with compiled ones.


I hate esm because it made me impossible to implement things like hot module replacement (on backend), redirecting require.resolve, etc, so it's impossible for me to fully switch to esm, also can't imagine you have to write code like

const foo = require('foo');
const bar = await import('bar');

while top-level-await wasn't supported by cjs (all sindresorhus's package is now esm only, WHY DO THIS???), it's kind of enforcing that forces you use esm, and that's why I created that helper loader to handle all this kind of stuff.