Open · mikolalysenko opened 5 months ago
@dqbd are we considering merging this? We've noticed some performance issues when the input is large as well.
@mikolalysenko have you benchmarked this vs. baseline vs. wasm? Curious about 1, 1,000, 1M, and 100M tokens.
If there is a perf difference at different scales, we should consider toggling algorithms based on length.
There are several open issues noting that, in the worst case, the BPE merge algorithm in js-tiktoken takes time quadratic in the number of input characters for certain pathological inputs.
This PR fixes the problem by using a heap to avoid recalculating the ranks of all tokens at each merge step. The same technique should also work for the Rust/WASM tokenizer, but it seems less important there since the native parsers are already quite fast.
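Roughly, the approach looks like the following simplified sketch (not the exact code in this PR; the `TokenNode`/`Candidate`/`MinHeap` names and the `ranks` map are illustrative): candidate pairs live in a min-heap keyed by rank, stale heap entries are skipped lazily, and each merge only re-ranks the two pairs adjacent to the merged token.

```typescript
// Hypothetical sketch of a heap-based BPE merge. `ranks` maps a merged piece
// to its merge priority (lower rank merges first); all names are illustrative.
type Ranks = Map<string, number>;

interface TokenNode {
  piece: string;
  prev: TokenNode | null;
  next: TokenNode | null;
  alive: boolean;
}

// A heap entry remembers the pair it was created for, so stale entries
// (whose neighbours have since changed) can be skipped lazily on pop.
interface Candidate {
  rank: number;
  left: TokenNode;
  pair: string;
}

class MinHeap {
  private items: Candidate[] = [];
  get size(): number { return this.items.length; }
  push(c: Candidate): void {
    this.items.push(c);
    let i = this.items.length - 1;
    while (i > 0) {
      const p = (i - 1) >> 1;
      if (this.items[p].rank <= this.items[i].rank) break;
      [this.items[p], this.items[i]] = [this.items[i], this.items[p]];
      i = p;
    }
  }
  pop(): Candidate | undefined {
    const top = this.items[0];
    const last = this.items.pop();
    if (last !== undefined && this.items.length > 0) {
      this.items[0] = last;
      let i = 0;
      for (;;) {
        const l = 2 * i + 1, r = 2 * i + 2;
        let m = i;
        if (l < this.items.length && this.items[l].rank < this.items[m].rank) m = l;
        if (r < this.items.length && this.items[r].rank < this.items[m].rank) m = r;
        if (m === i) break;
        [this.items[m], this.items[i]] = [this.items[i], this.items[m]];
        i = m;
      }
    }
    return top;
  }
}

function bpeMerge(parts: string[], ranks: Ranks): string[] {
  // Doubly linked list of the initial single-character pieces.
  const nodes: TokenNode[] = parts.map((piece) => ({ piece, prev: null, next: null, alive: true }));
  for (let i = 0; i < nodes.length; i++) {
    nodes[i].prev = nodes[i - 1] ?? null;
    nodes[i].next = nodes[i + 1] ?? null;
  }

  const heap = new MinHeap();
  const pushCandidate = (left: TokenNode): void => {
    if (!left.next) return;
    const pair = left.piece + left.next.piece;
    const rank = ranks.get(pair);
    if (rank !== undefined) heap.push({ rank, left, pair });
  };
  for (const node of nodes) pushCandidate(node);

  while (heap.size > 0) {
    const { left, pair } = heap.pop()!;
    const right = left.next;
    // Skip stale entries instead of re-ranking everything after each merge.
    if (!left.alive || !right || left.piece + right.piece !== pair) continue;

    // Merge `right` into `left` and unlink it from the list.
    left.piece = pair;
    left.next = right.next;
    if (right.next) right.next.prev = left;
    right.alive = false;

    // Only the two pairs touching the merged node need to be (re)pushed.
    if (left.prev) pushCandidate(left.prev);
    pushCandidate(left);
  }

  const out: string[] = [];
  for (let node: TokenNode | null = nodes[0] ?? null; node; node = node.next) out.push(node.piece);
  return out;
}
```

Each merge pushes at most two new candidates, so the total heap work is O(n log n), instead of rescanning the ranks of all remaining pairs after every merge, which is where the quadratic behavior comes from.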
I also added a new test fixture and an example string that triggers the pathological behavior.
Related issues:
Note: this should probably be treated as a mild CVE, since an attacker could use this behavior to mount a denial-of-service attack against services that validate user input with js-tiktoken.