Open · mikolalysenko opened 5 months ago
@dqbd are we considering merging this? We've noticed some performance issues when the input is large as well.
@mikolalysenko have you benchmarked this vs. baseline vs. wasm? Curious about 1, 1,000, 1M, and 100M tokens.
If there is a perf difference at different scales, we should consider toggling algorithms based on length.
There are several open issues noting that, in the worst case, the BPE merge algorithm in js-tiktoken takes time quadratic in the number of input characters for certain pathological inputs.
This PR fixes the problem by using a heap to avoid recalculating the ranks of all tokens at each merge step. The same technique should also work for the Rust/WASM tokenizer, but it seems less important there since the native parsers are already quite fast.
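Roughly, the approach looks like the following simplified sketch (not the exact code in this PR; the `TokenNode`/`Candidate`/`MinHeap` names and the `ranks` map are illustrative): candidate pairs live in a min-heap keyed by rank, stale heap entries are skipped lazily, and each merge only re-ranks the two pairs adjacent to the merged token.

```typescript
// Hypothetical sketch of a heap-based BPE merge. `ranks` maps a merged piece
// to its merge priority (lower rank merges first); all names are illustrative.
type Ranks = Map<string, number>;

interface TokenNode {
  piece: string;
  prev: TokenNode | null;
  next: TokenNode | null;
  alive: boolean;
}

// A heap entry remembers the pair it was created for, so stale entries
// (whose neighbours have since changed) can be skipped lazily on pop.
interface Candidate {
  rank: number;
  left: TokenNode;
  pair: string;
}

class MinHeap {
  private items: Candidate[] = [];
  get size(): number { return this.items.length; }
  push(c: Candidate): void {
    this.items.push(c);
    let i = this.items.length - 1;
    while (i > 0) {
      const p = (i - 1) >> 1;
      if (this.items[p].rank <= this.items[i].rank) break;
      [this.items[p], this.items[i]] = [this.items[i], this.items[p]];
      i = p;
    }
  }
  pop(): Candidate | undefined {
    const top = this.items[0];
    const last = this.items.pop();
    if (last !== undefined && this.items.length > 0) {
      this.items[0] = last;
      let i = 0;
      for (;;) {
        const l = 2 * i + 1, r = 2 * i + 2;
        let m = i;
        if (l < this.items.length && this.items[l].rank < this.items[m].rank) m = l;
        if (r < this.items.length && this.items[r].rank < this.items[m].rank) m = r;
        if (m === i) break;
        [this.items[m], this.items[i]] = [this.items[i], this.items[m]];
        i = m;
      }
    }
    return top;
  }
}

function bpeMerge(parts: string[], ranks: Ranks): string[] {
  // Doubly linked list of the initial single-character pieces.
  const nodes: TokenNode[] = parts.map((piece) => ({ piece, prev: null, next: null, alive: true }));
  for (let i = 0; i < nodes.length; i++) {
    nodes[i].prev = nodes[i - 1] ?? null;
    nodes[i].next = nodes[i + 1] ?? null;
  }

  const heap = new MinHeap();
  const pushCandidate = (left: TokenNode): void => {
    if (!left.next) return;
    const pair = left.piece + left.next.piece;
    const rank = ranks.get(pair);
    if (rank !== undefined) heap.push({ rank, left, pair });
  };
  for (const node of nodes) pushCandidate(node);

  while (heap.size > 0) {
    const { left, pair } = heap.pop()!;
    const right = left.next;
    // Skip stale entries instead of re-ranking everything after each merge.
    if (!left.alive || !right || left.piece + right.piece !== pair) continue;

    // Merge `right` into `left` and unlink it from the list.
    left.piece = pair;
    left.next = right.next;
    if (right.next) right.next.prev = left;
    right.alive = false;

    // Only the two pairs touching the merged node need to be (re)pushed.
    if (left.prev) pushCandidate(left.prev);
    pushCandidate(left);
  }

  const out: string[] = [];
  for (let node: TokenNode | null = nodes[0] ?? null; node; node = node.next) out.push(node.piece);
  return out;
}
```

Each merge pushes at most two new candidates, so the total heap work is O(n log n), instead of rescanning the ranks of all remaining pairs after every merge, which is where the quadratic behavior comes from.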
I also added a new test fixture and an example string that triggers the pathological behavior.
Related issues:
Note: this should probably be treated as a mild CVE, since an attacker could use this behavior to mount a denial-of-service attack against services that validate user input with js-tiktoken.