BurntSushi / rebar

A biased barometer for gauging the relative speed of some regex engines on a curated set of tasks.

A found-in-the-wild ML benchmark #18

Open · xd009642 opened this issue 1 month ago

xd009642 commented 1 month ago

Setting aside most of the broader discussion about LLMs and the surrounding technology: the creators of this tech are brute-forcing some truly horrifying regexes to make tokenisation work as expected. So it might be worth adding some of them to the benchmarks, since this is becoming a real-world workload for a number of people.

I haven't done an exhaustive search for examples, but I was inspired to open this issue after seeing this in the wild:

From https://github.com/huggingface/tokenizers/blob/14a07b06e4a8bd8f80d884419ae4630f5a3d8098/tokenizers/src/tokenizer/mod.rs#L1376C37-L1376C172

(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+
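
Worth noting: the pattern uses a negative lookahead (`(?!\S)`), which the `regex` crate doesn't support, so any benchmark entry would be limited to engines with lookaround support. A minimal sketch of exercising the pattern with the `fancy-regex` crate (the crate choice is just for illustration, not necessarily what tokenizers uses internally):

```rust
use fancy_regex::Regex;

fn main() {
    // The pre-tokenization split pattern quoted above (backslashes are
    // doubled because this is a Rust string literal, as in the source).
    let pattern = "(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+";
    let re = Regex::new(pattern).expect("pattern should compile");

    // Every non-overlapping match is one pre-token; a BPE tokenizer would
    // then encode each piece independently.
    let haystack = "I'm benchmarking 1234 regex engines,\nhonestly!";
    for m in re.find_iter(haystack) {
        println!("{:?}", m.expect("no match error").as_str());
    }
}
```

Since this split runs over every byte of text a tokenizer sees, its throughput is exactly the kind of thing a benchmark entry would capture.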
BurntSushi commented 1 month ago

One thing we can do more quickly is to just create a dedicated "machine learning" category (something very generic) where we collect workloads from wherever.

IDK if it yet makes sense to add to the curated set. I want to grow/change that very intentionally. But the curated set is only a tiny slice of the rebar benchmarks. I'm extremely liberal about adding non-curated benchmarks.
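
For running such a category in isolation, rebar's name filter should do the job. A rough sketch of the workflow, assuming the category ends up under a hypothetical `wild/ai/` prefix:

```
$ rebar measure -f '^wild/ai/' | tee ai.csv
$ rebar cmp ai.csv
```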

xd009642 commented 1 month ago

Where do the non-curated ones go in the repo? I was just looking at benchmarks/regexes/wild initially from browsing.

I can probably spend a few hours farming regexes from various packages. There are a bunch of hits in https://github.com/search?q=org%3Aopenai%20regex&type=code, plus the tokenisers one I shared above, and there are likely a few in repos like https://github.com/langchain-ai/langchain.

BurntSushi commented 1 month ago

It's hard to know what the right categorization is up-front without actually trying something. I'd probably start with ./definitions/wild/ai/langchain.toml or something like that. And then add new TOML files for each project, e.g., ./definitions/wild/ai/huggingface.toml.
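
For concreteness, here is a rough sketch of what a single entry in such a file might look like, modeled loosely on the existing definitions. The field names, haystack path, engine names, and the zero count are assumptions/placeholders and would need to be checked against rebar's format docs and verified with `rebar measure`:

```toml
# Hypothetical ./definitions/wild/ai/huggingface.toml
[[bench]]
model = "count"
name = "bpe-pretokenize"
# The split pattern from the tokenizers source, as a TOML literal string
# so the backslashes don't need doubling.
regex = '''(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+'''
haystack = { path = "opensubtitles/en-sampled.txt" }
# Placeholder; the real expected match count has to be filled in so rebar
# can verify that each engine agrees.
count = 0
# Only engines with lookahead support can run this pattern; names are
# illustrative.
engines = ["pcre2", "python/regex", "java/hotspot"]
```

The expected count would come from running the pattern over the chosen haystack once with a trusted engine and recording the result.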