getify / JSON.minify

Simple minifier for JSON to remove comments and whitespace
http://web.archive.org/web/20100629021329/http://blog.getify.com/2010/06/json-comments/
410 stars 102 forks source link

Javascript: non-regex version about 16x-20x faster on my machine #52

Closed cegfault closed 2 years ago

cegfault commented 4 years ago

Just discovered JSON.minify; I had written this function a couple years ago. It's a non-regex version that's about 16x-20x faster than your current version (based on tests on my machine).

Thought it might be worthwhile to share. The code is definitely a little ugly and not easy to read (I actually wrote this based on a C-version for a client, and that's pretty obvious looking at the code).

If you decide to use this you'll probably want to change some variable names and clean it up a bit, but the performance speaks for itself :)

getify commented 4 years ago

Does it pass all the tests?

cegfault commented 4 years ago

yes!

Sorry for the delay. Just saw this today. I had a small fix needed for test 3, but yes - now it is passing all tests.

getify commented 3 years ago

Sorry to ask after such a long delay, but... is this still valid to merge? Could you re-verify both the test passing and the performance benchmark? And can you post the performance tests/results in some way so they can be independently verified?

cegfault commented 3 years ago

@getify I just tested again and it's still passing all tests.

I don't have time at the moment to setup a formal performance benchmark, but you could definitely grab a large json file and pipe it through a few thousand times in both versions. Mine will definitely outperform the current version.

getify commented 2 years ago

We've just addressed a significant performance bug. As such, I'd like to re-validate whether the performance issue here is still as significant. Closing for now, but if performance benchmarks can highlight further issues to address, please re-open.

cegfault commented 2 years ago

@getify I don't see how to reopen this, but yes it absolutely should be reopened.

Try testing with Mullvad's wireguard.json file (github doesn't take .json filenames so I renamed to wireguard.txt), which I've attached here: wireguard.txt

You can test performance easily with:

fetch('wireguard.txt').then(x=>x.text()).then(x=>{
  console.log(+new Date)
  var res = JSON.minify(x);
  console.log(+new Date)
})

Your version: 4223ms (4.223 seconds) My version : 13ms (0.013 seconds)

My version is over 324 times faster than yours

cegfault commented 2 years ago

The problem has nothing to do with quadratic performance in your algorithm. It has to do with the performance of regular expressions.

An iterative loop (mine) will outperform regex (yours) every day of the week and twice on sunday. Regex is notoriously slow. So long as you are doing this:

while (tmp = tokenizer.exec(json)) {

... then my version will be many, many times faster than yours.

getify commented 2 years ago

I can't re-open since the repository you submitted it from has been deleted. But I am looking into the performance issues more closely. I appreciate the test file you provided. And I'm sorry we didn't address your concerns earlier. I wasn't willing to accept a PR with a full-rewrite that doesn't handle the ports to all the other languages. Also, we had no formal performance benchmarking, so the assertions of performance improvements were harder to verify. In any case, now that I have that file you provided, I can definitely see performance issues, and I can work on fixing them. And I am.


FWIW, on my machine, the current algorithm takes almost 5 seconds on your provide file, which is... BAD. But if I just comment-out the look-behind regex (the one we were trying to optimize) to prevent it from running, the process takes just 34 ms. That's 150x better, which is of course pretty significant. So, it's CLEARLY actually this look-behind regex that's the big performance culprit here, not just any regex usage.

I have some approaches to fix this. So I'm playing with which option is best.

cegfault commented 2 years ago

No worries. I'll still contend that an iterative loop is always going to be faster than a regex (and less memory-intensive), but concede that a non-lookback version of regex may be good enough for your usage.

I could probably port to the other languages, but I just don't have the time right now.

Good luck either way