getify / JSON.minify

Simple minifier for JSON to remove comments and whitespace
http://web.archive.org/web/20100629021329/http://blog.getify.com/2010/06/json-comments/

Algorithm is accidentally quadratic #64

Open yurikhan opened 2 years ago

yurikhan commented 2 years ago

Hello there!

So today I tried to push a 2.5 megabyte file to my team’s Git server and it said “the pre-receive hook timed out after 120 seconds”.

I asked the CI priests (don’t laugh, it’s their official job description) what the :fire: it could be doing, and they said “the only thing we do on JSON files is this” and supplied a small script involving the Python version of json_minify.

I ran it on my file. It finished in 76 minutes 12 seconds.

So I looked at the algorithm and found the cause.

As a proof of concept, I made this change:

         if val == '"' and not (in_multi or in_single):
-            escaped = end_slashes_re.search(string, 0, match.start())
+            pos = match.start()
+            while pos > 0 and string[pos - 1] == '\\':
+                pos -= 1

             # start of string or unescaped quote character to end string
-            if not in_string or (escaped is None or len(escaped.group()) % 2 == 0):  # noqa
+            if not in_string or ((match.start() - pos) % 2 == 0):  # noqa
                 in_string = not in_string
             index -= 1  # include " character in next catch

and it passed all existing tests and processed my 2.5M file in 0.697 seconds. It is definitely a lot less pythonic than the original way, but it sure does the job.

You might want to review all ports where this check is done via the same regexp, and verify which of them exhibit quadratic complexity.

There is nothing peculiar about my JSON file; just generate one that has a moderately large number of double quote characters.
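
For reference, a file like that can be generated with something along these lines (a rough sketch; the structure, names, and sizes are arbitrary):

import json

# Build a JSON document with lots of short string values, i.e. lots of
# double quote characters. Roughly 2-3 MB with these (arbitrary) sizes.
doc = [{"key_%d" % i: "value_%d" % i for i in range(10)} for _ in range(10000)]

with open("big.json", "w") as outfile:
    json.dump(doc, outfile, indent=2)

Running json_minify over the result should reproduce the slowdown.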

getify commented 2 years ago

This is a great point, thank you for finding it. Just for posterity, would you be willing to post a PR for the python code? I'm going to look at the JS code versions. But I'm not really familiar with the other language ports enough to do those reviews.

getify commented 2 years ago

I think this problem could exist in the JS version, here:

https://github.com/getify/JSON.minify/blob/0827b7fd0659d0f10135f4ade0307a625a45a82d/minify.json.js#L35

Although, the regex is rooted to the end with $, so I was assuming that the regex works backward from the end, rather than from the start.

getify commented 2 years ago

I also wonder if the current regex approach could work fine if we limit the "left hand context" string to a max length, say 250 characters. We could pick a large enough number that we know there wouldn't be a string of \ characters that long, but whatever that fixed number is, it wouldn't be anywhere near as long as the O(n) growth of the length of the string as it gets toward 2.5mb.
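
In the Python port, I guess that would look roughly like this (untested sketch; end_slashes_re, string, and match are the names already used there, and 250 is the arbitrary cap):

# Only run the backslash regex over (at most) the last 250 characters of
# left context instead of everything back to the start of the input.
window_start = max(0, match.start() - 250)
escaped = end_slashes_re.search(string, window_start, match.start())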

getify commented 2 years ago

Now I'm wondering if that regex might work better with + instead of *. Could you possibly check to see if that addresses the quadratic growth issue? I really thought I knew that $ rooted the regex to start its backtracking from the end rather than from the start.

yurikhan commented 2 years ago

Just for posterity, would you be willing to post a PR for the python code?

No problem. I don’t know how your release machinery works so I leave it to you.

the regex is rooted to the end with $, so I was assuming that the regex works backward from the end, rather than from the start.

For some regexen, backward matching might be possible, but greediness, backtracking, backreferences, UTF-8, and any combinations of the above make the general case complex, and I’m not surprised not all implementations bother with that.

Measure.

Now I'm wondering if that regex might work better with + instead of *. Could you possibly check to see if that addresses the quadratic growth issue?

Yes I could, and no it doesn’t.

getify commented 2 years ago

Thanks for checking/validating my questions about the RE.

It seems like maybe the least disruptive (least risky) fix is what I suggested, simply performing the regex check against a fixed chunk of the final X number of characters of the left-hand context... I think 100 characters is plenty large enough heuristically to ensure no non-pathological inputs would be negatively affected.

In JS, I can do:

tmp2 = lc.slice(-100).match(/(\\)*$/);

Could you check to see how that affects the performance, and how it compares to the loop approach you took?

yurikhan commented 2 years ago

In JS, I can do:

tmp2 = lc.slice(-100).match(/(\\)*$/);

Yes, it is slightly faster than doing the backward loop. However, this would make it fail on inputs that contain a string value containing 50 escaped backslashes followed by an escaped double quote.
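
For example (a constructed input, not from my real file), something like this defeats a 100-character window:

# 50 escaped backslashes plus the backslash escaping the quote: 101 backslash
# characters immediately precede the inner quote, so it is escaped (odd count).
# A window of only the last 100 characters sees nothing but backslashes,
# counts an even number, and misreads the escaped quote as the end of the string.
backslashes = "\\" * 100   # the 50 escaped backslashes in the JSON text
pathological = '{"key": "' + backslashes + '\\" tail"}'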

getify commented 2 years ago

However, this would make it fail on inputs that contain a string value containing 50 escaped backslashes followed by an escaped double quote.

Right, but if we pick a large enough fixed number (100, 250, 1000, whatever), we know we're not going to hit any real (non-pathological) such inputs. It's a performance-optimization heuristic.

What do you think is the smallest number that we're safe with there?

If we were really worried about that obscure breakage case, we could have a strategy of checking to see if the matched string is the same length as the fixed number (all filled with \s), and in that crazy corner case, we could try to match again with double the number of leftward characters.
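
Sketched out (untested, and written against the Python port's names only because that's where the patch under discussion lives; WINDOW is the arbitrary starting cap):

WINDOW = 100
window = WINDOW
while True:
    window_start = max(0, match.start() - window)
    escaped = end_slashes_re.search(string, window_start, match.start())
    run = len(escaped.group()) if escaped else 0
    # If the entire window was backslashes and we haven't reached the start of
    # the input, the run might extend further left: retry with a bigger window.
    if run == window and window_start > 0:
        window *= 2
    else:
        break
# then test run % 2 for the escaped/unescaped decision, as before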

BTW, I'm preferring the regex approach (with some heuristic) over switching to looping logic, not only because I think the regex engine can loop faster than user code can, but also because a regex is more "portable" across languages than looping syntax, which might make it easier to port this fix (and any future ports).

yurikhan commented 2 years ago

What do you think is the smallest number that we're safe with there?

I am strictly pro-correctness in this kind of issue. (I don’t even fully accept the premise of the tool — in my opinion, comments are invalid in JSON and if you want comments you should be using YAML at source and re-serialize it to JSON at build time.)

we could have a strategy of checking to see if the matched string is the same length as the fixed number (all filled with \s), and in that crazy corner case, we could try to match again with double the number of leftward characters.

I think that leads to a loop of exponentially increasing fixed numbers and harder-to-follow code.

because a regex is more "portable" across languages than looping syntax

Only across those that have a regular expression engine built in or readily available, and if you can persuade it to perform.

yurikhan commented 2 years ago

An elegant solution could be a reversed view on the left context, and forward-matching on that view. However, Python does not provide reversed string views, and doing a reverse string copy on each iteration is still quadratic.

getify commented 2 years ago

An elegant solution could be a reversed view on the left context

Yeah, I considered that option too, but I agree that it would probably kill us to make all those string copies to reverse. JS doesn't have a string reverse; you'd basically have to create an array and reverse it (which is in-place), so that would still be a full copy.

I don’t even fully accept the premise of the tool — in my opinion, comments are invalid in JSON and if you want comments you should be using YAML at source and re-serialize it to JSON at build time

I appreciate your difference of opinion here, and that despite doubting the validity of this tool, you're still helpfully engaging to improve it! Yay open source! :)

FWIW, I know a lot of people who use this tool not for removing comments but for removing whitespace (to essentially shrink down large JSON) before transmitting over the wire. That's why this tool does both comments and whitespace... it's really more a JSON minifier than a JSON-comment-remover.

I happen to think comments in JSON can be useful when JSON is used for config, and I think the benefits of keeping with JSON instead of YAML are worth the need to minify it. But, anyway, no reason to re-litigate that. I accept your opinion as valid even if I disagree.

I am strictly pro-correctness in this kind of issue

I appreciate that. However, I think we can come up with a pathologically bad (perf-wise) example that even an optimized "correct" solution will still explode on. For example, with your suggested looping-backward approach... if there were 2.5MB of consecutive \s, your loop would go all the way backward trying to do all that counting, and I think it might perform just about as poorly as (or worse than) the regex engine trying to do all that back-tracking.

2.5MB worth of consecutive \s is exceedingly unlikely. But how much more likely is a string of 1000 or 5000 \s? IMO, any more than about 50 consecutive \s is probably about as likely as 50,000,000 such consecutive occurrences. My point is, a "strictly correct" solution still has performance gotchas from pathological inputs, even if we optimize it to improve the "regular" case like the one that bit you.

We can and should optimize here. The question is, which cases do we want to care about, and which ones do we not care about? I don't see a way, without a significantly different algorithmic approach here, to not have some performance corner cases.

I think that leads to a loop of exponentially increasing fixed numbers and harder-to-follow code.

In some ways, I agree. I don't really want the code to include that option. But it would be a way to introduce a stepped-performance solution that still achieves "strict correctness".

Only across those that have a regular expression engine built in or readily available, and if you can persuade it to perform.

That's true, but if we have an environment which can't rely on any sort of reasonable regex performance, I doubt they're going to have any other tool option or algorithm that's going to magically outperform this one to a significant degree.


Here's an observation: the regex only needs to operate on the part of the left-context string that goes back to where the previous match was made.

In the JS version of this code, that index is accessible via .lastIndex on the RE, as seen here:

https://github.com/getify/JSON.minify/blob/0827b7fd0659d0f10135f4ade0307a625a45a82d/minify.json.js#L32

In the Python version of the code, IIUC, I think your PR is getting that index via match.start(), as seen here:

https://github.com/getify/JSON.minify/pull/65/commits/631e8b72ea75e9c45c29b16827bedde2a0843cb9#diff-ba6b740aa2c8ab3ba0b1c45306c8f2c9f0eae389c2adbddd005cc54bc0a4592aR46

Actually, we need to track/keep the PREVIOUS value of that, so we know what chunk of the text has just been skipped over from the previous match to this match. So, what if we just limit the RE to only run against that most recent chunk, rather than the whole string back to the absolute start?

In JS:

prevLastIndex = from;
from = tokenizer.lastIndex;

// ..

tmp2 = lc.substring(prevLastIndex).match(/(\\)*$/)

In the general case, I think this will limit the RE to running only against a small portion of the overall string.

The pathological case here is of course if there's a huge long string where no matches have been made for thousands or millions of characters. So we still have the performance downside to consider.

And again, I say that we could cap the distance backward that the substring captures to a max of a few thousand characters and that would be a really good/safe tradeoff.

But I'm not fully decided here. Just exploring "out loud". Happy to hear more input.

yurikhan commented 2 years ago

FWIW, I know a lot of people who use this tool not for removing comments but for removing whitespace (to essentially shrink down large JSON) before transmitting over the wire. That's why this tool does both comments and whitespace... it's really more a JSON minifier than a JSON-comment-remover.

The problem of minifying valid JSON can already be solved by just parsing and re-serializing, e.g.:

import json
import sys

with open(sys.argv[1]) as infile:
    content = infile.read()
print(json.dumps(json.loads(content), ensure_ascii=False, separators=(',', ':')))

or just by running jq as jq -c . filename. As an additional benefit, this unescapes all \uXXXX escapes into their UTF-8 representations.

I am strictly pro-correctness in this kind of issue

I appreciate that. However, I think we can come up with a pathologically bad (perf-wise) example that even an optimized "correct" solution will still explode on. For example, with your suggested looping-backward approach... if there were 2.5MB of consecutive \s, your loop would go all the way backward trying to do all that counting, and I think it might perform just about as poorly as (or worse than) the regex engine trying to do all that back-tracking.

No, it would only do that once, for the potential quote delimiter those backslashes immediately precede. On the next quote, the scan will not reach them.


Here's an observation: the regex only needs to operate on the part of the left-context string that goes back to where the previous match was made.

Hey, this is exactly it.

Actually, we need to track/keep the PREVIOUS value of that, so we know what chunk of the text has just been skipped over from the previous match to this match. So, what if we just limit the RE to only run against that most recent chunk, rather than the whole string back to the absolute start?

In JS:

prevLastIndex = from;
from = tokenizer.lastIndex;

// ..

tmp2 = lc.substring(prevLastIndex).match(/(\\)*$/)

You lose some performance by copying the substring here, but you win by not matching repeatedly over the ever-growing left context.

In Python, the match function accepts bounds for the original string, so we don’t even have to copy it.
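
That is, roughly (sketch; prev_end is a made-up name for the tracked end position of the previous match):

# Bound the search to the chunk since the previous match; no substring copy.
escaped = end_slashes_re.search(string, prev_end, match.start())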

The pathological case here is of course if there's a huge long string where no matches have been made for thousands or millions of characters. So we still have the performance downside to consider.

No, it’s not pathological. You do have to process that chunk, but you only have to process it once. With the last match tracking, every character in the string is processed exactly once, making the whole algorithm linear again.

getify commented 2 years ago

Did you benchmark this "fix" against your JSON file? How does that compare to your original looping backward approach?

getify commented 2 years ago

In Python, the match function accepts bounds for the original string, so we don’t even have to copy it.

Actually, good point, JS has a version of that: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp/lastIndex

We can set the lastIndex property of the RE to tell it to start its matching from that point. So I can avoid the string copying as well:

prevLastIndex = from;
from = tokenizer.lastIndex;

// ..

lc.lastIndex = prevLastIndex;
tmp2 = lc.match(/(\\)*$/);

yurikhan commented 2 years ago

Did you benchmark this "fix" against your JSON file? How does that compare to your original looping backward approach?

#66 (regexp from last match): 32 seconds per 100 executions

#65 (direct backward scanning): 22 seconds per 100 executions

Both are a marked improvement over the original 76 minutes per single execution. I’m okay with either one.

getify commented 2 years ago

Thanks so much for your contributions here, @yurikhan. Very helpful!

This seems like the best solution we've come up with, and hopefully the most portable. Some language ports may be "limited" in their regex features and require, for example, the string copying. But I think all implementations will be able to track the previous match index to limit the left-context length, thereby (as you've pointed out) keeping the whole algorithm essentially linear.

I'm going to put in the fix for the JS, and approve your PR for the python. Not sure what to do with the other ports, but hopefully the maintainers of those branches are paying attention here and will pick up and apply the fixes to their own languages. If I don't see those addressed within the next few months, I will try my hand at adjusting those ports (though I won't really have as much language expertise or even ability to test).

getify commented 2 years ago

@yurikhan FYI: https://github.com/getify/JSON.minify/blob/a5ea0ba32d61b3a27fc0b27d37f32bfdc21c1687/package.json#L16-L19 :) Thanks!

getify commented 2 years ago

OK, I posted candidate PRs for the existing implementations. I think they're correct (except for the Perl one, which I know isn't yet correct).

But I need folks familiar with PHP, Perl, Objective-C, and Java to review these four PRs and validate (run the tests!!!!), complete, or fix any problems I've created.

If anyone listening can help review these PRs, or knows someone who can, please do!

getify commented 2 years ago

Looking more closely at the performance characteristics, and benchmarking against an actual (large'ish) test file, per https://github.com/getify/JSON.minify/pull/52#issuecomment-996742477

It seems that the JS (and thus Node) version still has a performance bug, so... more work to do here.

getify commented 2 years ago

Unfortunately, the lastIndex trick JS uses to start a regex somewhere other than the beginning of the string doesn't work if the string changes between matches, which it does in this case. So that performance fix wasn't actually doing anything for the JS/Node versions here. So I've pivoted to using the substring-copying approach, and it's a massive speedup on the test file I mentioned above (from 5 seconds down to 34 ms!!!).

@yurikhan I also tweaked the regexes to use + instead of * which I think makes them a tiny bit more efficient (see the commits cross-referenced above). I recommend making those little regex tweaks to the python branch, just for good measure. I'm also going to update the Java/Perl/PHP/ObjC branches accordingly.

yurikhan commented 2 years ago

@yurikhan I also tweaked the regexes to use + instead of * which I think makes them a tiny bit more efficient (see the commits cross-referenced above). I recommend making those little regex tweaks to the python branch, just for good measure. I'm also going to update the Java/Perl/PHP/ObjC branches accordingly.

I did not really sign up for Python port maintainership, and I consider that my earlier PR satisfactorily fixes the original bug. If you want to tweak the regexen to improve the performance slightly further, it might make sense to do that in all ports, and it should probably be a separate issue.

(I also wonder why you are matching an arbitrary sequence of backslashes and then separately testing the match length for being even or odd. Would it be better or worse to check specifically for an odd-length sequence, like [^\\]\\(\\\\)*$?)
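
For illustration, a quick comparison of the two checks on a few made-up left contexts (the helper names are invented; this is not the actual minifier code):

import re

even_re = re.compile(r'(\\)*$')          # current: grab the trailing run, test parity
odd_re = re.compile(r'[^\\]\\(\\\\)*$')  # suggested: match an odd-length run directly

def escaped_by_parity(lc):
    m = even_re.search(lc)
    return m is not None and len(m.group()) % 2 == 1

def escaped_by_odd_run(lc):
    return odd_re.search(lc) is not None

for lc in ['foo', 'foo\\', 'foo\\\\', 'foo\\\\\\', '\\\\\\']:
    print(repr(lc), escaped_by_parity(lc), escaped_by_odd_run(lc))

The two agree except when the left context consists of nothing but backslashes (the last case), since [^\\] needs a preceding non-backslash character.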