Open StringManolo opened 3 years ago
I would guess v8 is optimizing the code and executing it at compile-time (the same program with a console.log is similarly fast). Quickjs runs it on my machine in 1.071 seconds. An equivalent ruby program is executed in 1.209 seconds, so I think it's more a testament to how good v8 is than quickjs being slow - execution speeds are comparable to other non-jit interpreters.
So this is kind of interesting. At first I thought this is because QuickJS lacks the unique-reference in-place concatenation optimisation. As it turns out, not quite: the optimisation is already there, but a number of implementation flaws (and some design flaws) in QuickJS prevent it from being actually useful, at least in interpreted code.
The problems are as follows:
No opcode for in-place concatenation of lexically-scoped variables
QuickJS compiles JavaScript functions into bytecode, which is then interpreted by a stack-based virtual machine. The central performance bottleneck is obviously result += str1 + str2;
. This is the unoptimised bytecode this statement compiles to:
;; result += str1 + str2;
42: get_loc_check 0: result
get_var str1
get_var str2
add
add
dup
put_loc_check 0: result
drop
The get_loc_check
opcode puts the value of a lexically-scoped variable on the stack, while put_loc_check
removes a value from the stack and puts it in a lexically-scoped variable. The equivalents for function-scoped (var
) variables are get_loc
and put_loc
. Values are generally reference-counted (with cycle detection): putting a value on the stack increments its reference count, removing it decrements it. As such, the string cannot be concatenated in-place, because there are two references to it: one from the stack, the other from the local variable. We would like to have an opcode which can update the variable in place, without incrementing the reference count as an intermediate step. Such an opcode exists for var
variables: its name is add_loc
. There is no equivalent add_loc_check
for let
variables.
Simplistic optimiser performs peephole analysis only
We can avoid the above flaw by declaring result
with the var
keyword. But all this does is change get_loc_check
and put_loc_check
into get_loc
and put_loc
. An optimisation step is necessary to transform the opcode stream to use add_loc
.
QuickJS does, in fact, have an optimiser. It is able to simplify the dup
/ put_loc
/ drop
sequence into just a put_loc
. To be able to use add_loc
, in full generality it ought to match on get_loc
(X) / {Y} / add
/ put_loc
(X) and transform it to {Y} / add_loc
(X), for (Y) an arbitrary sequence of opcodes that touches neither the local variable (X) itself nor the stack slot in which its value is put by get_loc
. The optimiser is nowhere near that sophisticated; it can only perform bounded-lookahead pattern matching against the opcode stream. There is no data flow analysis, just dumb substitution of fixed opcode patterns.
We can avoid this flaw by putting the concatenation of str1
and str2
in an intermediate variable. Adding a variable to another variable generates an opcode sequence get_loc
(X) / get_loc
(Y) / add
/ put_loc
(X) that QuickJS optimises into get_loc
(Y) / add_loc
(X):
;; var tmp = str1 + str2;
get_var str1
get_var str2
add
put_loc2 2: tmp
;; result += tmp;
get_loc2 2: tmp
add_loc 0: result
Spurious internal reference-count increments
Using the add_loc
opcode avoids putting the string on the stack, so you would think that it would avoid incrementing the reference count. Except that the opcode’s implementation internally increments the reference count anyway:
op1 = JS_ConcatString(ctx, JS_DupValue(ctx, *pv), op1);
if (JS_IsException(op1))
goto exception;
set_value(ctx, pv, op1);
While the call to JS_ConcatString
is active, there are two references to the string: one from the temporary (created by JS_DupValue
), the other from the variable. JS_ConcatString
drops the former, set_value
drops the latter.
We can address this with a kludge:
op1 = JS_ConcatString(ctx, *pv, op1);
if (JS_IsException(op1))
goto exception;
*pv = op1;
Kind of ugly, but it gets the job done. Now the reference count stays at 1 the whole time.
No memory allocation amortisation
Now the punchline: even if the reference count is 1, this may still not be enough to reuse the allocation of one of the strings.
str = js_malloc_rt(rt, sizeof(JSString) + (max_len << is_wide_char) + 1 - is_wide_char);
When allocating memory for strings, QuickJS requests no more space than required to store the string. The allocator is allowed to provide more storage; when QuickJS detects this, the allocation is reused. But far too often, there simply isn’t enough room, and the opportunity never arises. This triggers the slow path where a new string is allocated and the contents are copied manually: not even js_realloc
is used to reuse the old object’s data.
By adding js_realloc
on top of the above patches, I managed to get execution time to drop on my machine from nearly 6 seconds to about 300 ms. (Your performance may vary: I compile QuickJS either with TinyCC or with ASan and UBSan enabled, either of which may slow things down.)
Good analysis. Our conclusion at the time was that it was not worth continuing on this track because there will still be corner cases which will give a large slowdown. Instead, a rope based string representation may be a simpler solution.
On 8/11/21 1:40 PM, Felix S. wrote:
So this is kind of interesting. At first I thought this is because QuickJS lacks the unique-reference in-place concatenation optimisation https://blog.ganssle.io/articles/2019/11/string-concat.html. As it turns out, not quite: the optimisation is already there, but a number of implementation flaws (and some design flaws) in QuickJS prevent it from being actually useful, at least in interpreted code.
The problems are as follows:
*
*No opcode for in-place concatenation of lexically-scoped variables* QuickJS compiles JavaScript functions into bytecode, which is then interpreted by a stack-based virtual machine. The central performance bottleneck is obviously |result += str1 + str2;|. This is the unoptimised bytecode this statement compiles to: |;; result += str1 + str2; 42: get_loc_check 0: result get_var str1 get_var str2 add add dup put_loc_check 0: result drop | The |get_loc_check| opcode puts the value of a lexically-scoped variable on the stack, while |put_loc_check| removes a value from the stack and puts it in a lexically-scoped variable. The equivalents for function-scoped (|var|) variables are |get_loc| and |put_loc|. Values are generally reference-counted (with cycle detection): putting a value on the stack increments its reference count, removing it decrements it. As such, the string cannot be concatenated in-place, because there are two references to it: one from the stack, the other from the local variable. We would like to have an opcode which can update the variable in place, without incrementing the reference count as an intermediate step. Such an opcode exists for |var| variables: its name is |add_loc|. There is no equivalent |add_loc_check| for |let| variables.
*
*Simplistic optimiser performs peephole analysis only* We can avoid the above flaw by declaring |result| with the |var| keyword. But all this does is change |get_loc_check| and |put_loc_check| into |get_loc| and |put_loc|. An optimisation step is necessary to transform the opcode stream to use |add_loc|. QuickJS does, in fact, have an optimiser. It is able to simplify the |dup| / |put_loc| / |drop| sequence into just a |put_loc|. To be able to use |add_loc|, in full generality it ought to match on |get_loc| (X) / {Y} / |add| / |put_loc| (X) and transform it to {Y} / |add_loc| (X), for (Y) an arbitrary sequence of opcodes that touches neither the local variable (X) itself nor the stack slot in which its value is put by |get_loc|. The optimiser is nowhere near that sophisticated; it can only perform bounded-lookahead pattern matching against the opcode stream. There is no data flow analysis, just dumb substitution of fixed opcode patterns. We can avoid this flaw by putting the concatenation of |str1| and |str2| in an intermediate variable. Adding a variable to another variable generates an opcode sequence |get_loc| (X) / |get_loc| (Y) / |add| / |put_loc| (X) that QuickJS optimises into |get_loc| (Y) / |add_loc| (X): |;; var tmp = str1 + str2; get_var str1 get_var str2 add put_loc2 2: tmp ;; result += tmp; get_loc2 2: tmp add_loc 0: result |
*
*Spurious /internal/ reference-count increments* Using the |add_loc| opcode avoids putting the string on the stack, so you would think that it would avoid incrementing the reference count. Except that the opcode’s implementation internally increments the reference count anyway: op1 = JS_ConcatString(ctx, JS_DupValue(ctx, *pv), op1); if (JS_IsException(op1)) goto exception; set_value(ctx, pv, op1); While the call to |JS_ConcatString| is active, there are two references to the string: one from the temporary (created by |JS_DupValue|), the other from the variable. |JS_ConcatString| drops the former, |set_value| drops the latter. We can address this with a kludge: op1 = JS_ConcatString(ctx, *pv, op1); if (JS_IsException(op1)) goto exception; *pv = op1; Kind of ugly, but it gets the job done. Now the reference count stays at 1 the whole time.
*
*No memory allocation amortisation* Now the punchline: even if the reference count is 1, this may still not be enough to reuse the allocation of one of the strings. str = js_malloc_rt(rt,sizeof(JSString) + (max_len << is_wide_char) +1 - is_wide_char); When allocating memory for strings, QuickJS requests no more space than required to store the string. The allocator is allowed to provide more storage; when QuickJS detects this, the allocation is reused. But far too often, there simply isn’t enough room, and the opportunity never arises. This triggers the slow path where a new string is allocated and the contents are copied manually: not even |js_realloc| is used to reuse the old object’s data. By adding |js_realloc| on top of the above patches, I managed to get execution time to drop on my machine from nearly 6 seconds to about 300 ms. (Your performance may vary: I compile QuickJS either with TinyCC or with ASan and UBSan enabled, either of which may slow things down.)
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/bellard/quickjs/issues/67#issuecomment-896754687, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABRQQIERDTW54YJMARWGZELT4JOUNANCNFSM45SZ7GCA. Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&utm_campaign=notification-email.
I've been in awe at the excellent performance of quickjs, I really appreciate when performance emerges from simplicity and lightness, as opposed to complex layers upon layers of JIT and runtime analysis.
The only issue I'm running into is the fact that a good number of JS programmers assume that string concatenation is optimized. An example of this is the official Pug parser. A 1,000-line Pug file takes over 55 seconds to parse in quickjs versus ~100 ms in Node/V8.
Consider this a "+1" for this issue, I suppose! I love quickjs and I'd love to be able to trust it will work on any codebase, but until someone has time to implement the rope-based string representation I'll have to restrict my usage to the programs that won't exhibit problematic slowdowns. Thanks again for quickjs!
Doing tests I found that using +=
individually is faster see changes bellow and output:
/usr/bin/time qjs joinStrings.js
Config:
- Running 30000 times
Results:
strPlusStr -> 3732ms
strPlusStr2 -> 1689ms
strToArrayPush -> 8ms
strPush -> 5ms
strConcatStr -> 8ms
2.41user 3.03system 0:05.44elapsed 99%CPU (0avgtext+0avgdata 6880maxresident)k
0inputs+0outputs (0major+2113473minor)pagefaults 0swaps
joinStrings.js:
const str1 = "abc";
const str2 = "defghijklmnñopqrstuvwxyz";
const strPlusStr = times => {
let str = "";
for(let i = 0; i < times; ++i) {
str += str1 + str2;
}
return str;
}
const strPlusStr2 = times => {
let str = "";
for(let i = 0; i < times; ++i) {
str += str1;
str += str2;
}
return str;
}
const strToArrayPush = times => {
let str = [];
for(let i = 0; i < times; ++i) {
str.push(str1);
str.push(str2);
}
return str.join("");
}
const strPush = times => {
let str = [];
for(let i = 0; i < times; ++i) {
str.push(str1, str2);
}
return str.join("");
}
const strConcatStr = times => {
let str = [];;
for (let i = 0; i < times; ++i) {
str.push(str1.concat(str2));
}
return str.join("");
}
const _ = console.log;
//const _ = alert; // For Android Browser Testing
const test = times => {
let time;
let results = [];
time = new Date();
strPlusStr(times);
results.push(new Date() - time);
time = new Date();
strPlusStr2(times);
results.push(new Date() - time);
time = new Date();
strToArrayPush(times);
results.push(new Date() - time);
time = new Date();
strPush(times);
results.push(new Date() - time);
time = new Date();
strConcatStr(times);
results.push(new Date() - time);
_(`
Config:
- Running ${times} times
Results:
strPlusStr -> ${results[0]}ms
strPlusStr2 -> ${results[1]}ms
strToArrayPush -> ${results[2]}ms
strPush -> ${results[3]}ms
strConcatStr -> ${results[4]}ms
`);
}
/*
test(1)
test(10)
test(100)
test(1000)
test(10000)
test(100000) */
//test(1000000)
test(30000); // qjs bug on strPlusStr function
/* > qjs joinStrings.js
Config:
- Running 30000 times
Results:
strPlusStr -> 8100ms HERE
strToArrayPush -> 19ms
strConcatStr -> 24ms
> node joinStrings.js
Config:
- Running 30000 times
Results:
strPlusStr -> 13ms THIS IS ALL GOOD
strToArrayPush -> 14ms
strConcatStr -> 13ms
*/
Hello @fstirlitz I didn't saw your changes in your fork at https://github.com/fstirlitz/quickjs , could you please attach or copy and paste your diff/patch
here ?
@bellard was it fixed in 2024.Jan release?
Tested with hermes engine. They do optimize +=
on string. However like @bellard said, there're corner cases +=
optimization won't work. For example, changing str = str + str1
to str = str1 + str
(a rarely seen case in a loop I think), the hermes engine took 16778ms to complete vs qjs took 2407ms on my PC.
I agree the rope based string approach may be a correct direction for general string optimization.
The optimization is undoubtedly required, for example:
let longText;
...
// To append a newline to the end of long text, we have to duplicate the whole text.
// The time complexity is O(n) for appending one character, which is unacceptable!
longText += '\n';
// The below example takes 66531ms on my PC!
// Hermes finishs in 142ms
// And I think quickjs will behave better with similar optimization
let str = '';
for (let i = 0; i < 1000_000; i++) {
str += 'a';
}
But do we need a general string optimization and does that perform better in practice?
Here're my thoughts:
str = str + str1
or str += str1
is a more commonly seen case. Now that string concat is used in almost everywhere, I think it would make a great improvement to the performance even we just optimize for this particular case.Locality Principle
(memory cache).So in my opinions:
WRITE Rope based string is a general optimization approach for writing operations. But appending a value to another string's tail is more commonly seen. READ Linear memory layout performs better for read operations.
I just learn that V8 doesn't adopt rope string too. Maybe we can implement similar optimization on quickjs too.
Here're my brief thoughts.
Four string formats in total.
Let's discuss the operations on these formats:
Type 1 Read: It is straight. Concat:
Type 3
otherwise.Slice: Copy the slice range to new string.
Type 2 Read: It is straight. Concat:
Type 1
string. Type 3
otherwise.Slice:
Type 4
otherwise.Type 3
Read: Estimate the cost to read. When it's cheaper to reform to Type 2
then do it. For example:
Concat: Remain Type 3
.
Slice: New string with Type 4
.
Type 4
Read: Do read on its reference(may cause its reference to reform but no change to itself).
Concat: Reform to Type 3
.
Slice: New string with Type 4
.
Here're my brief thoughts.
- For tiny string, inline store to make use of memory cache.
- For large string, use heap.
- For string concats, use concat list to store references.
- For string slices, use slice reference.
Thank you for your summary. Do string slices really deserve optimizing? I am not sure real world code would benefit from such an optimisation, adding more string types is somewhat painful and error prone.
Thanks for reply. I think slice does have some weight in string operations. For example, split
and regexp
. I think these two are commonly used.
I think you are right about it may introduce more difficulty and complexity in code implementation. For example, we need to consider restricting the depth of references in both concat link and slice.
But I think, they can be done progressively. We can do it simpler at first. For example, before reading u.str8
and u.str16
of JSString
struct, we do a preread
operation, which is responsible to turn concat link or slice into normal string so we can directly access from u.str8
and u.str16
. For Inline String
or Heap String
, we can use the pointer directly. In this way, we don't introduce too many changes in read process. As for the write
process, we can do it in both the old way and the new way at first. If we do it in the old way, no changes are needed. If we do it in the new way, preread
will make it right for reading. Anyway, it is progressive.
If simple optimizations are not good enough, we can progressively introduce more deeper optimizations.
Actually, I'm implementing them on my own when I have free time. It's a rough implementation, but I will update the benchmark comparison here after I finish. We will see if it is worthy.
Using concat method or array push this operation takes around 0.015 - 0.030 seconds.
If you want to test, you can run next command:
Notice qjs takes a few seconds to run the code.