ufo5260987423 opened this issue 2 weeks ago
Please share your numbers in this thread.
I have a hunch that Python has a special representation for strings that consist of ASCII characters only. The strings in your benchmark contain only ASCII characters. Try timing the following and compare it to your original Python timings.
If so, you can try the corresponding benchmark in Chez Scheme using byte strings.
s = "abcdef"

def grow():
    # print(".")
    global s
    s = "ααα" + s + "ααα" + s + "ααα"
    half_length = len(s) // 2
    s = s[half_length:] + s[:half_length]
    return s

def trial(n):
    # print("x")
    global s
    i = 0
    while len(s) <= n:
        i += 1
        grow()
    return len(s)

def myTry(n):
    global s
    i = 0
    while i < 10:
        s = "αααααα"
        trial(n)
        i += 1
    return len(s)
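The modified benchmark above mixes in α to force a wider representation. The hunch is right for CPython: PEP 393 gives strings a flexible representation of 1, 2, or 4 bytes per character, chosen by the widest character present, so an all-ASCII string occupies roughly a quarter of the memory of a same-length string of astral-plane characters. A quick check (sizes are CPython-specific):

```python
import sys

# CPython (PEP 393) picks 1, 2, or 4 bytes per character depending on
# the widest character in the string.
n = 1000
ascii_s  = "a" * n     # all ASCII: 1 byte per character
bmp_s    = "α" * n     # U+03B1: 2 bytes per character
astral_s = "😀" * n    # U+1F600: 4 bytes per character

for name, s in [("ascii", ascii_s), ("bmp", bmp_s), ("astral", astral_s)]:
    print(name, len(s), sys.getsizeof(s))
```

All three strings have length 1000, but their reported sizes differ by roughly the per-character byte width.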
Chez Scheme strings are stored as full 32-bit Unicode code points. Much of what you are seeing is the cost of allocating and GCing four times more memory than an implementation that uses UTF-8-encoded strings. The tradeoff is that Chez Scheme offers constant-time access by character offset. Your benchmark has many more operations that depend on the total size of the string than on finding a character offset.
Well, I don't think full 32-bit Unicode code points can explain all the differences, because in my benchmark Chez Scheme takes more than 4 times as long as the other counterparts.
Strings are immutable in both JavaScript and Python, so "substring" does not need to result in a copy of the relevant portion of the string. Since the inner loop in your benchmark has two substrings in it, that could be another significant fraction of the difference. (I have no idea whether substring is "almost zero" allocation in the JavaScript and Python implementations; you'll have to figure that out if you want to know whether that contributes to the difference.)
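For CPython, at least, the question is easy to answer with tracemalloc: taking a large slice allocates a fresh string, i.e. slicing copies (PyPy and the JavaScript engines may behave differently). A rough sketch:

```python
import tracemalloc

s = "x" * 1_000_000             # ~1 MB of ASCII characters

tracemalloc.start()
t = s[1:]                       # take a large substring
size, _peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

# In CPython the slice shows up as a fresh ~1 MB allocation:
# str slicing copies (only the degenerate s[:] returns the same object).
print(size, len(t))
```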
That said, the implementation of substring would be faster (on my machine) if it used string-copy! internally (which eventually bottoms out in memcpy provided by the C runtime, which one would assume is highly optimized) rather than copying the characters itself. Although most of that time difference looks like it's due to less time spent in the GC, which is a little confusing to me, since I don't see where anything in that loop would be allocating something to trigger the collector.
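The gap between a character-at-a-time loop and a single bulk copy is easy to reproduce in any language. Here is a Python analogy (not Chez code; the function names are made up) comparing the two strategies:

```python
import timeit

src = "x" * 100_000

def char_by_char():
    # mimics the original do-loop in substring: one character at a time
    out = []
    for ch in src:
        out.append(ch)
    return "".join(out)

def bulk_copy():
    # a single slice, which CPython performs as one memcpy-style copy
    return src[:-1]

t_loop = timeit.timeit(char_by_char, number=50)
t_bulk = timeit.timeit(bulk_copy, number=50)
print(f"loop: {t_loop:.4f}s  bulk: {t_bulk:.4f}s")   # bulk is far faster
```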
I just rewrote the code, replacing substring with string-copy!; however, it doesn't change much. (Maybe because of my poor programming.)
So, would it be OK if you shared your implementation?
I meant changing the implementation of substring.
diff --git a/s/5_4.ss b/s/5_4.ss
index 8f5ab069..fe1a4406 100644
--- a/s/5_4.ss
+++ b/s/5_4.ss
@@ -32,10 +32,10 @@
($oops 'substring
"~s and ~s are not valid start/end indices for ~s"
m n s1))
- (let ([s2 ($make-uninitialized-string (fx- n m))])
- (do ([j 0 (fx+ j 1)] [i m (fx+ i 1)])
- ((fx= i n) s2)
- (string-set! s2 j (string-ref s1 i)))))))
+ (let* ([len (fx- n m)]
+ [s2 ($make-uninitialized-string len)])
+ (string-copy! s1 m s2 0 len)
+ s2))))
(let ()
(define do-string-append2
It seems reasonable. I'll have a try and see what happens with my benchmark. If it really works, I'll respond in this issue.
Sadly, it doesn't work. Updated version:
time scheme --script ./src/string/string.scm
real 0m2.182s
user 0m1.007s
sys 0m1.168s
Original version:
time scheme --script ./src/string/string.scm
real 0m2.023s
user 0m1.006s
sys 0m1.009s
Interestingly, I saw a noticeable improvement using @jltaylor-us's patch. I also wrapped the call to (my-try 5000000) in time to have Chez Scheme handle the profiling instead of using the external time command. This avoids polluting the profiling data with start-up time:
# vanilla chez-scheme:
14:37:00 $ scheme --script ../various-program-languages-benchmark/src/string/string.scm
(time (my-try 5000000))
79 collections
4.382938192s elapsed cpu time, including 2.551151570s collecting
4.383891732s elapsed real time, including 2.552344759s collecting
2013317072 bytes allocated, including 1914810496 bytes reclaimed
# using jtaylor's patch:
15:11:08 $ ./ta6le/bin/ta6le/scheme -b ./ta6le/boot/ta6le/scheme.boot --script ../various-program-languages-benchmark/src/string/string.scm
(time (my-try 5000000))
2 collections
2.622198520s elapsed cpu time, including 0.431483824s collecting
2.622673218s elapsed real time, including 0.431739396s collecting
2013266688 bytes allocated, including 1260431552 bytes reclaimed
Thanks @gwatt, you're right! @jltaylor-us's patch works!
time ta6le/bin/ta6le/scheme --script ../various-program-language-benchmark/src/string/string.scm
real 0m1.432s
user 0m0.534s
sys 0m0.888s
And my fault: I did something wrong in the last test. I'll give a full benchmark performance report here.
And @jltaylor-us, it's your smart work; will you make a PR contributing your patch? My scheme-langserver has really been bothered by string performance recently...
NOTE: scheme-local-chezscheme is the build with the patch; its details are in scheme-local-chezscheme.txt. For the other counterparts, please refer to various-program-languages-benchmark's output directory.
Here are the results:
It's really interesting and useful to see a comparison of different dynamic languages.
@ufo5260987423 It might be better to display the machine information in your benchmark results.
I ran the string benchmark without @jltaylor-us's patch, using Chez Scheme Version 10.0.0:
> (time (my-try 5000000))
(time (my-try 5000000))
79 collections
0.792087848s elapsed cpu time, including 0.358229578s collecting
0.793460921s elapsed real time, including 0.358927788s collecting
2013317072 bytes allocated, including 1915282208 bytes reclaimed
8388598
> (time (my-try 5000000))
(time (my-try 5000000))
79 collections
0.738504446s elapsed cpu time, including 0.314506259s collecting
0.739721741s elapsed real time, including 0.315044692s collecting
2013316944 bytes allocated, including 2013463184 bytes reclaimed
8388598
> (time (my-try 5000000))
(time (my-try 5000000))
79 collections
0.768487715s elapsed cpu time, including 0.339881611s collecting
0.769826016s elapsed real time, including 0.340541420s collecting
2013316944 bytes allocated, including 2013463184 bytes reclaimed
8388598
Using time in my fish shell:
> time scheme --script string.scm
________________________________________________________
Executed in 848.69 millis fish external
usr time 352.20 millis 262.00 micros 351.94 millis
sys time 494.87 millis 34.00 micros 494.84 millis
My CPU is an i9-12900HK; perhaps this contributes to the small numbers?
@maoif My machine is a ThinkPad T480s. You can find details in the reports.
Linux ufo-t480s 6.6.44 #1-NixOS SMP PREEMPT_DYNAMIC Sat Aug 3 06:54:42 UTC 2024 x86_64 GNU/Linux
And @jltaylor-us, it's your smart work; will you make a PR contributing your patch? My scheme-langserver has really been bothered by string performance recently...
Unfortunately, it actually makes things worse on small strings. (On my machine, anyway)
There is likely some small-string size threshold for switching between the two implementations, but I think we need more data (preferably across a variety of platforms) to make sure we're not optimizing a synthetic benchmark workload at the expense of more common real-world performance.
I'd be curious to know why the default substring implementation causes 79 collections while the string-copy!-based implementation causes 2 collections. Subtracting out the time spent in collections brings their runtimes to approximately the same.
Unfortunately, it actually makes things worse on small strings. (On my machine, anyway)
How small? Maybe I can add a new benchmark.
I'd be curious to know why the default substring implementation causes 79 collections while the string-copy!-based implementation causes 2 collections. Subtracting out the time spent in collections brings their runtimes to approximately the same.
I'm pretty curious about that myself. My current theory is that the implementation with the larger number of Scheme "instructions" causes the collect-request handler to fire more often (approximately when it should, based on collect-trip-bytes), whereas the version where iterating through the length of the (sub)string is "hidden" behind a single primitive accumulates quite a bit more than collect-trip-bytes before the GC is actually triggered. This seems plausible, but I haven't managed to write examples that I think illustrate this with 100% certainty, and I'm rapidly losing interest for the evening.
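That theory can be illustrated with a toy model. The sketch below is an assumption about the mechanism, not Chez's actual collector: a collection is only requested at safe points, so code that allocates in many small steps trips the threshold close to on time, while code that allocates one big chunk per safe point overshoots it and collects far less often for the same total bytes:

```python
# Toy model of the collect-trip-bytes theory (hypothetical, not Chez's GC).
COLLECT_TRIP_BYTES = 1_000

def run(alloc_per_safepoint, total_bytes):
    """Allocate total_bytes in chunks, checking the GC trigger only at
    safe points; return how many collections fire."""
    since_gc = 0
    collections = 0
    allocated = 0
    while allocated < total_bytes:
        allocated += alloc_per_safepoint
        since_gc += alloc_per_safepoint
        if since_gc >= COLLECT_TRIP_BYTES:   # safe point: check trigger
            collections += 1
            since_gc = 0
    return collections

# Same total allocation, very different collection counts:
print(run(10, 100_000))      # many small steps -> 100 collections
print(run(5_000, 100_000))   # few big steps    -> 20 collections
```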
Chez Scheme strings are stored as full 32-bit Unicode code points. Much of what you are seeing is the cost of allocating and GCing four times more memory than an implementation that uses UTF-8-encoded strings. The tradeoff is that Chez Scheme offers constant-time access by character offset.
Swift switched to UTF-8 strings while preserving amortized constant-time access to characters by storing "breadcrumbs" to convert offsets. There is a blog post and a forum post with even-more-internal details. Here's the most relevant part of the blog post:
As we've seen, transcoding a string's entire contents from UTF-16 to UTF-8 or vice versa can be a costly operation. But converting a UTF-16 offset to a UTF-8 offset is a very fast linear scan, essentially summing the high bits on all the bytes. The very first time an API assuming O(1) access to UTF-16 is used on a large string, it performs this scan and leaves breadcrumbs at fixed strides so that it can answer subsequent calls in amortized O(1) time.
The breadcrumbs store an Array of string indices and the length of the string in UTF-16 code units. The i-th breadcrumb corresponds to the i * stride UTF-16 offset. Mapping a UTF-16 offset to a UTF-8 offset to access our contents starts at breadcrumbs[offset / stride] and scans forwards from there. Mapping from a UTF-8 offset to a UTF-16 offset (less common) starts with a reasonable estimate and does a binary search from there to find an upper bound and lower bound for the subsequent scan.
Breadcrumb granularity gives us a way to balance between speed and size. Calculating breadcrumbs, their granularity, and even their representation is behind a resilient function call, so all of this can be tweaked and adjusted in the future.
Currently, strings use a very fine granularity, tilting strongly towards speed out of a desire to not introduce unanticipated regressions in any realistic situation. Strings comprised of latter-BMP scalars which have these APIs called on them have a very low memory footprint on the system overall, so memory pressure is not a common concern. As the performance of this scan is improved, granularity can be increased without harming speed.
ASCII is an encoding subset of UTF-16, which means UTF-8 offsets are the same as UTF-16 offsets if the string is entirely in ASCII. All-ASCII strings skip breadcrumbing and just return the answer.
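The breadcrumb scheme is small enough to sketch. Here is a hypothetical Python version of the variant relevant to Chez, mapping a character index into a UTF-8 byte string to its byte offset (the names and stride are made up for illustration):

```python
def make_breadcrumbs(data: bytes, stride: int = 64):
    """Record the byte offset of every stride-th character in UTF-8 data."""
    crumbs = []
    char_index = 0
    for byte_index, b in enumerate(data):
        if b & 0xC0 != 0x80:              # not a continuation byte: a char starts here
            if char_index % stride == 0:
                crumbs.append(byte_index)
            char_index += 1
    return crumbs

def char_to_byte_offset(data: bytes, crumbs, char_index: int, stride: int = 64):
    """Map a character index to a byte offset, scanning at most stride chars."""
    byte_index = crumbs[char_index // stride]   # jump to the nearest breadcrumb
    remaining = char_index % stride
    while remaining:
        byte_index += 1
        if data[byte_index] & 0xC0 != 0x80:     # crossed into the next character
            remaining -= 1
    return byte_index
```

With a stride of 64, an index lookup touches at most 64 characters instead of scanning from the start of the string, at the cost of storing one offset per 64 characters.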
Recently I made a benchmark among JavaScript, Scheme, and Python here. Actually, I'm glad that Chez Scheme achieves good performance on many items. However, it's really shocking that the string operations make Chez Scheme a main hindrance: among javascript/bun, javascript/nodejs, python/pypy, python/cpython, and scheme/guile, Chez Scheme is the most time-consuming.
I want to know: