cisco / ChezScheme


Improving performance of string operations #877

Open ufo5260987423 opened 2 weeks ago

ufo5260987423 commented 2 weeks ago

Recently I made a benchmark comparing JavaScript, Scheme, and Python here. Actually, I'm glad that Chez Scheme achieves good performance on many items. However, it's really shocking that string operations are a main hindrance for Chez Scheme: among javascript/bun, javascript/nodejs, python/pypy, python/cpython, and scheme/guile, Chez Scheme is the most time-consuming.

I want to know:

  1. Why does this happen?
  2. Will the performance of string operations be improved in the near future?
  3. If not, roughly what should I do to help?
soegaard commented 2 weeks ago

@ufo5260987423

Please share your numbers in this thread.

I have a hunch Python has a special representation for strings that consist of ASCII characters only. The strings in your benchmark contain ASCII characters only. Try timing the following, which uses non-ASCII characters, and compare it to your original Python timings.

If so, you can try the corresponding benchmark in Chez Scheme using byte strings.

s = "abcdef"

def grow():
    # print(".")
    global s 
    s = "ααα" + s + "ααα" + s + "ααα"
    half_length = len(s) // 2
    s = s[half_length:] + s[:half_length]
    return s

def trial(n):
    global s
    i = 0
    while len(s) <= n:
        i += 1
        grow()
    return len(s)

def myTry(n):
    global s
    i = 0
    while i < 10:
        s = "αααααα"
        trial(n)
        i += 1
    return len(s)
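
On the Chez Scheme side, a byte-string version of the original ASCII benchmark could be sketched with bytevectors along these lines (a rough sketch, not code from the benchmark repo; subbv and bv-append are helpers invented here, and grow/trial/my-try mirror the Python shapes above):

(define (subbv bv start end)            ; copy bv[start, end)
  (let* ([n (fx- end start)]
         [r (make-bytevector n)])
    (bytevector-copy! bv start r 0 n)
    r))

(define (bv-append . bvs)               ; concatenate bytevectors
  (let ([r (make-bytevector (apply + (map bytevector-length bvs)))])
    (let loop ([bvs bvs] [at 0])
      (unless (null? bvs)
        (let ([n (bytevector-length (car bvs))])
          (bytevector-copy! (car bvs) 0 r at n)
          (loop (cdr bvs) (fx+ at n)))))
    r))

(define s (string->utf8 "abcdef"))

(define (grow)
  (let ([pad (string->utf8 "abc")])     ; ASCII pad: 1 byte per character
    (set! s (bv-append pad s pad s pad))
    (let ([half (fxquotient (bytevector-length s) 2)])
      (set! s (bv-append (subbv s half (bytevector-length s))
                         (subbv s 0 half)))
      s)))

(define (trial n)
  (let loop ()
    (when (fx<= (bytevector-length s) n)
      (grow)
      (loop)))
  (bytevector-length s))

(define (my-try n)
  (do ([i 0 (fx+ i 1)])
      ((fx= i 10) (bytevector-length s))
    (set! s (string->utf8 "abcabc"))
    (trial n)))

For ASCII-only data this stores 1 byte per character instead of the 4 bytes a Chez string uses, which is what the Python comparison above is probing.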
jltaylor-us commented 2 weeks ago

Chez Scheme strings are stored as full 32-bit Unicode code points. Much of what you are seeing is the cost of allocating and GCing four times as much memory as an implementation that uses UTF-8 encoded strings. The tradeoff is that Chez Scheme offers constant-time access by character offset. Your benchmark has many more operations that depend on the total size of the string than on finding a character offset.
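
For a quick sense of the factor (a sketch, not from this thread):

(define s (make-string 1000000 #\a))
(string-length s)                      ; 1000000 characters
(bytevector-length (string->utf8 s))   ; 1000000 bytes as UTF-8
;; In Chez's heap each character is a 32-bit code point, so the
;; string body alone occupies about 4,000,000 bytes.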

ufo5260987423 commented 1 week ago

Well, I don't think full 32-bit Unicode code points can explain the whole difference, because in my benchmark Chez Scheme takes more than four times as long as the other counterparts.

jltaylor-us commented 1 week ago

Strings are immutable in both JavaScript and Python, so "substring" does not need to result in a copy of the relevant portion of the string. Since the inner loop in your benchmark contains two substring operations, that could be another significant fraction of the difference. (I have no idea whether substring is "almost zero" allocation in the JavaScript and Python implementations; you'll have to figure that out if you want to know whether it contributes to the difference.)

That said, the implementation of substring would be faster (on my machine) if it used string-copy! internally (which eventually bottoms out in memcpy provided by the C runtime, which one would assume is highly optimized) rather than copying the characters itself. Although most of that time difference looks like it's due to less time spent in the GC, which is a little confusing to me since I don't see where anything in that loop would be allocating something to trigger the collector.

ufo5260987423 commented 1 week ago

I just rewrote the code, replacing substring with string-copy!; however, it doesn't change much. (Maybe because of my poor programming.) So, would it be OK for you to share your implementation?

jltaylor-us commented 1 week ago

I meant changing the implementation of substring itself:

diff --git a/s/5_4.ss b/s/5_4.ss
index 8f5ab069..fe1a4406 100644
--- a/s/5_4.ss
+++ b/s/5_4.ss
@@ -32,10 +32,10 @@
             ($oops 'substring
                    "~s and ~s are not valid start/end indices for ~s"
                    m n s1))
-         (let ([s2 ($make-uninitialized-string (fx- n m))])
-           (do ([j 0 (fx+ j 1)] [i m (fx+ i 1)])
-               ((fx= i n) s2)
-             (string-set! s2 j (string-ref s1 i)))))))
+         (let* ([len (fx- n m)]
+                [s2 ($make-uninitialized-string len)])
+           (string-copy! s1 m s2 0 len)
+           s2))))

 (let ()
   (define do-string-append2
ufo5260987423 commented 1 week ago

It seems reasonable. I'll give it a try and see what happens to my benchmark. If it really works, I'll respond in this issue.

ufo5260987423 commented 1 week ago

Sadly, it doesn't work. Updated version:

time scheme  --script ./src/string/string.scm

real    0m2.182s
user    0m1.007s
sys 0m1.168s

Original version:

time scheme --script ./src/string/string.scm

real    0m2.023s
user    0m1.006s
sys 0m1.009s
gwatt commented 1 week ago

Interestingly, I saw a noticeable improvement using @jltaylor-us's patch. I also wrapped the call to (my-try 5000000) in time, so that Chez Scheme reports the timing itself instead of relying on the time command. This avoids polluting the measurements with start-up time:

# vanilla chez-scheme:
14:37:00 $ scheme --script ../various-program-languages-benchmark/src/string/string.scm 
(time (my-try 5000000))
    79 collections
    4.382938192s elapsed cpu time, including 2.551151570s collecting
    4.383891732s elapsed real time, including 2.552344759s collecting
    2013317072 bytes allocated, including 1914810496 bytes reclaimed

# using jtaylor's patch:
15:11:08 $ ./ta6le/bin/ta6le/scheme -b ./ta6le/boot/ta6le/scheme.boot --script ../various-program-languages-benchmark/src/string/string.scm            

(time (my-try 5000000))
    2 collections
    2.622198520s elapsed cpu time, including 0.431483824s collecting
    2.622673218s elapsed real time, including 0.431739396s collecting
    2013266688 bytes allocated, including 1260431552 bytes reclaimed
ufo5260987423 commented 1 week ago

Thanks @gwatt, you're right! @jltaylor-us's patch works!

time ta6le/bin/ta6le/scheme  --script ../various-program-language-benchmark/src/string/string.scm 

real    0m1.432s
user    0m0.534s
sys 0m0.888s

And my fault: I did something wrong in the last test. I'll post a full benchmark performance report here.


And @jltaylor-us, it's your smart work; will you make a PR contributing your patch? My scheme-langserver has really been bothered by string performance recently...

ufo5260987423 commented 1 week ago

[Three benchmark result charts attached.]

NOTE: scheme-local-chezscheme is the build with the patch; its details are in scheme-local-chezscheme.txt. For the other counterparts, please refer to the various-program-languages-benchmark output directory.

Here are the results:

  1. With the string patch, Chez Scheme now actually performs at the same level as Guile;
  2. sumfp-ignore-setuptime and fib also report improvements, but I don't know whether that's a coincidence, lol.
  3. The other benchmarks don't show an obvious improvement or regression.
maoif commented 1 week ago

It's really interesting and useful to see a comparison of different dynamic languages.

maoif commented 1 week ago

@ufo5260987423 It might be better to display the machine information in your benchmark results.

I ran the string benchmark without @jltaylor-us's patch, using Chez Scheme Version 10.0.0:

> (time (my-try 5000000))
(time (my-try 5000000))
    79 collections
    0.792087848s elapsed cpu time, including 0.358229578s collecting
    0.793460921s elapsed real time, including 0.358927788s collecting
    2013317072 bytes allocated, including 1915282208 bytes reclaimed
8388598
> (time (my-try 5000000))
(time (my-try 5000000))
    79 collections
    0.738504446s elapsed cpu time, including 0.314506259s collecting
    0.739721741s elapsed real time, including 0.315044692s collecting
    2013316944 bytes allocated, including 2013463184 bytes reclaimed
8388598
> (time (my-try 5000000))
(time (my-try 5000000))
    79 collections
    0.768487715s elapsed cpu time, including 0.339881611s collecting
    0.769826016s elapsed real time, including 0.340541420s collecting
    2013316944 bytes allocated, including 2013463184 bytes reclaimed
8388598

Using time in my fish shell:

> time scheme --script string.scm

________________________________________________________
Executed in  848.69 millis    fish           external
   usr time  352.20 millis  262.00 micros  351.94 millis
   sys time  494.87 millis   34.00 micros  494.84 millis

My CPU is an i9-12900HK; perhaps that contributes to the small numbers?

ufo5260987423 commented 1 week ago

@maoif My machine is a ThinkPad T480s. You may find the specifics in the detailed reports.

Linux ufo-t480s 6.6.44 #1-NixOS SMP PREEMPT_DYNAMIC Sat Aug 3 06:54:42 UTC 2024 x86_64 GNU/Linux

jltaylor-us commented 1 week ago

And @jltaylor-us, it's your smart work; will you make a PR contributing your patch? My scheme-langserver has really been bothered by string performance recently...

Unfortunately, it actually makes things worse on small strings. (On my machine, anyway)

There is likely some small string size threshold for switching between the two implementations, but I think we need more data (preferably across a variety of platforms) to make sure we're not optimizing a synthetic benchmark workload at the expense of more common real-world performance.
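
To illustrate the shape of such a hybrid (a hypothetical sketch with a made-up cutoff of 16 characters; picking the real threshold is exactly what needs that cross-platform data):

(define (substring/hybrid s1 m n)
  ;; character-by-character loop for short copies, string-copy!
  ;; (which bottoms out in memcpy) for long ones
  (let* ([len (fx- n m)]
         [s2 (make-string len)])
    (if (fx< len 16)
        (do ([j 0 (fx+ j 1)] [i m (fx+ i 1)])
            ((fx= j len) s2)
          (string-set! s2 j (string-ref s1 i)))
        (begin (string-copy! s1 m s2 0 len) s2))))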

gwatt commented 1 week ago

I'd be curious to know why the default substring implementation causes 79 collections while the string-copy!-based implementation causes only 2. Subtracting out the time spent in collections brings their runtimes to approximately the same level.

ufo5260987423 commented 1 week ago

Unfortunately, it actually makes things worse on small strings. (On my machine, anyway)

How small? Maybe I can add a new benchmark.

jltaylor-us commented 1 week ago

I'd be curious to know why the default substring implementation causes 79 collections while the string-copy!-based implementation causes only 2. Subtracting out the time spent in collections brings their runtimes to approximately the same level.

I'm pretty curious about that, myself. My current theory is that the implementation with the larger number of scheme "instructions" is causing the collect request handler to fire more often (approximately when it should based on collect-trip-bytes), whereas the version where iterating through the length of the (sub)string is "hidden" behind a single primitive accumulates quite a bit more than collect-trip-bytes before the GC is actually triggered. This seems plausible, but I haven't managed to write examples that I think illustrate this with 100% certainty, and I'm rapidly losing interest for the evening.
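
One way to probe that theory (a sketch, not something anyone in this thread has run): shrink collect-trip-bytes, the documented Chez parameter controlling roughly how many bytes are allocated between collect requests, and see whether the patched build's collection count climbs back toward the unpatched build's:

(collect-trip-bytes (* 1024 1024))   ; try a smaller trip size
(time (my-try 5000000))              ; my-try from the benchmark script

If the theory is right, the string-copy! version should then collect far more often than twice.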

LiberalArtist commented 1 week ago

Chez Scheme strings are stored as full 32-bit Unicode code points. Much of what you are seeing is the cost of allocating and GCing four times as much memory as an implementation that uses UTF-8 encoded strings. The tradeoff is that Chez Scheme offers constant-time access by character offset.

Swift switched to UTF-8 strings while preserving amortized-constant-time access to characters by storing "breadcrumbs" to convert offsets. There is a blog post and a forum post with even-more-internal details. Here's the most relevant part of the blog post:

As we’ve seen, transcoding a string’s entire contents from UTF-16 to UTF-8 or vice-versa can be a costly operation. But, converting a UTF-16 offset to a UTF-8 offset is a very fast linear scan, essentially summing the high-bits on all the bytes. The very first time an API assuming O(1) access to UTF-16 is used on a large string, it performs this scan and leaves breadcrumbs at fixed strides so that it can answer subsequent calls in amortized O(1) time.

The breadcrumbs store an Array of string indices and the length of the string in UTF-16 code units. The ith breadcrumb corresponds to the i * stride UTF-16 offset. Mapping a UTF-16 offset to a UTF-8 offset to access our contents starts at breadcrumbs[offset / stride] and scans forwards from there. Mapping from a UTF-8 offset to a UTF-16 offset (less common) starts with a reasonable estimate and does a binary search from there to find an upper bound and lower bound for the subsequent scan.

Breadcrumb granularity gives us a way to balance between speed and size. Calculating breadcrumbs, their granularity, and even their representation is behind a resilient function call, so all of this can be tweaked and adjusted in the future.

Currently, strings use a very fine granularity, tilting strongly towards speed out of a desire to not introduce unanticipated regressions in any realistic situation. Strings comprised of latter-BMP scalars which have these APIs called on them have a very low memory footprint on the system overall, so memory pressure is not a common concern. As the performance of this scan is improved, granularity can be increased without harming speed.

ASCII is an encoding subset of UTF-16, which means UTF-8 offsets are the same as UTF-16 offsets if the string is entirely in ASCII. All-ASCII strings skip breadcrumbing and just return the answer.
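
The breadcrumb idea is easy to prototype over a UTF-8 bytevector in Scheme (a toy sketch, not Swift's implementation; the stride value and all names here are made up):

(define stride 64)                     ; one crumb every 64 code points

(define (lead-byte? b)                 ; #t unless b is a UTF-8 continuation byte
  (not (fx= (fxand b #xC0) #x80)))

(define (make-breadcrumbs bv)
  ;; vector v with (vector-ref v k) = byte offset of code point k*stride
  (let loop ([i 0] [cp 0] [crumbs '()])
    (cond
      [(fx= i (bytevector-length bv))
       (list->vector (reverse crumbs))]
      [(lead-byte? (bytevector-u8-ref bv i))
       (loop (fx+ i 1) (fx+ cp 1)
             (if (fxzero? (fxremainder cp stride))
                 (cons i crumbs)
                 crumbs))]
      [else (loop (fx+ i 1) cp crumbs)])))

(define (codepoint->byte-offset bv crumbs k)
  ;; start from the nearest crumb at or below k and scan forward
  (let loop ([i (vector-ref crumbs (fxquotient k stride))]
             [cp (fx* stride (fxquotient k stride))])
    (if (fx= cp k)
        i
        (let scan ([j (fx+ i 1)])
          (if (lead-byte? (bytevector-u8-ref bv j))
              (loop j (fx+ cp 1))
              (scan (fx+ j 1)))))))

Given (define bv (string->utf8 some-big-string)) for some hypothetical some-big-string, (codepoint->byte-offset bv (make-breadcrumbs bv) 100000) jumps near code point 100000 via the crumb table and scans at most stride-1 code points, instead of scanning from the start of the string.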