cisco / ChezScheme

Chez Scheme
Apache License 2.0
6.97k stars 986 forks

ChezScheme startup time #263

Open xfbs opened 6 years ago

xfbs commented 6 years ago

Chez Scheme is a real pleasure to use; I'm especially fond of the expression editor on the command line for some interactive coding. But I have one question: I understand that Chez is known for being quite efficient, so why does it take so long to start up?

$ echo "(exit)" | time chez -q
chez -q  0.17s user 0.04s system 54% cpu 0.383 total
$ time ruby -e 'exit'
ruby -e 'exit'  0.08s user 0.02s system 93% cpu 0.108 total
$ time python3 -c 'exit()'
python3 -c 'exit()'  0.05s user 0.01s system 76% cpu 0.083 total

Is this possibly a bug?

atticus0 commented 6 years ago

You are comparing the Python and Ruby interpreters with the Chez Scheme compiler. Interpreters generally have a faster startup time. Try it with the Chez Scheme interpreter: time petite -q.

akeep commented 6 years ago

It is worth noting that scheme and petite are actually the same binary, and both do (roughly) the same thing at startup.

When you start scheme (or petite), after parsing the command-line arguments and determining where the boot file(s) are, it loads the boot file(s) and then runs a maximum-generation collection into the static generation, since the core library is not touched during normal garbage collection.

Loading the boot files consists of uncompressing them, reading the fasl-format binary, and executing the expressions in the boot file, which set up the top-level environment that Chez Scheme programs (including the REPL) execute from.

In the case of scheme this loads both petite.boot, which contains the core library and interpreter, and scheme.boot, which contains the compiler. In the case of petite it loads only petite.boot, which is faster: it both executes less code and requires less time to collect, since there is less in the Scheme heap when the collection runs.

So, that is what it is doing during that first ~0.10s to ~0.20s. Could it be faster? Possibly. We could break the boot file into smaller pieces. For instance, the petite.boot file contains things like the expression editor code and the interpreter, which you do not need if you are just executing a pre-compiled Scheme program and not using the REPL or eval. We have talked, in the past, about breaking this down into smaller pieces so that, for instance, a small script could run against a minuscule Scheme session that doesn't bother loading the expression editor or maybe even the interpreter.

All of that said, petite loads in 0.10s on my Mac (compared with 0.17s for full Chez Scheme), so I'd probably start by pre-compiling any script I wanted to run fast and using petite as the session entry point.

gwatt commented 6 years ago

I was playing around with this, and it seems like the majority of the startup time is spent inflating the boot files. I added some logging around parts of the load process, and using already-inflated boot files I reduced the load time a fair bit:

gmw@ash: /home/gmw/software/ChezScheme
06:51:32 $ ./ta6le/bin/scheme -q -b ./ta6le/boot/ta6le/petite.boot -b ./ta6le/boot/ta6le/scheme.boot <<< ''
main_init(); took 0.001036 seconds
load(tc, i++, 1); S_boot_time = 0; while (i < boot_count) load(tc, i++, 0); took 0.125844 seconds
if (boot_count != 0) Scompact_heap(); took 0.020594 seconds
gmw@ash: /home/gmw/software/ChezScheme
06:51:40 $ zcat ./ta6le/boot/ta6le/petite.boot ./ta6le/boot/ta6le/scheme.boot > chez.boot                  
gmw@ash: /home/gmw/software/ChezScheme
06:52:42 $ ./ta6le/bin/scheme -q -b ./chez.boot <<< ''                                                     
main_init(); took 0.001707 seconds
load(tc, i++, 1); S_boot_time = 0; while (i < boot_count) load(tc, i++, 0); took 0.040372 seconds
if (boot_count != 0) Scompact_heap(); took 0.019447 seconds

I wonder if more efficient use of zlib is possible?
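The effect gwatt measured can be reproduced in miniature outside of Chez. Here is a hedged Python sketch (my illustration, not Chez's actual C loader; the payload size and contents are made up) comparing the cost of zlib inflation against simply consuming the same bytes once they are already inflated:

```python
# Rough illustration: how much time does zlib inflation add on top of
# merely traversing the same bytes? Numbers vary by machine; the payload
# is a made-up, highly compressible stand-in for a boot file.
import time
import zlib

payload = b"(define (fact n) (if (zero? n) 1 (* n (fact (- n 1)))))\n" * 100_000
compressed = zlib.compress(payload)

t0 = time.perf_counter()
inflated = zlib.decompress(compressed)
inflate_time = time.perf_counter() - t0

t0 = time.perf_counter()
copied = bytes(payload)  # plain memory copy of the already-inflated data
copy_time = time.perf_counter() - t0

assert inflated == payload
print(f"inflate: {inflate_time*1000:.1f} ms, plain copy: {copy_time*1000:.1f} ms")
```

On most machines the inflate step dwarfs the plain traversal, which is consistent with pre-inflated boot files loading noticeably faster.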

jltaylor-us commented 6 years ago

Very informative! Seems like the decision about whether to optimize for space or startup speed should be left up to the installation, though. The effect of the file size on startup speed is going to vary widely depending on whether the files are being read from SSD, spinning disk, slow network, etc.

Maybe there should be a configure option to not compress the boot files? Or, if someone makes an installer, it could be an install-time option.

cipriancraciun commented 3 years ago

@gwatt Did you manage to publish the non-compressed boot file handling somewhere?


@akeep Otherwise, for me too startup time is very important, because my intended use-case for Scheme (and thus Chez) is to replace small bash scripts (or similar) that are called many times by other scripts / tools. (For example as part of a build system, or as part of an Ansible-like alternative.) In fact, what got me interested in Chez is the small deployment footprint, i.e. a single executable, and the fact that one can have compiled boot files.

During my own benchmarks, just loading petite.boot (without doing anything else) takes 4x as long as Python 2 and almost 2x as long as Python 3, and I can't even compare it with bash...

Granted, once you actually do stuff, Python (and other interpreted languages) might take some extra time to parse and compile the code, so perhaps they'll even out in terms of startup latency.

But still, in case of Chez, there is no parsing and compiling step (when loading a boot image), so it should be blazingly fast.

(For the record, on my 5(?)-year-old Lenovo T450 it takes around 50ms, which is noticeable. Now imagine a build tool that invokes the same script multiple times in a loop; those 50ms easily become 5 seconds...)


Also granted that perhaps Chez tries to optimize for long-running processes and thus low-latency startup is not a problem. (For example Java, and I assume most JVM-based languages, have very high startup latency. On the other hand I've seen that even Go doesn't optimize for startup latency, and once one has enough dependencies that run initialization code, it can become quite sluggish...)

amirouche commented 3 years ago

When compiled, startup time gets better; here are timings with a full Chez build:

% cat exit.scm
(import (chezscheme))

(exit)
% ruse-exe --optimize-level=3 exit.scm
% time ./a.out

real    0m0,048s
user    0m0,020s
sys 0m0,028s

Maybe with a petite-only binary it would be even faster, but that is not supported by ruse-exe as of yet.

ruse-exe is available at https://github.com/scheme-ruse/ruse-exe/

I guess you know, but if startup time is a problem, you can daemonize the program and call the daemon.

Hope this helps.
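The daemon idea can be sketched in a few lines. This is my illustration in Python (not anything in Chez or ruse-exe; the socket path and the "done:" protocol are invented): one long-lived process pays the startup cost once, and cheap clients hand it work over a Unix socket.

```python
# Hypothetical "worker daemon" sketch: the server thread stands in for a
# long-lived Scheme process; the client stands in for each cheap invocation.
import os
import socket
import threading

SOCK_PATH = "/tmp/scheme-worker.sock"  # illustrative path
ready = threading.Event()

def serve():
    # A real daemon would loop forever; one request is enough for the sketch.
    if os.path.exists(SOCK_PATH):
        os.unlink(SOCK_PATH)
    srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    srv.bind(SOCK_PATH)
    srv.listen(1)
    ready.set()
    conn, _ = srv.accept()
    request = conn.recv(4096)          # e.g. a script name plus arguments
    conn.sendall(b"done: " + request)  # a real server would run the script here
    conn.close()
    srv.close()

t = threading.Thread(target=serve)
t.start()
ready.wait()

# The client side, standing in for one build step:
cli = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
cli.connect(SOCK_PATH)
cli.sendall(b"build-step-42")
reply = cli.recv(4096)
cli.close()
t.join()
os.unlink(SOCK_PATH)
print(reply.decode())  # → done: build-step-42
```

The per-call cost becomes one connect plus one round trip, instead of a full interpreter startup.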

amirouche commented 3 years ago

Disregard my previous message. If there ever was a big-ish startup time, there is no such thing now:

% echo '(exit)' | time scheme -q
0.03user 0.01system 0:00.04elapsed 100%CPU (0avgtext+0avgdata 49292maxresident)k
0inputs+0outputs (0major+11881minor)pagefaults 0swaps
% echo '(exit)' | time petite -q
0.02user 0.00system 0:00.02elapsed 100%CPU (0avgtext+0avgdata 29536maxresident)k
0inputs+0outputs (0major+6867minor)pagefaults 0swaps

That is done with the latest stable release:

scheme --version
9.5.4
cipriancraciun commented 3 years ago

@amirouche Disregard my previous message. If there ever was a big-ish startup time, there is no such thing now:

Well at least on my side things are different:

scheme --version
9.5.4

lscpu
...
Model name:                      Intel(R) Core(TM) i5-5300U CPU @ 2.30GHz
...
CPU max MHz:                     2900.0000 
BogoMIPS:                        4589.37

uname -a
Linux localhost 5.13.4-1-default #1 SMP Thu Jul 22 15:55:06 UTC 2021 (91a0cca) x86_64 x86_64 x86_64 GNU/Linux

lsb-release -a
LSB Version:    n/a
Distributor ID: openSUSE
Description:    openSUSE Tumbleweed
Release:        20210827
Codename:       n/a

(I use OpenSUSE Tumbleweed, I've installed Chez Scheme from https://software.opensuse.org/package/chezscheme, the devel:language:misc repository.)

Here are a few tests (granted these are on a "cold" CPU, thus not significant):

touch empty.ss

time scheme --script ./empty.ss
real    0m0.088s
user    0m0.060s
sys     0m0.028s

time scheme -q <<<''
real    0m0.086s
user    0m0.066s
sys     0m0.020s

Here are a few tests by using hyperfine (I use a custom patch that removes the shell invocation, i.e. the --shell '' argument):

hyperfine --shell '' --warmup 100 --runs 100 -- 'scheme --script ./empty.ss'

Benchmark #1: scheme --script ./empty.ss
  Time (mean ± σ):      80.0 ms ±   0.6 ms    [User: 61.0 ms, System: 18.8 ms]
  Range (min … max):    79.0 ms …  81.9 ms    100 runs

I've also tried using --boot petite.boot:

time scheme --boot petite.boot --script ./empty.ss
real    0m0.058s
user    0m0.050s
sys     0m0.008s

time scheme --boot petite.boot -q <<<''
real    0m0.057s
user    0m0.049s
sys     0m0.009s

hyperfine --shell '' --warmup 100 --runs 100 -- 'scheme --boot petite.boot --script ./empty.ss'
Benchmark #1: scheme --boot petite.boot --script ./empty.ss
  Time (mean ± σ):      50.7 ms ±   0.6 ms    [User: 37.8 ms, System: 12.8 ms]
  Range (min … max):    49.8 ms …  53.1 ms    100 runs

I've also tried using the chezscheme-petite package (where I assume that calling petite ... is equivalent to scheme --boot petite.boot ...):

time petite --script ./empty.ss
real    0m0.057s
user    0m0.056s
sys     0m0.001s

time petite -q <<<''
real    0m0.053s
user    0m0.041s
sys     0m0.012s

hyperfine --shell '' --warmup 100 --runs 100 -- 'petite --script ./empty.ss'
Benchmark #1: petite --script ./empty.ss
  Time (mean ± σ):      50.7 ms ±   0.9 ms    [User: 38.6 ms, System: 12.0 ms]
  Range (min … max):    49.5 ms …  56.4 ms    100 runs

For comparison I've tried guile (3.0.7):

hyperfine --shell '' --warmup 100 --runs 100 -- 'guile ./empty.ss'
Benchmark #1: guile ./empty.ss
  Time (mean ± σ):      11.8 ms ±   0.4 ms    [User: 7.8 ms, System: 4.2 ms]
  Range (min … max):    11.2 ms …  13.8 ms    100 runs

(I don't have RPM packages for Gauche and Chibi for OpenSUSE.)

Then Python2 and Python3:

touch ./empty.py

hyperfine --shell '' --warmup 100 --runs 100 -- 'python2 ./empty.py'
Benchmark #1: python2 ./empty.py
  Time (mean ± σ):      10.8 ms ±   1.2 ms    [User: 8.0 ms, System: 2.7 ms]
  Range (min … max):     9.6 ms …  13.1 ms    100 runs

hyperfine --shell '' --warmup 100 --runs 100 -- 'python3 ./empty.py'
Benchmark #1: python3 ./empty.py
  Time (mean ± σ):      25.0 ms ±   0.6 ms    [User: 20.8 ms, System: 4.1 ms]
  Range (min … max):    24.1 ms …  27.1 ms    100 runs

Then Ruby2.7 and Ruby3.0:

touch ./empty.rb

hyperfine --shell '' --warmup 100 --runs 100 -- 'ruby.ruby2.7 ./empty.rb'
Benchmark #1: ruby.ruby2.7 ./empty.rb
  Time (mean ± σ):      59.4 ms ±   0.8 ms    [User: 49.2 ms, System: 10.1 ms]
  Range (min … max):    58.2 ms …  63.2 ms    100 runs

hyperfine --shell '' --warmup 100 --runs 100 -- 'ruby.ruby3.0 ./empty.rb'
Benchmark #1: ruby.ruby3.0 ./empty.rb
  Time (mean ± σ):      70.9 ms ±   3.6 ms    [User: 60.8 ms, System: 9.9 ms]
  Range (min … max):    69.2 ms … 105.6 ms    100 runs

As I said, I don't believe the "empty startup latency" is a proper measure, because it doesn't actually do anything. For example, in Python, once you start loading some libraries (including ones in the standard distribution), things get significantly slower.

However the "empty startup latency" is the minimum latency you'll get, and no kind of optimization can make it faster than that... (For example in Python, with some meta-programming tricks, I've managed to lazy-load libraries thus reducing startup time for short scripts.)
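The lazy-loading trick mentioned above can be sketched like this (my illustration, not the commenter's actual code; the LazyModule name is invented): defer the real import until the first attribute access, so short scripts that never touch a heavy module skip its import cost entirely.

```python
# Minimal lazy-import proxy: importlib.import_module runs only on first use.
import importlib

class LazyModule:
    def __init__(self, name):
        self._name = name
        self._module = None

    def __getattr__(self, attr):
        # Called only for attributes not set in __init__, i.e. real module
        # members; the underlying import happens here, once.
        if self._module is None:
            self._module = importlib.import_module(self._name)
        return getattr(self._module, attr)

json = LazyModule("json")          # no import cost paid yet
print(json.dumps({"fast": True}))  # first use triggers the real import
```

This only pays off when the import itself is expensive, but for a tree of small scripts it can shave a meaningful fraction of the per-invocation latency.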


I guess you know, but if startup time is a problem, you can daemonize the program and call the daemon.

(Here I'm talking about my example use-case of Scheme scripts being called from within a build system that calls many of these small scripts multiple times, each time with slightly different arguments.)

Yes, using a daemon would be an option, but it would defeat the purpose of using a build system like make or ninja, and all of a sudden one would have to manage starting the daemon before the build and stopping it afterwards.

My hope would have been that once one creates a my-tools.boot containing everything (including the petite.boot), then the startup would be nearly instantaneous.

amirouche commented 3 years ago

I guess you know, but if startup time is a problem, you can daemonize the program and call the daemon.

(Here I'm talking about my example use-case of Scheme scripts being called from within a build system that calls many of these small scripts multiple times, each time with slightly different arguments.)

(maybe a script calling another script all-the-way-down is the problem)

nikhilm commented 1 year ago

I am not very familiar with Chez Scheme directly, and dug into this a bit in the context of Racket (v8.6) on Linux. It seems like the fasl loading results in direct read() system calls (evident when running under strace), often reading a single byte at a time. Would there be any improvement from using buffered I/O (i.e. using glibc's buffered file operations), and is that even feasible?
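The difference buffering makes can be demonstrated outside of Chez or Racket. A hedged Python sketch (a model of the pattern, not the actual fasl reader): reading a file one byte at a time from an unbuffered descriptor issues one read() per byte, while a buffered wrapper coalesces the same access pattern into a handful of large reads.

```python
# Compare underlying read counts for byte-at-a-time access, unbuffered
# versus buffered. The file contents are arbitrary test data.
import io
import os
import tempfile

data = os.urandom(16 * 1024)
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(data)
    path = f.name

# Unbuffered: one read() system call per byte, like the strace output above.
fd = os.open(path, os.O_RDONLY)
chunks = []
unbuffered_reads = 0
for _ in range(len(data)):
    chunks.append(os.read(fd, 1))
    unbuffered_reads += 1
os.close(fd)
unbuffered = b"".join(chunks)

class CountingRaw(io.FileIO):
    reads = 0
    def readinto(self, b):
        # BufferedReader refills its buffer through readinto.
        CountingRaw.reads += 1
        return super().readinto(b)

# Buffered: the wrapper turns the same byte-at-a-time pattern into
# buffer_size-sized underlying reads.
buf = io.BufferedReader(CountingRaw(path, "rb"), buffer_size=64 * 1024)
buffered = b"".join(buf.read(1) for _ in range(len(data)))
buf.close()
buffered_reads = CountingRaw.reads

os.unlink(path)
assert unbuffered == buffered == data
print(unbuffered_reads, buffered_reads)
```

Whether the syscall savings translate into meaningful wall-clock savings depends on the storage and the kernel's page cache, which matches the later observation that the buffering fix mostly cleans up the profile rather than the startup time.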

For example, running this, where simple.rkt is just a file with #lang racket/base and nothing else:

strace -c racket simple.rkt
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 81.41    0.020731          10      1890           read
 14.51    0.003696           1      3367           lseek
  1.92    0.000488           1       339        75 newfstatat
  0.73    0.000186           1        96           openat
  0.64    0.000164           2        78           mmap
  0.37    0.000094           0        96           close
  0.20    0.000051           5         9           munmap
  0.07    0.000019           0        50           clock_gettime
  0.07    0.000018           2         9         8 readlink
  0.07    0.000017           1        15           epoll_wait
  0.01    0.000002           2         1           futex
  0.00    0.000000           0         9           mprotect
  0.00    0.000000           0         3           brk
  0.00    0.000000           0        22           rt_sigaction
  0.00    0.000000           0         1           rt_sigprocmask
  0.00    0.000000           0         1           ioctl
  0.00    0.000000           0         4           pread64
  0.00    0.000000           0         1         1 access
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         2           fcntl
  0.00    0.000000           0         1           getcwd
  0.00    0.000000           0         2         1 arch_prctl
  0.00    0.000000           0         1           epoll_create
  0.00    0.000000           0         1           set_tid_address
  0.00    0.000000           0         1           set_robust_list
  0.00    0.000000           0         1           pipe2
  0.00    0.000000           0         1           prlimit64
  0.00    0.000000           0         1           getrandom
  0.00    0.000000           0         1           rseq
------ ----------- ----------- --------- --------- ----------------
100.00    0.025466           4      6004        85 total
> hyperfine -w 100 -r 100 'racket simple.rkt'
Benchmark 1: racket simple.rkt
  Time (mean ± σ):      88.1 ms ±  13.1 ms    [User: 59.2 ms, System: 28.8 ms]
  Range (min … max):    75.2 ms … 111.9 ms    100 runs

So about 20ms out of 88 are spent in >1000 read() calls.

mflatt commented 11 months ago

Some notes on startup times currently, starting with x86_64 Ubuntu 22.04:

$ echo "(exit)" | time scheme -q
scheme -q  0.03s user 0.03s system 99% cpu 0.065 total
$ time ruby -e 'exit'
ruby -e 'exit'  0.06s user 0.00s system 99% cpu 0.066 total
$ time python3 -c 'exit()'
python3 -c 'exit()'  0.01s user 0.00s system 97% cpu 0.017 total

On an M2 MacBook running macOS Ventura:

$ echo "(exit)" | time scheme -q
scheme -q  0.05s user 0.01s system 97% cpu 0.060 total
$ time ruby -e 'exit'
ruby -e 'exit'  0.05s user 0.02s system 90% cpu 0.077 total
$ time python3 -c 'exit()'
python3 -c 'exit()'  0.02s user 0.01s system 87% cpu 0.042 total

The biggest difference here compared to the original post is the switch from zlib to LZ4 as the default compression for boot files. LZ4 is much faster to decompress.

When even faster startup is needed, there's now a "vfasl" format that can be used for boot files (this is on macOS):

$ scheme
> (vfasl-convert-file "boot/tarm64osx/petite.boot" "/tmp/petite.boot" '())
> (vfasl-convert-file "boot/tarm64osx/scheme.boot" "/tmp/scheme.boot" '("petite"))
$ echo "(exit)" | time scheme -q -b /tmp/petite.boot -b /tmp/scheme.boot 
scheme -q -b /tmp/petite.boot -b /tmp/scheme.boot  0.03s user 0.01s system 95% cpu 0.042 total

If startup time is important enough to use vfasl, probably you want to disable compression, too:

$ scheme                                                                
> (fasl-compressed #f)
> (vfasl-convert-file "boot/tarm64osx/petite.boot" "/tmp/petite.boot" '())
> (vfasl-convert-file "boot/tarm64osx/scheme.boot" "/tmp/scheme.boot" '("petite"))
$ echo "(exit)" | time scheme -q -b /tmp/petite.boot -b /tmp/scheme.boot
scheme -q -b /tmp/petite.boot -b /tmp/scheme.boot  0.02s user 0.01s system 94% cpu 0.034 total

If loading only "petite":

$ echo "(exit)" | time scheme -q -b petite.boot               
scheme -q -b petite.boot  0.04s user 0.01s system 96% cpu 0.048 total

Using just "petite" with uncompressed vfasl files:

$ echo "(exit)" | time scheme -q -b /tmp/petite.boot 
scheme -q -b /tmp/petite.boot  0.02s user 0.01s system 93% cpu 0.028 total

@nikhilm notes that the read system call is used many times for small reads at startup. The numbers reported for Racket use vfasl files, but also three boot files instead of two. The numbers for Chez Scheme's default configuration are even higher (this is on Linux):

$ echo "(exit)" | strace -c scheme -q
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 98.87    0.008833           1      4626           read
  0.87    0.000078           1        68           mmap
  0.22    0.000020           2         8           rt_sigaction
  0.03    0.000003           0         6           close
  0.00    0.000000           0         5           mprotect
  0.00    0.000000           0        28           munmap
  0.00    0.000000           0         3           brk
  0.00    0.000000           0         4           pread64
  0.00    0.000000           0         1         1 access
  0.00    0.000000           0         1           execve
  0.00    0.000000           0        66        54 readlink
  0.00    0.000000           0         2         1 arch_prctl
  0.00    0.000000           0         1           set_tid_address
  0.00    0.000000           0         5           clock_gettime
  0.00    0.000000           0        10         4 openat
  0.00    0.000000           0         8           newfstatat
  0.00    0.000000           0         1           set_robust_list
  0.00    0.000000           0         1           prlimit64
  0.00    0.000000           0         1           getrandom
  0.00    0.000000           0         1           rseq
------ ----------- ----------- --------- --------- ----------------
100.00    0.008934           1      4846        60 total

#769 improves buffering to reduce the read counts:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 68.42    0.001443          51        28           munmap
 24.56    0.000518           3       150           read
  5.17    0.000109           1        68           mmap
  0.95    0.000020           2         8           newfstatat
  0.66    0.000014           2         5           clock_gettime
  0.24    0.000005           0         6           close
  0.00    0.000000           0         5           mprotect
  0.00    0.000000           0         3           brk
  0.00    0.000000           0         8           rt_sigaction
  0.00    0.000000           0         4           pread64
  0.00    0.000000           0         1         1 access
  0.00    0.000000           0         1           execve
  0.00    0.000000           0        66        54 readlink
  0.00    0.000000           0         2         1 arch_prctl
  0.00    0.000000           0         1           set_tid_address
  0.00    0.000000           0        10         4 openat
  0.00    0.000000           0         1           set_robust_list
  0.00    0.000000           0         1           prlimit64
  0.00    0.000000           0         1           getrandom
  0.00    0.000000           0         1           rseq
------ ----------- ----------- --------- --------- ----------------
100.00    0.002109           5       370        60 total

This improvement has very little effect on startup time, but it moves suspicious behavior out of the way for anyone who wants to investigate more.

nikhilm commented 8 months ago

I have a local Racket source checkout. How would I build a racketcs binary using the uncompressed fasl files? I am curious to compare on my machine. Thanks!

mflatt commented 8 months ago

@nikhilm On most platforms, uncompressed vfasl is the default mode for Racket boot files. You can configure with --enable-compressboot to get compressed boot files. (The exception is Windows, where boot files are compressed by default.)