Open gmounie opened 2 years ago
High I/O performance with lots of very short writes by the user requires user-space buffering, which most other languages do for that reason. In C, that's the difference between the write()
function (a system call into the kernel) and the fwrite()
function (a libc function that performs a copy into a user-space buffer, calling write()
only whenever that buffer gets full, or in case of stdout whenever a line ends).
Julia still doesn't buffer its stdout by default. Each call to println(1)
actually results in a whopping six (6) Linux/x86_64 system calls, each doing a context switch, saving all the registers, flushing TLBs, doing lots of security checks, etc. etc.:
$ strace julia -e 'for i=1:100; println(1); end'
[...]
write(13, "1", 1) = 1
epoll_wait(3, [], 1024, 0) = 0
epoll_wait(3, [], 1024, 0) = 0
write(13, "\n", 1) = 1
epoll_wait(3, [], 1024, 0) = 0
epoll_wait(3, [], 1024, 0) = 0
[...]
Yes, even the newline causes an (always very expensive) kernel context switch!
In comparison, the same operation in C (printf("%d\n", 1);
) causes only one system call when writing to stdout:
$ echo 'main(){for(int i; i<100; i++){printf("%d\n",1);}}' | gcc -x c - && strace ./a.out
[...]
write(1, "1\n", 2) = 2
[...]
And just 0.01 system calls when writing into a file:
$ echo 'main(){for(int i; i<100; i++){printf("%d\n",1);}}' | gcc -x c - && strace ./a.out >/dev/null
[...]
write(1, "1\n1\n1\n1\n1\n1\n1\n1\n1\n1\n1\n1\n1\n1\n1\n1\n"..., 200) = 200
[...]
See also #30044 (if you are not afraid of references to some strong language by Linus T. on the subject of single-byte I/O calls into the kernel; I guess that's how he keeps the kernel so fast ...).
This is obviously not very good. So what is the plan to get buffered I/O by default into Julia?
A workaround is to first print into an IOBuffer
:
$ strace julia -e 'out=IOBuffer(); for i=1:100; println(out, 1); end; write(stdout, take!(out))'
[...]
write(12, "1\n1\n1\n1\n1\n1\n1\n1\n1\n1\n1\n1\n1\n1\n1\n1\n"..., 200) = 200
[...]
Another workaround is to use Base.buffer_writes(stdout)
together with flush(stdout)
, I'm not sure why that function isn't exported.
Btw, if you enable buffer_writes(stdout)
, Julia doesn't flush stdout when it exits. I'm not sure that's intentional.
A workaround is to first print into an IOBuffer:
A better workaround is to use BufferedStreams.jl and write to bio = BufferedOutputStream(stdout)
, since that uses a bounded-size (128KiB) buffer.
In brief: Julia println output speed seems oddly slow, or costly, in an interactive terminal compare to several other languages. The slow behavior is only occurring in an interactive terminal (tried several such as gnome-terminal, xterm, qterminal, sakura, emacs-vterm). No slowness are visible when the standard output is redirected.
1) In details (Linux Debian sid, julia 1.5.3, similar behavior on several x86_64 (Intel, AMD)) I wrote recursive Fibonacci computation with printf inside (Julia code below), in several languages for, e.g., demonstrating the synchronization in a terminal for my OS lecture. Julia computation performance is on par with C, C++, Rust, Fortran, Go, Crystal, without the println, but, with the println, Julia is much much slower (2.5 - 6.5 times slower) than even python3, perl5, ruby or node-js. Only Raku is slower but because the Raku 2021.09 computation are still very slow.
If the output is redirected to /dev/null, Julia is again on par with the fastest languages. The most comparable language I tried, "crystal" (ruby compilation with LLVM), does not have the same behavior.
2) Remarks: According to /usr/bin/time ('time" package in Debian) Julia seems to spend much more time in the kernel: twice the time compare to perl or python.
perf trace julia... records roughly 4M syscalls for fibo(30) compare to roughly 1M in C, perl or python, and 2M for crystal
perf record/report indicates VDSO gettimeofday at the top (3%)
3) My Fibonacci code: