crystal-lang / crystal

The Crystal Programming Language
https://crystal-lang.org

Trace GC and Scheduler #14599

Closed ysbaddaden closed 1 month ago

ysbaddaden commented 1 month ago

Introduces a mechanism to trace GC and scheduler operations as they happen: when they happen and what they do.

This has proven invaluable for understanding multi-threaded synchronization issues that would have been impossible to reproduce using a debugger (you can't step manually and hope to hit a race condition: you need brute force, then analysis). In addition, tracing the GC can yield interesting stats about how much time is spent in GC (malloc, collect, ...), or reveal how many allocations happen and when. With both traces, we can cross-analyze the lifetime of a program.

The tracing itself is relatively simple: each trace is a line starting with its section (gc, sched) and the operation (malloc, enqueue, ...), followed by context metadata (thread, fiber, time + duration) and, optionally, metadata about the operation itself. The trace is meant to be easy to parse and read: grep it, search it, copy-paste it, you name it.
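
As a rough illustration of how simple the format is to consume (this snippet is a sketch, not the parser shipped in this PR):

```crystal
# Sketch only: split a trace line into section, operation and key=value
# metadata. Labels without '=' (e.g. "[main]") are skipped for brevity.
record TraceEvent, section : String, operation : String, metadata : Hash(String, String)

def parse_trace_line(line : String) : TraceEvent
  parts = line.split(' ')
  section = parts.shift
  operation = parts.shift
  metadata = {} of String => String
  parts.each do |part|
    key, sep, value = part.partition('=')
    metadata[key] = value unless sep.empty?
  end
  TraceEvent.new(section, operation, metadata)
end

event = parse_trace_line("gc malloc t=283627 d=6049 thread=0x7fcde1731680 [?] size=16")
puts event.metadata["size"] # => 16
```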

Tracing is disabled by default and must be enabled manually by compiling with -Dtracing. The sections to trace are then selected with the CRYSTAL_TRACE environment variable, a comma-separated (,) list of section names. For example CRYSTAL_TRACE=gc,sched.

Examples

Tracing the GC while running a hello world (hello.cr):

```console
$ CRYSTAL_TRACE=gc bin/crystal run -Dtracing hello.cr
gc malloc t=283627 d=6049 thread=0x7fcde1731680 [?] size=16
gc malloc t=322742 d=193 thread=0x7fcde1731680 [?] size=24
gc malloc t=329923 d=7241 thread=0x7fcde1731680 [?] size=4 atomic=1
gc malloc t=344512 d=152 thread=0x7fcde1731680 [?] size=24
gc malloc t=355060 d=144 thread=0x7fcde1731680 [?] size=24
gc malloc t=361840 d=154 thread=0x7fcde1731680 [?] size=24
gc malloc t=369117 d=4589 thread=0x7fcde1731680 [?] size=51 atomic=1
gc malloc t=396543 d=171 thread=0x7fcde1731680 [?] size=4 atomic=1
gc malloc t=418923 d=2335 thread=0x7fcde1731680 [?] size=16 atomic=1
gc malloc t=461257 d=4847 thread=0x7fcde1731680 [?] size=232
gc malloc t=474812 d=1880 thread=0x7fcde1731680 [?] size=56
gc malloc t=483029 d=164 thread=0x7fcde1731680 [?] size=4 atomic=1
gc malloc t=489296 d=145 thread=0x7fcde1731680 [?] size=56
gc malloc t=495286 d=121 thread=0x7fcde1731680 [?] size=4 atomic=1
gc malloc t=501503 d=129 thread=0x7fcde1731680 [?] size=56
gc malloc t=507676 d=199 thread=0x7fcde1731680 [?] size=4 atomic=1
gc malloc t=513894 d=128 thread=0x7fcde1731680 [?] size=56
gc malloc t=519944 d=138 thread=0x7fcde1731680 [?] size=4 atomic=1
gc malloc t=533452 d=159 thread=0x7fcde1731680 [?] size=232
gc malloc t=539849 d=140 thread=0x7fcde1731680 [?] size=56
gc malloc t=545960 d=138 thread=0x7fcde1731680 [?] size=4 atomic=1
gc malloc t=552102 d=127 thread=0x7fcde1731680 [?] size=56
gc malloc t=558081 d=130 thread=0x7fcde1731680 [?] size=4 atomic=1
gc malloc t=564211 d=121 thread=0x7fcde1731680 [?] size=56
gc malloc t=570196 d=196 thread=0x7fcde1731680 [?] size=4 atomic=1
gc malloc t=576388 d=128 thread=0x7fcde1731680 [?] size=56
gc malloc t=582328 d=137 thread=0x7fcde1731680 [?] size=4 atomic=1
gc malloc t=599921 d=4268 thread=0x7fcde1731680 [?] size=104
gc malloc t=661946 d=6059 thread=0x7fcde1731680 [?] size=144
gc malloc t=676737 d=4050 thread=0x7fcde1731680 [?] size=32
gc malloc t=686896 d=325 thread=0x7fcde1731680 [?] size=48 atomic=1
gc malloc t=694224 d=120 thread=0x7fcde1731680 [?] size=32
gc malloc t=700157 d=197 thread=0x7fcde1731680 [?] size=48 atomic=1
gc malloc t=706938 d=3931 thread=0x7fcde1731680 [?] fiber=0x7fcde16fef00 [main] size=64
gc malloc t=726741 d=172 thread=0x7fcde1731680 [?] fiber=0x7fcde16fef00 [main] size=24
gc malloc t=734605 d=146 thread=0x7fcde1731680 [?] fiber=0x7fcde16fef00 [main] size=16
gc malloc t=742094 d=136 thread=0x7fcde1731680 [?] fiber=0x7fcde16fef00 [main] size=24
gc malloc t=749643 d=155 thread=0x7fcde1731680 [?] fiber=0x7fcde16fef00 [main] size=4 atomic=1
gc malloc t=757559 d=172 thread=0x7fcde1731680 [?] fiber=0x7fcde16fef00 [main] size=24
gc malloc t=765037 d=1964 thread=0x7fcde1731680 [?] fiber=0x7fcde16fef00 [main] size=8
gc malloc t=774480 d=164 thread=0x7fcde1731680 [?] fiber=0x7fcde16fef00 [main] size=144
gc malloc t=790856 d=188 thread=0x7fcde1731680 [?] fiber=0x7fcde16fef00 [main] size=32
gc malloc t=800781 d=141 thread=0x7fcde1731680 [?] fiber=0x7fcde16fef00 [main] size=144
gc malloc t=816696 d=238 thread=0x7fcde1731680 [?] fiber=0x7fcde16fef00 [main] size=56
gc malloc t=824971 d=165 thread=0x7fcde1731680 [?] fiber=0x7fcde16fef00 [main] size=24
gc malloc t=832422 d=161 thread=0x7fcde1731680 [?] fiber=0x7fcde16fef00 [main] size=4 atomic=1
gc malloc t=840397 d=232 thread=0x7fcde1731680 [?] fiber=0x7fcde16fef00 [main] size=56
gc malloc t=850945 d=164 thread=0x7fcde1731680 [?] fiber=0x7fcde16fef00 [main] size=96
gc realloc t=860696 d=565 thread=0x7fcde1731680 [?] fiber=0x7fcde16fef00 [main] size=48
gc malloc t=868865 d=204 thread=0x7fcde1731680 [?] fiber=0x7fcde16fef00 [main] size=56
gc malloc t=876318 d=207 thread=0x7fcde1731680 [?] fiber=0x7fcde16fef00 [main] size=56
gc malloc t=883939 d=2535 thread=0x7fcde1731680 [?] fiber=0x7fcde16fef00 [main] size=37 atomic=1
gc realloc t=899943 d=238 thread=0x7fcde1731680 [?] fiber=0x7fcde16fef00 [main] size=29
gc malloc t=909028 d=211 thread=0x7fcde1731680 [?] fiber=0x7fcde16fef00 [main] size=15 atomic=1
gc malloc t=918448 d=138 thread=0x7fcde1731680 [?] fiber=0x7fcde16fef00 [main] size=15 atomic=1
gc malloc t=926308 d=208 thread=0x7fcde1731680 [?] fiber=0x7fcde16fef00 [main] size=56
gc malloc t=940896 d=147 thread=0x7fcde1731680 [?] fiber=0x7fcde16fef00 [main] size=4 atomic=1
gc malloc t=954380 d=189 thread=0x7fcde1731680 [?] fiber=0x7fcde16fef00 [main] size=232
gc malloc t=962346 d=216 thread=0x7fcde1731680 [?] fiber=0x7fcde16fef00 [main] size=56
gc malloc t=969804 d=136 thread=0x7fcde1731680 [?] fiber=0x7fcde16fef00 [main] size=4 atomic=1
gc malloc t=977310 d=208 thread=0x7fcde1731680 [?] fiber=0x7fcde16fef00 [main] size=56
gc malloc t=984685 d=209 thread=0x7fcde1731680 [?] fiber=0x7fcde16fef00 [main] size=4 atomic=1
gc malloc t=992156 d=426 thread=0x7fcde1731680 [?] fiber=0x7fcde16fef00 [main] size=56
gc malloc t=999790 d=133 thread=0x7fcde1731680 [?] fiber=0x7fcde16fef00 [main] size=4 atomic=1
gc malloc t=1007173 d=218 thread=0x7fcde1731680 [?] fiber=0x7fcde16fef00 [main] size=56
gc malloc t=1014625 d=133 thread=0x7fcde1731680 [?] fiber=0x7fcde16fef00 [main] size=4 atomic=1
gc malloc t=1024269 d=147 thread=0x7fcde1731680 [?] fiber=0x7fcde16fef00 [main] size=232
gc malloc t=1031810 d=238 thread=0x7fcde1731680 [?] fiber=0x7fcde16fef00 [main] size=56
gc malloc t=1039291 d=133 thread=0x7fcde1731680 [?] fiber=0x7fcde16fef00 [main] size=4 atomic=1
gc malloc t=1046683 d=261 thread=0x7fcde1731680 [?] fiber=0x7fcde16fef00 [main] size=56
gc malloc t=1054095 d=422 thread=0x7fcde1731680 [?] fiber=0x7fcde16fef00 [main] size=4 atomic=1
gc malloc t=1061816 d=253 thread=0x7fcde1731680 [?] fiber=0x7fcde16fef00 [main] size=56
gc malloc t=1069240 d=130 thread=0x7fcde1731680 [?] fiber=0x7fcde16fef00 [main] size=4 atomic=1
gc malloc t=1076645 d=133 thread=0x7fcde1731680 [?] fiber=0x7fcde16fef00 [main] size=56
gc malloc t=1084035 d=129 thread=0x7fcde1731680 [?] fiber=0x7fcde16fef00 [main] size=4 atomic=1
gc malloc t=1093576 d=134 thread=0x7fcde1731680 [?] fiber=0x7fcde16fef00 [main] size=56
gc malloc t=1101230 d=159 thread=0x7fcde1731680 [?] fiber=0x7fcde16fef00 [main] size=56
gc malloc t=1108816 d=142 thread=0x7fcde1731680 [?] fiber=0x7fcde16fef00 [main] size=56
gc malloc t=1116270 d=149 thread=0x7fcde1731680 [?] fiber=0x7fcde16fef00 [main] size=24
gc malloc t=1123615 d=143 thread=0x7fcde1731680 [?] fiber=0x7fcde16fef00 [main] size=4 atomic=1
gc malloc t=1131089 d=131 thread=0x7fcde1731680 [?] fiber=0x7fcde16fef00 [main] size=56
```

Same, but multi-threaded and tracing the scheduler this time:

```console
$ CRYSTAL_TRACE=sched bin/crystal run -Dtracing -Dpreview_mt hello.cr
sched resume t=1536163 d=-1 thread=0x7f5c3860c700 [CRYSTAL-MT-1] fiber=0x7f5c3c628dc0 [main] fiber=0x7f5c3c628d20 [Stack pool collector]
sched sleep t=1589839 d=-1 thread=0x7f5c3860c700 [CRYSTAL-MT-1] fiber=0x7f5c3c628d20 [Stack pool collector] duration=5000000000
sched resume t=1648860 d=-1 thread=0x7f5c3760b700 [CRYSTAL-MT-2] fiber=0x7f5c3c628c80 [main] fiber=0x7f5c3c628be0 [Stack pool collector]
sched resume t=1664881 d=-1 thread=0x7f5c3860c700 [CRYSTAL-MT-1] fiber=0x7f5c3c628d20 [Stack pool collector] fiber=0x7f5c3c628dc0 [main]
sched sleep t=1678251 d=-1 thread=0x7f5c3760b700 [CRYSTAL-MT-2] fiber=0x7f5c3c628be0 [Stack pool collector] duration=5000000000
sched mt:sleeping t=1681279 d=-1 thread=0x7f5c3860c700 [CRYSTAL-MT-1] fiber=0x7f5c3c628dc0 [main]
sched yield t=1690430 d=-1 thread=0x7f5c3c654740 [?] fiber=0x7f5c3c628f00 [main]
sched resume t=1739355 d=-1 thread=0x7f5c3760b700 [CRYSTAL-MT-2] fiber=0x7f5c3c628be0 [Stack pool collector] fiber=0x7f5c3c628c80 [main]
sched reschedule t=1747366 d=-1 thread=0x7f5c3860c700 [CRYSTAL-MT-1] fiber=0x7f5c3c628dc0 [main]
sched resume t=1746231 d=-1 thread=0x7f5c3c654740 [?] fiber=0x7f5c3c628f00 [main] fiber=0x7f5c3c628e60 [Worker Loop]
sched mt:sleeping t=1756980 d=-1 thread=0x7f5c3760b700 [CRYSTAL-MT-2] fiber=0x7f5c3c628c80 [main]
sched resume t=1770997 d=-1 thread=0x7f5c36e0a700 [CRYSTAL-MT-3] fiber=0x7f5c3c628b40 [main] fiber=0x7f5c3c628aa0 [Stack pool collector]
sched sleep t=1831512 d=-1 thread=0x7f5c36e0a700 [CRYSTAL-MT-3] fiber=0x7f5c3c628aa0 [Stack pool collector] duration=5000000000
sched reschedule t=1843593 d=-1 thread=0x7f5c3760b700 [CRYSTAL-MT-2] fiber=0x7f5c3c628c80 [main]
sched resume t=1850540 d=-1 thread=0x7f5c3c654740 [?] fiber=0x7f5c3c628e60 [Worker Loop] fiber=0x7f5c3c628a00 [Stack pool collector]
sched sleep t=1873291 d=-1 thread=0x7f5c3c654740 [?] fiber=0x7f5c3c628a00 [Stack pool collector] duration=5000000000
sched resume t=1885645 d=-1 thread=0x7f5c36e0a700 [CRYSTAL-MT-3] fiber=0x7f5c3c628aa0 [Stack pool collector] fiber=0x7f5c3c628b40 [main]
sched resume t=1893792 d=-1 thread=0x7f5c3c654740 [?] fiber=0x7f5c3c628a00 [Stack pool collector] fiber=0x7f5c3c628e60 [Worker Loop]
sched mt:sleeping t=1901774 d=-1 thread=0x7f5c36e0a700 [CRYSTAL-MT-3] fiber=0x7f5c3c628b40 [main]
sched mt:sleeping t=1906920 d=-1 thread=0x7f5c3c654740 [?] fiber=0x7f5c3c628e60 [Worker Loop]
sched reschedule t=2213898 d=-1 thread=0x7f5c36e0a700 [CRYSTAL-MT-3] fiber=0x7f5c3c628b40 [main]
sched reschedule t=2228347 d=-1 thread=0x7f5c3c654740 [?] fiber=0x7f5c3c628e60 [Worker Loop]
sched enqueue t=2252625 d=474 thread=0x7f5c3c654740 [?] fiber=0x7f5c3c628e60 [Worker Loop] fiber=0x7f5c3c628f00 [main]
sched event_loop t=2249414 d=19731 thread=0x7f5c3c654740 [?] fiber=0x7f5c3c628e60 [Worker Loop]
sched resume t=2280709 d=-1 thread=0x7f5c3c654740 [?] fiber=0x7f5c3c628e60 [Worker Loop] fiber=0x7f5c3c628f00 [main]
sched spawn t=2311378 d=-1 thread=0x7f5c3c654740 [?] fiber=0x7f5c3c628f00 [main] fiber=0x7f5c3c628960 [Signal Loop]
sched enqueue t=2325884 d=7211 thread=0x7f5c3c654740 [?] fiber=0x7f5c3c628f00 [main] fiber=0x7f5c3c628960 [Signal Loop]
sched enqueue t=2347459 d=986 thread=0x7f5c3860c700 [CRYSTAL-MT-1] fiber=0x7f5c3c628dc0 [main] fiber=0x7f5c3c628dc0 [main]
sched event_loop t=1761043 d=613105 thread=0x7f5c3860c700 [CRYSTAL-MT-1] fiber=0x7f5c3c628dc0 [main]
sched mt:slept t=1705398 d=695178 thread=0x7f5c3860c700 [CRYSTAL-MT-1] fiber=0x7f5c3c628dc0 [main]
sched resume t=2421042 d=-1 thread=0x7f5c3860c700 [CRYSTAL-MT-1] fiber=0x7f5c3c628dc0 [main] fiber=0x7f5c3c628960 [Signal Loop]
```

A crystal tool trace command has been added that generates stats from a trace file. Using an MT trace of the hello world example above, with both the GC and the scheduler sections enabled:

```console
$ CRYSTAL_TRACE=sched,gc bin/crystal run -Dpreview_mt -Dtracing hello.cr 2> trace.log
$ bin/crystal tool trace trace.log
gc:malloc events=225 durations=[total=1ms min=214ns max=438us mean=4us ±31us] sizes=[total=171KB min=4B max=32KB mean=780B ±4KB]
gc:realloc events=5 durations=[total=2us min=286ns max=777ns mean=534ns ±184ns]
gc:heap_resize events=3
gc:collect:mark events=1 durations=[total=156us min=156us max=156us mean=156us ±0ns]
gc:collect:sweep events=1 durations=[total=22us min=22us max=22us mean=22us ±0ns]
gc:collect events=1 durations=[total=386us min=386us max=386us mean=386us ±0ns]
sched:yield events=1
sched:resume events=14
sched:sleep events=4
sched:mt:sleeping events=5
sched:reschedule events=7
sched:enqueue events=5 durations=[total=14us min=452ns max=6us mean=2us ±2us]
sched:event_loop events=3 durations=[total=778us min=18us max=666us mean=259us ±289us]
sched:spawn events=1
sched:mt:slept events=2 durations=[total=1ms min=914us max=1ms mean=959us ±44us]
```

Note that the command isn't very fast. It makes some effort to avoid allocating memory (we can likely still optimize), but parsing a 60-million-event trace will still take dozens of seconds. The --fast switch skips some details to shave off some time.
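
For a sense of what the aggregation involves, here is a minimal one-pass sketch of per-event duration stats (illustrative only, not the tool's actual implementation):

```crystal
# Sketch: accumulate count/total/min/max of event durations (nanoseconds)
# keyed by "section:operation", reading trace lines from stdin.
class Stats
  getter count = 0_i64
  getter total = 0_i64
  getter min = Int64::MAX
  getter max = Int64::MIN

  def add(duration : Int64) : Nil
    @count += 1
    @total += duration
    @min = duration if duration < @min
    @max = duration if duration > @max
  end

  def mean : Float64
    @count.zero? ? 0.0 : @total.to_f / @count
  end
end

stats = Hash(String, Stats).new { |h, k| h[k] = Stats.new }

STDIN.each_line do |line|
  parts = line.split(' ')
  next if parts.size < 4
  key = "#{parts[0]}:#{parts[1]}"  # e.g. "gc:malloc"
  if d = parts[3].lchop?("d=")     # duration field; -1 means "not measured"
    duration = d.to_i64? || -1_i64
    stats[key].add(duration) if duration >= 0
  end
end

stats.each { |key, s| puts "#{key} events=#{s.count} mean=#{s.mean.round}ns" }
```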

Technical notes

Each trace is buffered (on the stack) then written in one go directly to standard error (without going through STDERR); it is not written to a file. This is to avoid interleaved/mangled traces as much as possible when tracing from multiple threads, taking advantage of the fact that the whole event goes out in a single write.
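
A minimal sketch of that idea (not the patch's exact code; assumes POSIX LibC.write):

```crystal
# Format the whole event into a fixed stack buffer, then emit it with a
# single write(2) straight to fd 2, so concurrent threads can't interleave
# within one event. Illustration only.
message = "gc malloc t=283627 d=6049 size=16\n"
buf = uninitialized UInt8[512]          # stack buffer, no GC allocation
bytes = message.to_slice
size = Math.min(bytes.size, buf.size)
buf.to_slice.copy_from(bytes[0, size])  # "format" into the stack buffer
LibC.write(2, buf.to_unsafe, size)      # one write, bypassing STDERR
```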

Compiling with -Dtracing has a limited impact on executable performance (by default nothing is traced). Enabling some traces then has more impact, but the executable remains fast enough (despite measuring and tracing millions of events) that, for example, race conditions are still easily reproducible.
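
The near-zero cost when nothing is enabled comes from compile-time gating: without -Dtracing the macro branch compiles to nothing. A sketch of the pattern (names illustrative, not the actual API):

```crystal
# Without -Dtracing the body below is compiled out entirely; with it, a
# runtime check against CRYSTAL_TRACE decides whether the section traces.
SECTIONS = (ENV["CRYSTAL_TRACE"]? || "").split(',')

def trace(section : String, operation : String) : Nil
  {% if flag?(:tracing) %}
    return unless SECTIONS.includes?(section)
    # the real implementation formats on the stack instead of using STDERR
    STDERR.puts "#{section} #{operation} t=#{Time.monotonic.total_nanoseconds.to_i64}"
  {% end %}
end

trace "gc", "malloc"
```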

We also don't go through a channel the way logs do, to avoid fiber context switches that would pollute the trace, but also because we can't: tracing starts before the Crystal application has even started.

Evolutions

We could augment the tracing to report more (or less) data about the program, for example only outputting GC collection stats (with more before/after details). We could also listen for a signal (and/or a keyboard shortcut) to stop the world and print details about the threads and schedulers.
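
A rough sketch of the signal idea (illustrative only: a real stop-the-world dump would need runtime internals, not public API):

```crystal
# On SIGUSR1, print a marker; the runtime could instead stop the world
# and dump per-thread/scheduler state here.
Signal::USR1.trap do
  STDERR.puts "== dump requested: thread/scheduler details would go here =="
end

puts "kill -USR1 #{Process.pid} to request a dump"
sleep 30.seconds
```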

HertzDevil commented 1 month ago

Would this fit better in crystal-lang/perf-tools, or does it touch so many internals that a shard implementation is impossible?

ysbaddaden commented 1 month ago

@HertzDevil I've been wondering about that. On the general side it's mostly having main call Crystal::Tracing.init; then there are all the little patches to inject the trace calls... nothing impossible, but tedious. The GC collection stats are also very dependent on the underlying implementation, but with callbacks they could be abstracted?
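
(For illustration, such a callback hook could look like the following; all of these names are hypothetical, none exist today:)

```crystal
# Hypothetical: the stdlib GC wrapper invokes a registered hook around
# collections, so a shard (e.g. perf-tools) could gather stats without
# patching the GC implementation itself.
module GC
  @@on_collect : Proc(Time::Span, Nil)?

  def self.on_collect(&block : Time::Span -> Nil) # hypothetical API
    @@on_collect = block
  end

  def self.notify_collect(duration : Time::Span) : Nil # called by the wrapper
    @@on_collect.try &.call(duration)
  end
end

GC.on_collect { |d| STDERR.puts "gc collect took #{d.total_microseconds}us" }
```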

So, it's possible to externalize, and the perf-tools shard demonstrates that.

I still personally lean toward having it right in the stdlib, ready to be used and extended (e.g. sync primitives: channels, wait groups, mutexes) with just a compilation flag.

ysbaddaden commented 1 month ago

Closed in favor of #14659