Open ISSOtm opened 3 years ago
For reference, gprof
output after running current master rgbasm
on pokecrystal's main.asm
(which produces a 5 MB main.o file): https://pastebin.com/F3n5vfA6
gprof
is probably not a good choice: https://stackoverflow.com/a/1779343
Valgrind could be useful, though not on Windows or Cygwin.
gprof can point to hotspots, however, it has all kinds of problems with the optimizer and static functions. Quite likely the fstk_Init
count is actually a static function that is somewhere after it in the binary.
Disabling the compiler optimizations isn't really an option, as that makes the measurement invalid.
And it looks like gprof is spending a lot of time in it's own accounting function (_mcount_private
), which makes any time based result invalid.
valgrind --tool=callgrind --dump-instr=yes --simulate-cache=yes --collect-jumps=yes ./rgbasm -o main.o main.asm
produces accurate-looking results.
Maybe peekInternal
could be optimized for the common peek(0)
and peek(1)
cases, and for when nothing is getting expanded further yet.
I think peekInternal is also reading from disk, so I would read from tmpfs to make sure it's not some kind of disk performance thing you are seeing here.
Also regarding performance, look into whether mmap
is actually an improvement, as noted in #557.
According to perf annotate
on pokecrystal's main.asm
file, 31% of the CPU time is spent copying the yylval
type, which is 264 bytes large (0x108). Moving to variable-size strings (#650) should help reduce this; I think the next larger member of the %union
is the struct Expression
, but that's already much shorter.
The struct Expression
copying could also be improved by arranging its members from largest to smallest for better packing; same for any other widely-copied structs. (Unfortunately this can mean losing some meaningful order to them, but unlike Rust, the compiler won't optimize it for you.)
Bison's C++ parser, using its own variant
instead of a union
and allowing tokens to have nontrivial constructors, significantly slows things down. We might be able to switch back to a C-style one and add manual allocation of nontrivial token values (plus %destructor
s).
Some people have been complaining about RGBDS performance being subpar. With the growing complexity of the codebase, particularly the increasing amount of features—especially for RGBASM—, I am worried about how taxing a given change may be on the overall performance.
This is split in two sub-problems:
Profiling
To know how slow the programs currently are, we should identify their processing bottlenecks. I did that with
perf
once, but the data was fairly lackluster beyond "60% of your time is spent insideyyparse
". Maybegprof
would be better, or something else?Measuring the performance impact of changes
mmap
is actually worth it, in several scenarios (macro-heavy vs. largely linear, for ex.)