ITotalJustice / notorious_beeg

gba emulator written in c++23
https://notorious-beeg.netlify.app/
GNU General Public License v3.0

reduce template bloat of instruction handlers #11

Closed ITotalJustice closed 2 years ago

ITotalJustice commented 2 years ago

from my readme:

reduce template bloat for arm/thumb instructions by creating dedicated functions for every possible version,
i.e. data_proc_add_imm_s. this will GREATLY reduce bloat, which is nicer for icache and also compile times.

to achieve this, still keep the normal template function, such as data_proc, so data_proc_add_imm_s will call that template func.
the template<> will have to be toggled by a macro so that debug builds can still compile fast (instant) and not be templated at all.
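
a minimal sketch of the idea above, not the emulator's actual code. only data_proc and data_proc_add_imm_s come from the issue text; Cpu, AluOp, data_proc_impl and the BEEG_TEMPLATED_HANDLERS macro are hypothetical names, and the operand decoding is deliberately simplified (no shifts or rotates, no carry/overflow flags).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical toggle: 1 = templated handlers (release), 0 = plain calls (debug).
#ifndef BEEG_TEMPLATED_HANDLERS
#define BEEG_TEMPLATED_HANDLERS 1
#endif

// Hypothetical, heavily reduced CPU state.
struct Cpu {
    std::uint32_t regs[16]{};
    bool zero{};
    bool negative{};
};

enum class AluOp { add, sub };

// Shared body that takes the decode options as ordinary arguments.
inline void data_proc_impl(Cpu& cpu, std::uint32_t opcode, AluOp op, bool imm, bool s) {
    const std::uint32_t rn = (opcode >> 16) & 0xF;
    const std::uint32_t rd = (opcode >> 12) & 0xF;
    // Simplification: the immediate rotate and register shifts are ignored.
    const std::uint32_t operand2 = imm ? (opcode & 0xFF) : cpu.regs[opcode & 0xF];
    const std::uint32_t result = (op == AluOp::add) ? cpu.regs[rn] + operand2
                                                    : cpu.regs[rn] - operand2;
    if (s) {
        cpu.zero = (result == 0);
        cpu.negative = (result >> 31) != 0;
    }
    cpu.regs[rd] = result;
}

#if BEEG_TEMPLATED_HANDLERS
// Release: the options are template arguments, so they are compile-time constants.
template <AluOp Op, bool Imm, bool S>
inline void data_proc(Cpu& cpu, std::uint32_t opcode) {
    data_proc_impl(cpu, opcode, Op, Imm, S);
}

// Dedicated named handler that forwards to one instantiation.
inline void data_proc_add_imm_s(Cpu& cpu, std::uint32_t opcode) {
    data_proc<AluOp::add, true, true>(cpu, opcode);
}
#else
// Debug: same named handler, no template involved at all.
inline void data_proc_add_imm_s(Cpu& cpu, std::uint32_t opcode) {
    data_proc_impl(cpu, opcode, AluOp::add, true, true);
}
#endif
```

either way the decode table only ever refers to data_proc_add_imm_s, so flipping the macro does not touch the call sites.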

my cpu has 256KiB icache, which my final binary far exceeds (1.0 MiB (1,094,928)). this is with full optimisations and lto. without lto, it's much bigger still...

really there's not much code to the emulator, so i really think i can at the very least get it to ~512KiB, likely a LOT smaller.

without tables generated (-O3 -lto) and built as a single file (all hot functions inlined) the final binary is 123.6 KiB (126,520).


summary: (all -O3 -lto, single file (force inlined r/w funcs))

ITotalJustice commented 2 years ago

[ARM] 2553
data_processing: 895
// multiply: 4
multiply_long: 8
// single_data_swap: 2
// branch_and_exchange: 1
halfword_data_transfer_register_offset: 55
halfword_data_transfer_immediate_offset: 59
single_data_transfer: 1024
// undefined: 768
block_data_transfer: 512
// branch: 512
// software_interrupt: 256

[THUMB] 784
move_shifted_register: 96
add_subtract: 32
move_compare_add_subtract_immediate: 128
alu_operations: 16
hi_register_operations: 16
pc_relative_load: 32
load_store_with_register_offset: 32
load_store_sign_extended_byte_halfword: 32
load_store_with_immediate_offset: 128
load_store_halfword: 64
sp_relative_load_store: 64
load_address: 64
// add_offset_to_stack_pointer: 4
push_pop_registers: 16
multiple_load_store: 64
// conditional_branch: 60
// software_interrupt: 4
// unconditional_branch: 32
// long_branch_with_link: 64
// undefined: 76

TOTAL: 3337

this is the total number of functions generated for each handler.

the commented out functions are those that are not templated, so they're not counted in the total.

the most generated function by far is single data transfer at 1024. however, only 6 bits are needed to decode everything in the instruction, so 2^6 = 64. 1024 down to just 64 functions...
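
a rough sketch of that counting, under assumptions: in the ARM single data transfer encoding the six option flags I, P, U, B, W, L sit in opcode bits 25..20, so six independent bits give 2^6 = 64 distinct variants. that 1024 corresponds to 10 decode bits (2^10) is my inference from the count, not something stated here. the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Number of distinct handlers needed when `bits` independent bits select a variant.
constexpr std::uint32_t variants_for_bits(unsigned bits) {
    return 1u << bits;
}

static_assert(variants_for_bits(10) == 1024); // assumed: 10 bits feed the template
static_assert(variants_for_bits(6) == 64);    // six flag bits: I P U B W L

// Extracts the six flag bits (opcode bits 25..20) as a 0..63 variant index.
constexpr std::uint32_t single_data_transfer_variant(std::uint32_t opcode) {
    return (opcode >> 20) & 0x3F;
}

// Opcodes that differ only outside bits 25..20 share one variant.
static_assert(single_data_transfer_variant(0xE5910004u) ==
              single_data_transfer_variant(0xE5910FF0u));
```

with an index like this, the decode table can point many opcode patterns at the same handler instead of instantiating one per pattern.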

ITotalJustice commented 2 years ago

some notes regarding min size without templating reg/imm in data_proc and single_data.

data processing:

// 245.3 KiB (251,168)
// 372.3 KiB (381,192) max
// 312.6 KiB (320,064) new

single data transfer:

// 242.4 KiB (248,256) without
// 269.4 KiB (275,832) 00
// 431.6 KiB (442,000) max
// 317.5 KiB (325,104) new

ITotalJustice commented 2 years ago

with the above two commits, as of commit https://github.com/ITotalJustice/notorious_beeg/commit/ee92bc5202bc4b46b1001a8ee16914142babae40, the final size is: 366.1 KiB (374,864)

that's a reduction of 703.2 KiB (720,064)
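
the byte figures can be checked directly. a small sketch using only the two sizes quoted in this thread (1,094,928 before, 374,864 after); the constant and function names are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Binary sizes in bytes, as quoted in this thread.
constexpr std::uint64_t size_before = 1'094'928; // fully templated build
constexpr std::uint64_t size_after = 374'864;    // after the two commits

constexpr std::uint64_t reduction = size_before - size_after;
static_assert(reduction == 720'064);

// Converts a byte count to KiB (1 KiB = 1024 bytes).
constexpr double to_kib(std::uint64_t bytes) {
    return static_cast<double>(bytes) / 1024.0;
}
```

to_kib(reduction) comes out at 703.1875, i.e. roughly 703.2 KiB when rounded to one decimal place.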