Benchmarking shows all the subcomponents taking up non-trivial computation time. This includes Temportal, which 99% of the time is doing nothing more than an increment followed by a compare. My hypothesis is that because each subcomponent has its own timer, this creates unnecessary duplicated work. It probably isn't much, but maybe we can eke 5-10% out.
Current architecture:
CPU: tick -> cmp_int -> [maybe fire]
Video: tick -> cmp -> dispatch work
Timer: tick -> cmp -> [slow tick]
Temportal: tick -> cmp -> [save state]
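The current layout can be sketched as four independent countdowns, each ticked every master cycle. A minimal sketch (the `Countdown` type and periods are illustrative, not the actual emulator code):

```rust
// Illustrative stand-in for a per-component timer: the hot path is
// just an increment followed by a compare, duplicated per component.
struct Countdown {
    counter: u32,
    period: u32,
}

impl Countdown {
    fn new(period: u32) -> Self {
        Countdown { counter: 0, period }
    }

    // Returns true when the countdown elapses (the rare case).
    fn tick(&mut self) -> bool {
        self.counter += 1;
        if self.counter >= self.period {
            self.counter = 0;
            true
        } else {
            false
        }
    }
}

fn main() {
    // Four independent countdowns, all ticked every master cycle,
    // so the increment+compare work is repeated four times per cycle.
    let mut cpu = Countdown::new(4);
    let mut video = Countdown::new(456);
    let mut timer = Countdown::new(16);
    let mut temportal = Countdown::new(1024);

    let mut cpu_fires = 0;
    for _ in 0..4096 {
        if cpu.tick() { cpu_fires += 1; } // [maybe fire]
        if video.tick() { /* dispatch work */ }
        if timer.tick() { /* slow tick */ }
        if temportal.tick() { /* save state */ }
    }
    println!("cpu fired {} times", cpu_fires);
}
```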
Proposed architecture:
Timer: tick -> cmp -> [slow tick]
CPU: tick -> cmp -> dispatch work (CPU instruction duration varies, so it needs its own timer)
Video: offset_cmp -> dispatch work (not sure if applying an offset to the video is better than a separate timer)
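The offset_cmp idea might look like this: one shared cycle counter gets the single increment, and video only pays a compare against a precomputed next-fire deadline. A hedged sketch, with the period, offset, and cycle count as made-up placeholder values:

```rust
fn main() {
    const VIDEO_PERIOD: u64 = 456; // illustrative, not the real period

    let mut cycles: u64 = 0;
    let mut next_video: u64 = 12; // phase offset into the shared counter
    let mut dispatches = 0;

    for _ in 0..4560 {
        // One shared increment per master cycle...
        cycles += 1;
        // ...and video only does a compare against its next deadline.
        if cycles == next_video {
            dispatches += 1;            // dispatch work
            next_video += VIDEO_PERIOD; // schedule the next deadline
        }
    }
    println!("video dispatched {} times", dispatches);
}
```

Comparing against a precomputed deadline keeps the per-cycle cost to a single equality check, avoiding both a separate counter increment and a modulo.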
The ticks show up when profiling individual pieces, but outside of profiling the compiler seems to optimize them away. Even after manually unrolling loops, I did not get any performance gains.