6809 / MC6809

Implementation of the MC6809 CPU in Python (Extracted from https://github.com/jedie/DragonPy project)
GNU General Public License v3.0

Some optimizations #1

Open jerch opened 9 years ago

jerch commented 9 years ago

Here are some quick optimizations to speed up the code in CPython. Most bottlenecks are cascades of function calls, especially the setters/getters of the register objects (commits 9b73477da6612985f0713a4617b53ee47ee1bc00, fac8dbca90a5d5e716be49f5ee60c509620703fd and, partly, 038475b6b377f0feb67461f5b436ef704ef15255). Direct memory access in the CPU code also shows some benefit (234d57147314db2c49a75eb4ea645e397d530175).

Benchmark results: ~2 million cycles/s in CPython 2.7 and ~17 million cycles/s in PyPy.
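For illustration, a minimal sketch of the kind of change described above: register access through getter/setter methods on a register object versus a plain attribute on the CPU instance. Class, method, and attribute names here are made up for the example, not taken from the project's actual API.

```python
# Hedged sketch (illustrative names only): the "slow" path routes every register
# access through method calls on a register object; the "fast" path keeps the
# register as a plain int attribute on the CPU instance.
class Register:
    def __init__(self, value=0):
        self.value = value
    def get(self):
        return self.value
    def set(self, value):
        self.value = value & 0xFF  # 8-bit register

class CPU:
    def __init__(self):
        self.accu_a = Register()   # slow variant: register object
        self.accu_a_value = 0      # fast variant: plain int

    def inca_slow(self):
        self.accu_a.set(self.accu_a.get() + 1)               # two method calls per access

    def inca_fast(self):
        self.accu_a_value = (self.accu_a_value + 1) & 0xFF   # one attribute lookup, no calls
```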

jerch commented 9 years ago

Some more speed tests:

In summary, the biggest problems for your emulator in CPython are function calls, followed by dotted-name lookups. Speed-wise, the best would be one big CPU loop with all the state in the local namespace. Since Python has no low-level jumps (goto, switch), this is hardly doable, and any attempt will only lead to really ugly code. The best we have for code jumps are function mappings, at the cost of all the function-call overhead around them. Btw, your elif cascade in .get_ea_indexed is O(n). With a mapping it would be O(1), but you would need a function as the jump target, which adds more constant overhead than you gain. As a workaround you could stick with the if cascade and restructure it as a binary search (see the sketch below).
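To make the dispatch trade-off concrete, here is a small hedged sketch with invented cases; it is not the real indexed-addressing logic from .get_ea_indexed.

```python
def dispatch_elif(mode, x):
    # O(n): worst case walks every branch, but each branch is inline (no call overhead).
    if mode == 0:
        return x + 1
    elif mode == 1:
        return x - 1
    elif mode == 2:
        return x << 1
    else:
        return x >> 1

HANDLERS = {
    0: lambda x: x + 1,
    1: lambda x: x - 1,
    2: lambda x: x << 1,
    3: lambda x: x >> 1,
}

def dispatch_mapping(mode, x):
    # O(1) lookup, but every hit now pays for an extra function call and frame.
    return HANDLERS[mode](x)

def dispatch_binary(mode, x):
    # Compromise: keep inline branches, but order the tests like a binary search,
    # so the comparison depth is O(log n) instead of O(n).
    if mode < 2:
        if mode == 0:
            return x + 1
        return x - 1
    if mode == 2:
        return x << 1
    return x >> 1
```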

jedie commented 9 years ago

Thanks for your contribution here!

Please add yourself to AUTHORS ;)

Dotname reduction: Only tried it with the CC register, by moving all of its logic into the CPU class. Around 15% speedup (~2.3 million cycles/s). The main difference in the code is self.cc.Z vs. self.Z. The speed gain is somewhat impressive given that it is only one lookup shorter. This might be promising ground for refactoring; not sure how mixins with the MRO will do here.

That sounds like a nice idea!
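As a rough illustration of the dotname reduction described above (flag and class names here are only indicative of the idea, not copied from the codebase):

```python
# Sketch only: condition-code flags on a separate CC object vs. flattened onto
# the CPU instance, saving one attribute lookup per flag access.
class ConditionCodes:
    def __init__(self):
        self.Z = 0
        self.N = 0

class CPUWithCCObject:
    def __init__(self):
        self.cc = ConditionCodes()
    def update_nz8(self, value):
        self.cc.Z = 1 if value == 0 else 0      # self -> cc -> Z: two lookups
        self.cc.N = 1 if value & 0x80 else 0

class CPUFlattened:
    def __init__(self):
        self.Z = 0
        self.N = 0
    def update_nz8(self, value):
        self.Z = 1 if value == 0 else 0         # self -> Z: one lookup
        self.N = 1 if value & 0x80 else 0
```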

Implementing standard registers with fast builtin types: Replacing the register objects with a list like [value, name, width] shows around a 20% speedup (~2.4 million cycles/s). The downside is ugly code with all those magic index numbers, plus more instructions in the CPU code to replace .set, .increment and .decrement.

Yes, ugly code for more speed is possible, but that is not my goal. If I want speed, then I would not use Python :P
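For reference, a hedged sketch of the list-based register idea quoted above; the field order and the inlined operations are assumptions for illustration, not the branch's actual code.

```python
# Sketch: register kept as a plain list [value, name, width]; the former
# .set()/.increment() method calls become inline index operations in the CPU code.
VALUE, NAME, WIDTH = 0, 1, 2

accu_a = [0, "A", 8]                        # instead of a Register instance

# Former accu_a.increment():
accu_a[VALUE] = (accu_a[VALUE] + 1) & 0xFF

# Former accu_a.set(new_value):
new_value = 0x42
accu_a[VALUE] = new_value & 0xFF
```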

jerch commented 9 years ago

Haha, yeah, Python should not be the first choice for number crunching. Nevertheless I started a new branch as a playground, just to see how far CPython can be pushed :D The benchmark is at 2.8 million cycles/s (23 million for PyPy) and the code is already quite unpythonic and degraded. Welcome to the big monster loop ;)
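A minimal, purely illustrative sketch of the "one big loop with everything in locals" direction being described; the opcode handling and attribute names below are invented, not the branch's real loop.

```python
# Illustration only: frequently used state is hoisted into local variables before
# the loop and written back afterwards, so the hot path avoids repeated self.xxx
# lookups and per-opcode function calls.
class CPU:
    def __init__(self, memory):
        self.memory = memory
        self.pc = 0
        self.accu_a = 0
        self.cycles = 0

    def run(self, max_cycles):
        memory = self.memory          # hoist attributes into locals
        pc = self.pc
        a = self.accu_a
        cycles = self.cycles
        end = cycles + max_cycles

        while cycles < end:
            opcode = memory[pc]
            pc += 1
            if opcode == 0x4C:        # INCA-like handling in this sketch
                a = (a + 1) & 0xFF
                cycles += 2
            else:                     # everything else treated as a NOP here
                cycles += 2

        self.pc = pc                  # write the locals back to the instance
        self.accu_a = a
        self.cycles = cycles
```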

jedie commented 9 years ago

Welcome to the big monster loop ;)

Yes, that will be the fastest way. ...and all local variables ;) ... Maybe a better idea is to generate the code. Look at: https://github.com/6809/MC6809/blob/master/MC6809/components/MC6809data/MC6809_op_data.py

I used this data to generate the CPU skeleton. And I use it to generate: https://github.com/6809/MC6809/blob/master/MC6809/components/cpu_utils/instruction_call.py

The Instruction_generator.py is here: https://github.com/6809/MC6809/blob/master/MC6809/components/cpu_utils/Instruction_generator.py

Maybe it's possible to copy the real opcode implementations from https://github.com/6809/MC6809/blob/master/MC6809/components/cpu6809.py into MC6809_op_data.py. Then all the information would be there to generate a complete CPU class.
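A hedged sketch of what generating a dispatch method from an opcode data table could look like; the table layout and generated names below are assumptions for illustration, not the actual contents of MC6809_op_data.py or Instruction_generator.py.

```python
# Illustration only: a tiny opcode table and a generator that emits Python source
# for a dispatch method, in the spirit of generating the CPU class from data.
OP_DATA = [
    {"opcode": 0x12, "mnemonic": "NOP", "cycles": 2},
    {"opcode": 0x4C, "mnemonic": "INCA", "cycles": 2},
]

def generate_cpu_source(op_data):
    lines = [
        "class GeneratedCPU(CPUBase):",
        "    def call_instruction(self, opcode):",
    ]
    keyword = "if"
    for entry in op_data:
        lines.append(f"        {keyword} opcode == 0x{entry['opcode']:02X}:")
        lines.append(f"            self.instruction_{entry['mnemonic']}()")
        lines.append(f"            self.cycles += {entry['cycles']}")
        keyword = "elif"
    return "\n".join(lines)

print(generate_cpu_source(OP_DATA))
```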

But this is not my intention ;)

Btw. around 880,000 CPU cycles/s is real-time speed. See: https://github.com/jedie/DragonPy/blob/master/dragonpy/Dragon32/gui_config.py#L77-L84

Btw. please add yourself to AUTHORS, so I can merge.