Z80A Optimizations - Githubissues

Asnivor commented 5 years ago

@Scepheo @alyosha-tas @vadosnaprimer

So earlier in the year @Scepheo created a generic interface type on MOS 6052X for talking to emulator cores, allowing the JITer to generate much faster code than using the traditional Func<> or Action<> delegate pointers.

Commit is here: https://github.com/TASVideos/BizHawk/commit/f3ea6fe0254302cf9312d2eefe11eb822f6f8bc2

So I have 2 questions really.

Did this provide a noticeable performance benefit during testing for the cores that use it?
And if so, would the Z80A cpu core benefit from having a similar implementation?

@alyosha-tas to my untrained eye it appears as if there is enough delegate use in the core that this might be something we should consider, but looking for other people's opinions on this.

vadosnaprimer commented 5 years ago

I haven't tested, someone should just compare.

alyosha-tas commented 5 years ago

Beats me, I don't really understand what it does or why it's beneficial. Over my head I'm afraid, but do what you want I guess.

Asnivor commented 5 years ago

Ok, so I converted Z80A (in a local branch) to use a generic interface rather than delgates (and setup all the cores to call the generic struct rather than the CPU directly).

Initial (non-scientific) testing is showing an approximate average 1-3 FPS increase on the ZXHawk core. The SMS core looks to actually be slower because of it, but that may be just down to my implementation. There are a couple of other things I will try but ultimately, it feels like that for the amount of 'mess' this generates in the codebase, it's probably not worse the hassle.

However, I have identified the apparent Z80 slowdown of recent times, and it is a big performance hit.

@alyosha-tas It looks like the interrupt handling rewrites is where the money is. If you compare this build (that is before the Z80 commits starting at https://github.com/TASVideos/BizHawk/commit/4cdcb807214dde3cc6de65a2a3612dbbce0e89ec):

https://ci.appveyor.com/project/zeromus/bizhawk-udexo/builds/19055595/artifacts

with this build (that comes soon after the last Z80 commit https://github.com/TASVideos/BizHawk/commit/19a25e55fb8d8b5901cf9e37d8cfd9e95685dfcc):

https://ci.appveyor.com/project/zeromus/bizhawk-udexo/builds/19343904/artifacts

There is a fairly stark contrast...

SMSHawk (SG1000)

ZXHawk (128K +3)

ColecoHawk

Obviously running two bizhawk instances simultaneously is not a great test, but when running the builds independently the FPS disparity is actually a little greater. For completeness, I am running on an i5-6600 @ 3.30GHz with 32GB RAM so this slowdown doesn't actually affect me that much (unless I'm debugging in VS). But for folks with lesser specs this could potentially cause problems, especially if TAStudio is thrown into the mix.

So I guess is there anything obvious with the interrupt rewrite causing this that can be easily fixed? Aside from an increase in conditional logic in https://github.com/TASVideos/BizHawk/commit/4cdcb807214dde3cc6de65a2a3612dbbce0e89ec nothing jumps out at me (and I dont really know much about .NET optimization).....

alyosha-tas commented 5 years ago

I honestly don't know why it slowed down that much after the rewrite. I was taking a pretty big shortcut when I first wrote the z80 code (hoping that I wouldn't need the signal that was of course eventually needed) and there is really no getting around the need to check for the instruction end condition that is now at the end of that loop. That's just how the CPU works.

Maybe someone really good at optimization can identify why exactly it slows down, but it's not that complicated as you mentioned, just an extra check condition at the end.

I could take a guess that this:

else if (IRQS[irq_pntr++] == 1)

is really slow and maybe I should refactor it to just use an integer somehow instead of another array.

Asnivor commented 5 years ago

Well, I have VS 2017 Pro now. I should probably learn how to use the profiling tools and see if that sheds any light on things...

Asnivor commented 5 years ago

Ok, so I have no idea about the profiler :/ I guess I'm not ready for that yet.

But, I tried converting IRQS into an integer (as you suggested) and the results appear to show some success (see https://github.com/TASVideos/BizHawk/tree/z80_optimzation). @alyosha-tas this appears to be working without error, but you might want to check that I havent screwed anything up. BTW That branch is just for testing at the moment, is messy and should not be merged.

September Build

Test Branch: https://github.com/TASVideos/BizHawk/commit/6a60657199a3da9ed6df7b4e3b3949910e79fa30

So that's a bit of an improvement as far as I can see, but its still behind what it was before. Perhaps this is unavoidable as more heavy lifting is being done.

But, it is possible we might get further performance micro-improvements by trying the following (maybe):

Currently I am doing IRQS = BUSRQ.Length; although I would expect it to be faster to explicitly supply an integer rather than doing a array.Length lookup every time.
Currently BUSRQ and MEMRQ are being re-instantiated with every instruction. I wonder if the compiler would appreciate assigning an existing ushort[] to both of these every time rather then re-instantiation? I don't really know that much about IL compiler goes about its business, but maybe its worth a try? This will of course require a (non-state) ushort[] object for each conceivable BUSRQ and MEMRQ combination (created at Z80 instantiation). I don't know if this will work, but in my mind it seems possible that passing a reference type might be quicker than re-initing an array?
If the last point shows performance improvement, then perhaps doing the same with cur_instr will yield beneficial results? The will all get fairly 'messy' pretty quickly though, so some thought probably needs to be put into how to set this out properly.

So, for BUSRQ and MEMRQ we could just have a load of possible array instantiated that are named in a descriptive way I guess. Eg:

        private ushort[] BUSRQ_0;
        private ushort[] BUSRQ_0_I_I;
        private ushort[] BUSRQ_0_I_PCh_0_0_PCh_PCh_PCh_PCh_PCh;
        //....

        private void InitArrays()
        {
            BUSRQ_0 = new ushort[] { 0 };
            BUSRQ_0_I_I = new ushort[] { 0, I, I };
            BUSRQ_0_I_PCh_0_0_PCh_PCh_PCh_PCh_PCh = new ushort[] { 0, I, PCh, 0, 0, PCh, PCh, PCh, PCh, PCh };
            //....
        }

Then just call these as neccessary:

BUSRQ = BUSRQ_0_I_I;

It's gonna look pretty hellish (and you might have a cleaner way to go about it), but if it improves performance what have we got to lose? Although thinking about it, a similar naming scheme for cur_instr arrays will be pretty terrible.. Maybe name them based on the enclosing method where possible? I dunno..

Maybe we just create static arrays for everything based on the method name, or even a static struct for each that contains BUSRQ, MEMRQ, IRQS (int) and cur_instr? That would probably look much cleaner and be more manageable.

edit - so cur_instr is created with values passed into the method, so this is not easily doable it seems. edit again - and so are MEMRQ and BUSRQ... guess I need to think about this more..

I'm conscious that I don't want to inadvertently turn the Z80 codebase into some kind of awful mess (more than I have already done), but i'll trying playing around a little more in the branch when I have time to see if these ideas work.

*edits - lots of edits

alyosha-tas commented 5 years ago

Readability is high priority for me. I tried quite hard to structure the code base in a way that was easily understandable and maintainable. Personally I'm willing to sacrifice a bit of performance to keep it that way.

Given the performance boost you got just from changing IRQS into an integer, it seems obviously good to do so. I didn't know ahead of time I would have been able to do that when I was first writing the code, so that's a good optimization.

If everything is good and it still makes it by all the Spectrum tests, then I'd say merge away.

Asnivor commented 5 years ago

If everything is good and it still makes it by all the Spectrum tests, then I'd say merge away.

So i've changed all the .Length lookups to integer assignments in branch here: https://github.com/TASVideos/BizHawk/commit/b4219b8242828ae5371b309b8e17c853bcab3cd5 Performance is good and all speccy/z80 tests are passing.

Readability is high priority for me. I tried quite hard to structure the code base in a way that was easily understandable and maintainable.

I did get to thinking if we could further increase performance without sacrificing structure and readability. I think the answer is 'yes' but you will need the final say on this.

Check the latest commit in the branch: https://github.com/TASVideos/BizHawk/commit/32cce86f51b6361b26a9559ffa81c38df1359ff7

I've basically given the cur_instr, BUSRQ and MEMRQ arrays a fixed size based on their current potential maximum size:

    public int instr_pntr = 0;
    public int bus_pntr = 0;
    public int mem_pntr = 0;
    public int irq_pntr = 0;
    public ushort[] cur_instr = new ushort[38];
    public ushort[] BUSRQ = new ushort[19]; 
    public ushort[] MEMRQ = new ushort[19];
    public int IRQS;

Created new methods (with optional default parameters) that assign values within the fixed arrays:

/// <summary>
/// Optimization method to set BUSRQ
/// </summary>      
private void PopulateBUSRQ(ushort d0 = 0, ushort d1 = 0, ushort d2 = 0, ushort d3 = 0, ushort d4 = 0, ushort d5 = 0, ushort d6 = 0, ushort d7 = 0, ushort d8 = 0,
    ushort d9 = 0, ushort d10 = 0, ushort d11 = 0, ushort d12 = 0, ushort d13 = 0, ushort d14 = 0, ushort d15 = 0, ushort d16 = 0, ushort d17 = 0, ushort d18 = 0)
{
    BUSRQ[0] = d0; BUSRQ[1] = d1; BUSRQ[2] = d2;
    BUSRQ[3] = d3; BUSRQ[4] = d4; BUSRQ[5] = d5;
    BUSRQ[6] = d6; BUSRQ[7] = d7; BUSRQ[8] = d8;
    BUSRQ[9] = d9; BUSRQ[10] = d10; BUSRQ[11] = d11;
    BUSRQ[12] = d12; BUSRQ[13] = d13; BUSRQ[14] = d14;
    BUSRQ[15] = d15; BUSRQ[16] = d16; BUSRQ[17] = d17;
    BUSRQ[18] = d18;
}

/// <summary>
/// Optimization method to set MEMRQ
/// </summary>  
private void PopulateMEMRQ(ushort d0 = 0, ushort d1 = 0, ushort d2 = 0, ushort d3 = 0, ushort d4 = 0, ushort d5 = 0, ushort d6 = 0, ushort d7 = 0, ushort d8 = 0,
    ushort d9 = 0, ushort d10 = 0, ushort d11 = 0, ushort d12 = 0, ushort d13 = 0, ushort d14 = 0, ushort d15 = 0, ushort d16 = 0, ushort d17 = 0, ushort d18 = 0)
{
    MEMRQ[0] = d0; MEMRQ[1] = d1; MEMRQ[2] = d2;
    MEMRQ[3] = d3; MEMRQ[4] = d4; MEMRQ[5] = d5;
    MEMRQ[6] = d6; MEMRQ[7] = d7; MEMRQ[8] = d8;
    MEMRQ[9] = d9; MEMRQ[10] = d10; MEMRQ[11] = d11;
    MEMRQ[12] = d12; MEMRQ[13] = d13; MEMRQ[14] = d14;
    MEMRQ[15] = d15; MEMRQ[16] = d16; MEMRQ[17] = d17;
    MEMRQ[18] = d18;
}

/// <summary>
/// Optimization method to set cur_instr
/// </summary>  
private void PopulateCURINSTR(ushort d0 = 0, ushort d1 = 0, ushort d2 = 0, ushort d3 = 0, ushort d4 = 0, ushort d5 = 0, ushort d6 = 0, ushort d7 = 0, ushort d8 = 0,
    ushort d9 = 0, ushort d10 = 0, ushort d11 = 0, ushort d12 = 0, ushort d13 = 0, ushort d14 = 0, ushort d15 = 0, ushort d16 = 0, ushort d17 = 0, ushort d18 = 0,
    ushort d19 = 0, ushort d20 = 0, ushort d21 = 0, ushort d22 = 0, ushort d23 = 0, ushort d24 = 0, ushort d25 = 0, ushort d26 = 0, ushort d27 = 0, ushort d28 = 0,
    ushort d29 = 0, ushort d30 = 0, ushort d31 = 0, ushort d32 = 0, ushort d33 = 0, ushort d34 = 0, ushort d35 = 0, ushort d36 = 0, ushort d37 = 0)
{
    cur_instr[0] = d0; cur_instr[1] = d1; cur_instr[2] = d2;
    cur_instr[3] = d3; cur_instr[4] = d4; cur_instr[5] = d5;
    cur_instr[6] = d6; cur_instr[7] = d7; cur_instr[8] = d8;
    cur_instr[9] = d9; cur_instr[10] = d10; cur_instr[11] = d11;
    cur_instr[12] = d12; cur_instr[13] = d13; cur_instr[14] = d14;
    cur_instr[15] = d15; cur_instr[16] = d16; cur_instr[17] = d17;
    cur_instr[18] = d18; cur_instr[19] = d19; cur_instr[20] = d20;
    cur_instr[21] = d21; cur_instr[22] = d22; cur_instr[23] = d23;
    cur_instr[24] = d24; cur_instr[25] = d25; cur_instr[26] = d26;
    cur_instr[27] = d27; cur_instr[28] = d28; cur_instr[29] = d29;
    cur_instr[30] = d30; cur_instr[31] = d31; cur_instr[32] = d32;
    cur_instr[33] = d33; cur_instr[34] = d34; cur_instr[35] = d35;
    cur_instr[36] = d36; cur_instr[37] = d37;
}

Then replaced every existing array re-init with calls to the above methods, retaining the same syntax (and hopefully readability) that was there before:

private void IN_()
{
    PopulateCURINSTR
            (IDLE,
                TR, W, A,
                WAIT,
                RD_INC, Z, PCl, PCh,
                IDLE,
                WAIT,
                WAIT,
                IN_A_N_INC, A, Z, W);

    PopulateBUSRQ(0, PCh, 0, 0, WIO1, WIO2, WIO3, WIO4);
    PopulateMEMRQ(0, PCh, 0, 0, WIO1, WIO2, WIO3, WIO4);
    IRQS = 8;
}

private void IN_REG_(ushort dest, ushort src)
{
    PopulateCURINSTR
            (IDLE,
                TR16, Z, W, C, B,
                WAIT,
                WAIT,
                IN_INC, dest, Z, W);

    PopulateBUSRQ(0, BIO1, BIO2, BIO3, BIO4);
    PopulateMEMRQ(0, BIO1, BIO2, BIO3, BIO4);
    IRQS = 5;
}

Although the new methods themselves look smelly, the code should be more performant once compiled, as (if I am reading things correctly), the new values are just being written into the arrays, rather than the array being marked for GC, a new array being allocated and only then being populated.

I was a little concerned at first with having fixed size arrays, but it doesnt look like the code is evaluating array length at any point so everything must be using the various _pntr integers to determine location within each array (which makes sense).

So, the results:

As you can see, in the case of the Spectrum we are seeing similar performance to that of the pre-interrupt work September build (which is good), whereas we are seeing a marked improvement on SG1000 and Coleco (which is even better). I assume the reason for this is that ZXHawk is non-cpu core heavy (compared to other in house cores - a fancy way of saying its slower) so z80 optimizations are not as apparent, whereas the other cores have less non-cpu stuff going on.

Again, all the spectrum tests are passing.

@alyosha-tas hopefully this sticks to the readability 'brief' and is an acceptable method of optimization for you.

Let me know your thoughts/issues/things that i've messed up.

alyosha-tas commented 5 years ago

Clever! Great work!

Yeah definitely merge that.

Asnivor commented 5 years ago

cool, done.

TASEmulators / BizHawk

Z80A Optimizations #1395