TASEmulators / BizHawk

BizHawk is a multi-system emulator written in C#. BizHawk provides nice features for casual gamers, such as full screen and joypad support, in addition to full rerecording and debugging tools for all system cores.
http://tasvideos.org/BizHawk.html

Why in the world are you using IPC in the SNES core? #690

Closed meepen closed 8 years ago

meepen commented 8 years ago

Why are you even using two threads? You are literally making each thread do nothing but idle while the other one is doing stuff. You don't need two threads.

SwadicalRag commented 8 years ago

:+1:

Velkon commented 8 years ago

yesplz

oplexz commented 8 years ago

👍

Techbot121 commented 8 years ago

bigthumb

vadosnaprimer commented 8 years ago

Sorry Mario, Byuu is in the other castle.

zeromus commented 8 years ago

libco and .net do not play well together. They're fundamentally incompatible, because .net depends on a proper win32 callstack to fire GC any time C++ calls back into .net, and libco tears up the stack. We tried fixing libco to reconstruct the stack well enough for .net, but failed.

Our bsnes fork calls into .net from c++ for two main reasons: 1) for callbacks such as acquiring input and delivering audio samples (could be worked around by re-engineering insides to buffer whatever's needed); and 2) for lua and trace events (can't be worked around, in principle). Since reason (2) can't be worked around, we didn't bother with reason (1).

We originally solved this whole problem by putting bsnes in another process entirely so that .net and c++ never existed in the same stack. Thus the IPC mechanism was needed.

Later we realized we could move bsnes back into the main bizhawk process if we were very careful to make sure that .net and c++ code never touched the same stack. This entailed using the bulky IPC framework we already had to marshal the calls between c# and c++ without actually directly calling on a foreign stack.
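A minimal sketch of that pattern, under stated assumptions (this is not BizHawk's actual code; `CallMarshaller` and everything in it are invented names): calls are queued to a dedicated thread and the caller blocks until the result comes back, so the two sides never execute on each other's stack.

```cpp
#include <condition_variable>
#include <functional>
#include <future>
#include <mutex>
#include <queue>
#include <thread>

// Hypothetical illustration: marshal every call onto one dedicated thread.
// The caller's stack and the callee's stack never interleave, which is the
// property the IPC framework preserved after the process->thread move.
class CallMarshaller {
    std::queue<std::function<void()>> jobs;
    std::mutex m;
    std::condition_variable cv;
    bool stopping = false;
    std::thread worker;

public:
    CallMarshaller() : worker([this] { run(); }) {}

    ~CallMarshaller() {
        { std::lock_guard<std::mutex> lk(m); stopping = true; }
        cv.notify_one();
        worker.join();
    }

    // Block the calling thread until fn has run on the worker thread,
    // then hand back its result -- it behaves like a synchronous call.
    int call(std::function<int()> fn) {
        std::promise<int> p;
        auto fut = p.get_future();
        {
            std::lock_guard<std::mutex> lk(m);
            jobs.push([&] { p.set_value(fn()); });
        }
        cv.notify_one();
        return fut.get(); // caller idles here while the worker runs
    }

private:
    void run() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lk(m);
                cv.wait(lk, [&] { return stopping || !jobs.empty(); });
                if (stopping && jobs.empty()) return;
                job = std::move(jobs.front());
                jobs.pop();
            }
            job();
        }
    }
};
```

Note the tradeoff this sketch makes visible: exactly one thread runs at a time, so the second thread buys stack isolation, not parallelism.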

The performance benefits of moving bsnes back into emuhawk.exe are essentially only a reduction of the bulky IPC overhead. This would not really be greatly appreciated except in abstruse scenarios. I don't know how much improvement it made for casual gaming, but it certainly did not make anything worse.

It's possible that the current mechanisms for running the IPC (there are two; if you've used bizhawk for more than 3 minutes you may have found the ring buffer option) are not as efficient as they could be, given the new reality of it running as a thread in the bizhawk process. We haven't re-assessed that since the process->thread move.

But that finally gets me to the answer to your question. The threads idle (spin) because it increased the performance of the communication between the processes, which was so direly bad that no cost (a wasted CPU core) was too high for any tiny speedup through reduced synchronization overhead.
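To illustrate that tradeoff, here's a hedged, hypothetical sketch (not BizHawk's code; all names are invented): two threads ping-pong control through an atomic flag, and the idle side busy-waits instead of sleeping in the kernel, burning a core to make each handoff as cheap as possible.

```cpp
#include <atomic>
#include <thread>

// 0 = main's turn, 1 = worker's turn
std::atomic<int> turn{0};
int shared_value = 0;

void worker_loop(int rounds) {
    for (int i = 0; i < rounds; i++) {
        // Spin until the main thread hands over control.
        while (turn.load(std::memory_order_acquire) != 1) { /* spin */ }
        shared_value += 1; // stand-in for "emulate one step"
        turn.store(0, std::memory_order_release);
    }
}

int run_rounds(int rounds) {
    shared_value = 0;
    std::thread worker(worker_loop, rounds);
    for (int i = 0; i < rounds; i++) {
        turn.store(1, std::memory_order_release); // hand control to worker
        // Spin until the worker hands control back.
        while (turn.load(std::memory_order_acquire) != 0) { /* spin */ }
    }
    worker.join();
    return shared_value;
}
```

Each handoff here is a couple of cache-line transfers; a mutex/condvar version would instead pay a kernel wakeup on every single tiny message, which is the "synchronization overhead" being avoided.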

We're going to have the same problem for MAME (and other) cores in libretro. Maybe we'll come up with a better solution, but don't count on it. bigturtle

meepen commented 8 years ago

I kind of already removed pipes from my local version and installed LuaJIT to increase the performance. I don't use any pipes or anything and run it in the same core as the main form's updating routines - I believe I have even gotten a 2 ms improvement out of it (judging by in-game fps with the display off)

SwadicalRag commented 8 years ago

does this even support IPC?

zeromus commented 8 years ago

@meepdarknessmeep Cool, I look forward to the patches. But if you did both at once, only you can guess which of the two things you did made the performance impact, since only you know how complex your scripts are.

meepen commented 8 years ago

I didn't do both at once - I used LuaJIT to improve the massive slowdown of big Lua scripts and included FFI in some of my memory-intensive scripts. I then removed the IPC in the SNES C core and the C# API, and it works flawlessly (as far as I can tell) - but I would probably need some help testing some of the debugging features, as I am not sure about all the debugging tools in this program.

meepen commented 8 years ago

Also, LuaJIT could potentially cause incompatibilities with some of the other cores' APIs, where they pause coroutine execution from within Lua 'cfunctions' - you seem to have included something to allow that. I had to manually remove some lua_yield calls and add them to my Lua scripts after calls to things like emu.advanceFrame so it would work properly. We will have to have a discussion about how - and if - you want to handle that in a breaking change (or hacky fix).

vadosnaprimer commented 8 years ago

Bizhawk has a couple of complex scripts to test some things on, but they are for Genesis (would that change anything?). They cause huge slowdown due to really frequent calls into the C side of Lua. Could this be helped by your method?

zeromus commented 8 years ago

Well, if you removed the serialization of the commands and events into a pipe, then don't bother with debugging or a patch, it's positively wrong.

I contemplated luajit just now, insofar as I checked the diffs and didn't see anything weird in lua precisely (the changes were mostly in luainterface) and thought it seemed safe. But you're clearly more up to date on the subject than I am, and it's possible weird changes predated our committing it. You are right, there is a lot of weirdness going on. Maybe I'm not so sure about this anymore.

meepen commented 8 years ago

@vadosnaprimer I would have to manually change the genesis core to remove ipc altogether

@zeromus Removing pipes (ipc) from this served it well.

The reason pipes aren't needed in this core is because it's not doing anything but making one thread wait on another (which is entirely against the concept of threading!)

Here's some example code from what I have done:


        // C# side: register a managed delegate as a native callback.
        [DllImport("libsneshawk-32-performance.dll", CallingConvention = CallingConvention.Cdecl)]
        static extern void SetSnesPathRequest(snesPathRequest_t f);

        string snesPathRequest(int slot, string hint)
        {
            if (pathRequest != null)
                hint = pathRequest(slot, hint);
            return hint;
        }

        void ShareSigFunctions()
        {
            SetSnesPathRequest(Keep<snesPathRequest_t>(snesPathRequest));
        }

// C++ side: store the managed delegate and call it directly.
typedef const char *(__cdecl *snesPathRequest_t)(int slot, const char *hint);
snesPathRequest_t snesPathRequestManaged;

SNES_EXPORT void SetSnesPathRequest(snesPathRequest_t f)
{
    snesPathRequestManaged = f;
}

char SNES_PATH_REQUEST_TEMP[2048];
const char* snes_path_request(int slot, const char* hint)
{
    // Copy into a static buffer so the managed string doesn't need to outlive
    // the call; strncpy doesn't guarantee termination, so terminate explicitly.
    strncpy(SNES_PATH_REQUEST_TEMP, snesPathRequestManaged(slot, hint),
            sizeof(SNES_PATH_REQUEST_TEMP) - 1);
    SNES_PATH_REQUEST_TEMP[sizeof(SNES_PATH_REQUEST_TEMP) - 1] = '\0';

    return SNES_PATH_REQUEST_TEMP;
}

Also another note - all the libco use in this core is just for jumping between code; it doesn't stop in the middle of random spots waiting for something else to be done. It's redundant.

zeromus commented 8 years ago

I can see you didn't read my long-winded explanation. One thread has to wait on another because one thread must complete before the other can resume; and there must be two threads to solve the GC problems.

Wait a minute, that example you gave is a prime example of one that can't be done that way. It's a callback from c++ to .net. When that happens, .net has a chance to run the GC, and bsnes may have a cothread switched underneath it. Maybe this exact one happens to work without a cothread switched underneath it, but others definitely have a cothread switched.

I know you won't believe me until you've seen the GC crash. Hopefully you won't ever see that.

meepen commented 8 years ago

That's the complete opposite of the point of threading. Threads should be (mostly) independent of each other when performing their tasks. What you just described (one thread having to wait on another) can generally be done with one thread.

In this implementation I made a single thread that can update everything at once (which could also be split into one thread for the main UI and one thread for the emulator if you need that).

I also don't see any GC problems with this new implementation - maybe you could show me once I put up a pull request?

I also understand that in the past, with two processes, all this was needed, but now it seems that there is no need.

Also, you said "it can't be done ... for some other functions" - can you explain which ones those are so I can see what you are talking about?

meepen commented 8 years ago

@zeromus ah, I see what you mean now. This can be easily prevented by just letting C# copy its string into a pointer passed to the function. Like:


        void snesPathRequest(int slot, IntPtr hint, int length)
        {
            if (pathRequest != null)
            {
                string ret = pathRequest(slot, Marshal.PtrToStringAuto(hint));
                // Write the result back into the caller-provided buffer,
                // clamped to its length.
                byte[] outdata = Encoding.ASCII.GetBytes(ret);
                Marshal.Copy(outdata, 0, hint, Math.Min(outdata.Length, length));
            }
        }

and of course adjust all the others accordingly
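For illustration, the native counterpart to that idea could look something like this (a hedged sketch with assumed names - `snesPathRequestBuf_t` and `snesPathRequestBufManaged` are not BizHawk's actual identifiers): the C++ caller owns the buffer, and the managed callback writes its answer into it, so no managed pointer ever crosses back over the boundary.

```cpp
#include <string.h>

// Hypothetical callback type: the managed side fills a caller-owned buffer
// instead of returning a pointer the GC could move or reclaim.
typedef void (*snesPathRequestBuf_t)(int slot, char *hint, int length);
static snesPathRequestBuf_t snesPathRequestBufManaged;

static char SNES_PATH_REQUEST_TEMP[2048];

const char *snes_path_request(int slot, const char *hint)
{
    // Seed the caller-owned buffer with the hint...
    strncpy(SNES_PATH_REQUEST_TEMP, hint, sizeof(SNES_PATH_REQUEST_TEMP) - 1);
    SNES_PATH_REQUEST_TEMP[sizeof(SNES_PATH_REQUEST_TEMP) - 1] = '\0';
    // ...and let the managed side overwrite it in place, clamped to its size.
    snesPathRequestBufManaged(slot, SNES_PATH_REQUEST_TEMP,
                              sizeof(SNES_PATH_REQUEST_TEMP));
    return SNES_PATH_REQUEST_TEMP;
}
```

The fixed 2048-byte size and the clamping are just illustrative; the point is only that the buffer's allocation and lifetime stay under native control.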

zeromus commented 8 years ago

(I don't see how your last fix solves anything. The basic problem is supposed to be calling into .net from c++)

You're right, this is not the usual point of threading. It's an extremely unusual point. The usual point of threading is to parallelize. The point of threading here is to separate the .net and c++ callstacks. It is not hard to come up with other points for threads, too. Did you know? There were threads before there were systems with 2 CPUs.

snes_input_poll, snes_video_refresh, etc. would be more vulnerable to this problem, as they're certainly meant to be called from the midst of emulation.

But I need to think about this more. There's a possibility this problem with two callstacks got punctured and deflated at some point. Check snes_video_refresh for example. See the SETCONTROL call? Just above it, we have what is effectively EVEN MORE parameter marshaling, and then SETCONTROL switches back to another cothread. That's an adequate amount of marshaling and stack management to solve the problem--if it's applied carefully.

If that thread used by SETCONTROL was a bsnes-created cothread, then this would positively be subject to the GC bug (which your users will show you in the form of a mysterious crash). However, if it's the main thread (and it seems to be in this case--see co_control = co_active()) then there's no GC bug.

Since the adequate marshaling and stack management is being done by the callbacks in bsnes, the API from c# to bsnes can become fully normal (compared to other bizhawk cores).

I have to think about this more.

meepen commented 8 years ago

I believe you may not have seen this section from a post a few back: "Also another note - all the libco use for this core is only used for jumping between code - it doesn't stop in the middle of random spots waiting for something else to be done. It's redundant."

example: your snes_video_refresh code could be changed to:

void snes_video_refresh(const uint32_t *data, unsigned width, unsigned height)
{
    int destOfs = snesVideoRefreshManaged(width, height);
    char* buf = (char*)hMapFilePtr + destOfs;
    int bufsize = 512 * 480 * 4;
    memcpy(buf, data, bufsize);
}

since all the original does is jump back to control, check exitType then exitCallbackType, pipe it back to the main thread, and jump back to the other cothread.

meepen commented 8 years ago

we should probably stop editing our posts, as it is definitely causing confusion amongst us :-)

zeromus commented 8 years ago

OK, snes_video_refresh was not a good example. As I said early on, that's just stuff that could be buffered (your code snippet's response was to buffer it). A better example would be, for instance, debug_op_read(). You can see that's got SETCONTROL too.

You're wrong about your quote though. Every time SETCONTROL runs, control switches away from the emulation thread and to the control thread until something switches back to the emulation thread (SETEMU). In the case of debug_op_read, a "BRK" type command is executed, which fires the event into the pipe, then ends up going to HANDLEMESSAGES which blocks reading a response from the emulator. This is all the very definition of stopping in a random spot and doing something else: The cpu may be in the middle of reading an operand, bounce back to the control thread, fire an event to the frontend which funnels it to lua, and then the user can even hang the process there somehow and wait for the user to press a key. In principle this should be used for interactive debugging, but other problems prevent that.

meepen commented 8 years ago

I don't exactly understand what makes it different from snes_video_refresh, as you yourself said the multithreading in the code is only for GC purposes - meaning it will all run in the same order even if we remove the piping.

In the case of debug_op_read, it would call an Action hook in C#, which would then pipe messages in the same order as calling it straight from C# would

meepen commented 8 years ago

Ah, I stand corrected. It would not be the same in the case of callbacks. Maybe we could write some async callbacks instead inside that Action hook?

meepen commented 8 years ago

It doesn't seem like ReadHook is used anywhere - maybe we could worry about the design we are going to use for that in the future?

meepen commented 8 years ago

Sorry for the spam, but I am mistaken - apparently I can't do a good grep + read anymore :-(

zeromus commented 8 years ago

ReadHook is used by bsnes! That's where debug_op_read ends up, by a very tortuous path. Async is NOT acceptable. That's like saying your debug breakpoints should be asynchronous and allow code to continue executing in the meantime.

But don't forget... I retract my claim that you must necessarily have GC bugs because, as I said, the problem may have evaporated due to all the callbacks getting marshaled to the main "control" thread, thus rendering the marshaling/IPC on the c# side truly redundant. I hesitate to say you were right about that, because you were only accidentally right about it, but right you are (probably)

meepen commented 8 years ago

Ah sorry, I meant to say sync (no clue how I got that mixed up while typing).

How and when does debug_op_read get called?

I can't get it to get called from the debugger or from lua.

vadosnaprimer commented 8 years ago

Tracer and Debugger tools, I guess. This thing actually decodes the instruction into a mnemonic/operands string, by the sound of it.

zeromus commented 8 years ago

It's called when the cpu reads an opcode. Can't remember if it's used for the execute breakpoints or not (probably). But the tracer callback is triggered from the exact same point deep inside the cpu.

meepen commented 8 years ago

Alright, since I cannot figure out how to get that to run (the debugger UI confuses me :-( ), I am going to fork this and make a pull request for you guys to test yourselves and see if you can find any bugs (which I will then look at how to fix). Is that alright?

zeromus commented 8 years ago

The debugger is junk. Most people use this feature through Lua. I look forward to your PR, but I may be picky about this. Once I decide this is all safe, I may be inclined to purge and rewrite it all myself. But if you did a good job, I guess I won't have to.

Velkon commented 8 years ago

a

vadosnaprimer commented 8 years ago

@meepdarknessmeep Did I not say "Tracer" enough times?

meepen commented 8 years ago

@vadosnaprimer Sorry! I meant the Debugger and the Tracer. Neither did anything in my version or the version you guys have on GitHub.

vadosnaprimer commented 8 years ago

How are you checking? If you want a dll breakpoint to actually break the EmuHawk execution, you need to add _asm int 3 to the core code, otherwise EmuHawk will never know when the core enters the tracer callback.

meepen commented 8 years ago

All future conversations should be at #692