TASEmulators / BizHawk

BizHawk is a multi-system emulator written in C#. BizHawk provides nice features for casual gamers such as full screen, and joypad support in addition to full rerecording and debugging tools for all system cores.
http://tasvideos.org/BizHawk.html
Other
2.2k stars 385 forks source link

[Lua] Memory Corruption on 1.13.0 #867

Closed Isotarge closed 7 years ago

Isotarge commented 7 years ago

Just downloaded BizHawk 1.13.0 and tried running ScriptHawk with it. It runs fine for a couple of seconds and then throws this exception in the Lua console:

LuaInterface.LuaScriptException: [string "main"]:1370: attempt to index a nil value

The line referenced is an emu.yield() call inside a while true loop at the bottom of ScriptHawk.lua used to refresh the ScriptHawk OSD when the emulator is paused.

Running the TAS safe version of ScriptHawk completely locks up the emulator and requires a force-quit.

This happens with N64, SMS, and GBC modules. Both versions of ScriptHawk worked fine in 1.12.2.

I've looked over the changelog for 1.13.0 and didn't see any Lua API changes that would break ScriptHawk so I'm fairly sure it's something broken on BizHawk's end. Happy to change something on ScriptHawk's end though if I've done something I shouldn't have.

Cheers, Isotarge.

zeromus commented 7 years ago

This script can repro the issue: while true do input.getmouse(); emu.yield(); end The cause is memory corruption at some point. You can change getmouse to this to make it crash immediately due to stressing someone's GC harder

public int GetMouse() {
  var buttons = Lua.NewTable();
  for(int i=0;i<10000;i++) buttons[i.ToString()] = 0;
  return 0;
}

Our lua integration is garbage.

8bc067cbbe7bb4b1a46008153232881a107ac4eb seems to cause this.

Fuck this, lua is broken. Don't use lua. Give up. I'll figure this out some day after I cool down. Revert that commit if you'd rather have lua whose brokenness you havent located yet, and crashes due to memory leaks, instead of luas that crash after 2 seconds.

PikalaxALT commented 7 years ago

Not sure if related, but emu.frameadvance() and emu.yield() return a .NET exception on the CGB core when called inside an event callback:

error running function attached by the event OnMemoryExecute
Error message: A .NET exception occured in user-code
zeromus commented 7 years ago

I've begun testing NLua+KopiLua in place of LuaInterface+Lua. Please evaluate.

vadosnaprimer commented 7 years ago

It doesn't seem to like dropping optional parameters (tried with gui.drawBox and gui.pixelText).

text(1, 1, emu.framecount(), "white", "black") won't work, it wants tostring(emu.framecount()) for text output.

Also right now lua is 4 times slower than with LuaInterface. My script that had heavy calculations and was sped up by caching them, runs at 40fps before https://github.com/TASVideos/BizHawk/commit/8bc067cbbe7bb4b1a46008153232881a107ac4eb, and now it runs at 10.

Isotarge commented 7 years ago

The good news:

The bad news:

I've committed some small changes to ScriptHawk to get the DK64 module booting with the current dev build for testing

zeromus commented 7 years ago

I wrestled with it a bit more and I think I solved all the issues youve pointed out, except for the performance, which I don't promise we can ever fix. It is not appropriate to have an unmanaged scripting environment running with the potential of stomping all over our deterministic cores, is it? That's just life. (but: please test again, maybe i committed a debug dll last time)

At the present I haven't got any clues about why the old luainterface+lua based code was crashing. I was hoping I could see nlua+kopilua crashing in a more comprehensible way and get clues, but it hasn't worked out that way.

The only tiny clue I have is maybe finalizing lua objects was causing the interpreter to get touched (from another thread concurrently with main thread)... but I had previously added code to make the finalizer put work in a queue to get done in the main thread, so.... well, if we investigate any more, lets look extra hard for that.

vadosnaprimer commented 7 years ago

Optional arguments and strings are fixed, but speed is the same.

zeromus commented 7 years ago

I managed to get nlua+keralua+lua running, but it mysteriously crashes after some time, same as the old luainterface+lua.

While watching this happen i remember I decided maybe there were bugs in luainterface (and therefore in nlua) which made it think it had a reference to something it didnt, such that lua's GC is cleaning it up without the bridge part knowing. Because now I remember getting angry at luainterface and thinking "geeze I should just rewrite this myself entirely". So why doesn't it happen with nlua+kopilua?? I dont know. Maybe because the dead objects still exist in the .net heap, whereas in unmanaged lua, dead objects are pointing at the live stack, where other things may be parked. So it's possible (likely, even) nlua+kopilua is just avoiding some kind of bug (in nlua).

Could we add diagnostics to kopilua to detect when a supposedly lua-erased object is touched from nlua (because it thinks it's still alive?)

But for now, my best effort is a slow lua that doesnt crash.

You guys are on your own if you want faster luas that crash or leak memory -- you need only revert my commits over the past year or so

vadosnaprimer commented 7 years ago

I mean what was the actual problem https://github.com/TASVideos/BizHawk/commit/8bc067cbbe7bb4b1a46008153232881a107ac4eb was trying to fix (as in, how do I reproduce it)? I don't even remember having bugs with reloading scripts I created (and I did tons of reloading).

zeromus commented 7 years ago

You didnt do a few dozen, then. This particular case was 100% rock solid reproducible at about 45 restarts, due to lua's default stack getting blown. For ANY script.

Isotarge commented 7 years ago

Strings are back to normal so I reverted the ScriptHawk commit from yesterday. No sign of a memory leak anymore too.

The performance issue is the dealbreaker for me here, several of my users are on weaker machines and even with the much faster LuaInterface + Lua builds ScriptHawk was only just usable for them. With the current state of performance I don’t think I can get my work done on this new backend.

I appreciate the hard work you've put in to implementing the new backend and it certainly was an interesting and worthwhile experiment, but if nothing can be done to improve the performance I’d prefer reverting back to before https://github.com/TASVideos/BizHawk/commit/8bc067cbbe7bb4b1a46008153232881a107ac4eb until the “few dozen reloads” bug can be fixed without crashes, hardlocks, or crippled performance.

If you don’t want to revert I can understand, and in that case I will distribute a custom build with the old backend until performance improves, and continue recommending 1.12.2 if my users don’t want to touch a custom build.

vadosnaprimer commented 7 years ago

Pretty much what Isotarge said. Only basic scripts run at full speed, anything more complex can't.

You didnt do a few dozen, then. This particular case was 100% rock solid reproducible at about 45 restarts, due to lua's default stack getting blown. For ANY script.

I took the script I linked above and this build, and reloaded it 500 times in one go. With rewind disabled, in 10k frames it took some 25 MB of RAM, about as much as it takes without reloading. Several other approaches with about 500 reloads total, and no crash.

zeromus commented 7 years ago

https://pastebin.com/tMUShQV9 @projectrevotpp - 20 or 30 times, me - about 20 times me later - about 45 times exactly every time for a very predictable reason of default lua stack size (immediately after this) commit 8bc067cbbe7bb4b1a46008153232881a107ac4eb and crashes stop

The outcome of this is not going to be "well nobody can reproduce the crash, so let's revert that commit".

If you people can't stand the performance, the outcome is most likely going to be replacing lua with javascript or something managed

But there's still a chance we can debug the problem in the most latest luainterface+lua version with kopilua's help

Another possibility in theory is rewriting luainterface/nlua. But I'm not going to do that. if I go to that much work, it'll be to end up with a secure, managed scripting language.

zeromus commented 7 years ago

update: check latest build, the dlls were unoptimized. speed may be repaired.

vadosnaprimer commented 7 years ago

After a test I get 25% of luainterface speed is now 50%.

Isotarge commented 7 years ago

There's some speed boost in latest master, 45 -> 65 on my machine.

I'm keen on the idea discussed in IRC: Giving the user the choice between backends, the default should be slow + reliable but having the option available to swap back to before https://github.com/TASVideos/BizHawk/commit/8bc067cbbe7bb4b1a46008153232881a107ac4eb with an emulator reboot solves every one of my concerns.

vadosnaprimer commented 7 years ago

Recent commits fix these bugs: If you need speed and same experience as with 1.12.x, pick LuaInterface in Customize dialog. If you want no memleaks or memcorruption, at the cost of lower speed, pick NLua there.

Isotarge commented 7 years ago

Thanks for all of your hard work on this issue

vadosnaprimer commented 7 years ago

And with the current master I can't even get luainterface to crash to memleaks with the method zeromus provided. This has probably been fully fixed finally!