MightyPirates / OpenComputers

Home of the OpenComputers mod for Minecraft.
https://oc.cil.li

SIGSEGV (0xb) at pc=0x00000000000099a6 with native lua interpreter on server (linux) #436

Closed phoxmeh closed 10 years ago

phoxmeh commented 10 years ago

I can reproduce this a lot, though it's quite inconsistent. I'm on Linux running the latest Java version (I also tried previous versions back to 7u55) and I get this segfault on the server I run. It seems to happen mainly when I'm running another Minecraft server and/or client on the same machine. As expected, everything works when forcing LuaJ, but the native interpreter crashes inconsistently: it works for a few days, then crashes constantly, then stops crashing and works again. I'm running it with the other mods from the CrackPack on the ATLauncher. It always reports "SIGSEGV (0xb) at pc=0x00000000000099a6".

Here is a link to the hs_err I got when trying to get it working last time: http://pastie.org/private/w3mgfabfkivvqzcab8bgcg I've tried both OpenJDK and Oracle Java, and both have the same issue.

fnuecke commented 10 years ago

Thanks for the report.

Does this have the same stacktrace whenever it happens (i.e. the last Java frame being com.naef.jnlua.LuaState.lua_isthread) or does that vary?

Also, could you check the used native lib is the 'right' one (...-native.64.so from the looks of it)? The code to determine which one to use got so bloated I decided to just have it try each one until one works, but it may be that causes unexpected side effects... have you seen this in an OC version pre 1.3.0 / build 505?

fnuecke commented 10 years ago

Oh, also, to narrow down the source of the issue a bit, could you try to reproduce it after changing the following config options (ideally one by one, to narrow it down even further, but feel free to start with enabling all to see if any of them do anything) - disableUserdata, disablePersistence, disableMemoryLimit. As the names indicate (and the comments in the config), these lead to reduced functionality, but it'd help me a lot to get an idea of where to start looking. Thanks!
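The "one by one, then all" procedure can be written down as a small run plan. This is only a sketch: the three option names come from the comment above, while the extra baseline run and the `bisect_runs` helper are assumptions added for illustration. It does not touch the OpenComputers config file; each run's settings still have to be applied by hand.

```python
# Option names as quoted in the comment above.
FLAGS = ["disableUserdata", "disablePersistence", "disableMemoryLimit"]


def bisect_runs(flags):
    """Yield one settings dict per test run.

    Order: a baseline with nothing enabled (an assumption, added so there is
    a reference run), then each flag on its own, then all flags together.
    """
    yield {f: False for f in flags}
    for flag in flags:
        yield {f: (f == flag) for f in flags}
    yield {f: True for f in flags}


for number, settings in enumerate(bisect_runs(FLAGS), start=1):
    enabled = [name for name, on in settings.items() if on] or ["(none)"]
    print(f"run {number}: enable {', '.join(enabled)}")
```

Noting for each run whether the server crashed shows whether a single option, or only the combination, makes a difference.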

XDjackieXD commented 10 years ago

I have the same problem running a dedicated server on Ubuntu 64-bit (most of the time, computers crash the server with a segfault when turned on or loaded on server start). Enabling the LuaJ fallback solves it. I will try these settings in a minute.

Update: Disabling persistence seems to solve the problem... (after several restarts of the computers and the Minecraft server it did not crash a single time)

phoxmeh commented 10 years ago

It looks like a lot of them end at com.naef.jnlua.LuaState.lua_newstate. I've tried the disableMemoryLimit option but not the other two. The library being loaded seems to be the correct one. This has been an issue since 1.3 (although I didn't have any versions before that on a server).

@XDjackieXD And persistence is one of my favourite features too! XD

Also disabling the persistence has not helped me. I've tried disabling userdata and the memory limit (and all at once) and I get the same result.

XDjackieXD commented 10 years ago

@phoxmeh yeah, persistence is a really cool feature...

@fnuecke with persistence enabled, when loading a world with a computer in the on state, the error happens at "com.naef.jnlua.LuaState.lua_newstate(IJ)V+0" (according to the hs_err_pidxxxx.log)
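To check whether the crash always ends in the same Java frame, the signal line and the first entry of the "Java frames" section can be pulled out of each hs_err log and tallied. The following is a minimal sketch, assuming the usual HotSpot log layout. The `SAMPLE` text is an abbreviated, made-up excerpt: only the pc value and the frame name are taken from this thread, and the pid and tid are placeholders.

```python
import re
from collections import Counter

# Illustrative excerpt only, NOT a real log from this issue.
# The pc and the frame name are quoted in the thread; pid/tid are placeholders.
SAMPLE = """\
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00000000000099a6, pid=1234, tid=5678
#
Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j  com.naef.jnlua.LuaState.lua_newstate(IJ)V+0
"""

SIGNAL_RE = re.compile(
    r"^#[ \t]+(SIG[A-Z]+) \(0x[0-9a-f]+\) at pc=(0x[0-9a-f]+)", re.MULTILINE
)


def summarize_hs_err(text):
    """Return (signal, pc, top_java_frame); any field may be None if absent."""
    signal = pc = frame = None
    match = SIGNAL_RE.search(text)
    if match:
        signal, pc = match.group(1), match.group(2)
    in_frames = False
    for line in text.splitlines():
        if line.startswith("Java frames:"):
            in_frames = True
            continue
        if in_frames:
            if not line.strip():
                break  # a blank line ends the section
            # Frame lines start with 'j' (interpreted) or 'J' (compiled).
            if line[:1] in ("j", "J") and line[1:2].isspace():
                frame = line.split(None, 1)[1].strip()
                break
    return signal, pc, frame


def tally_top_frames(logs):
    """Count how often each top Java frame appears across several logs."""
    return Counter(summarize_hs_err(text)[2] for text in logs)


print(summarize_hs_err(SAMPLE))
```

Feeding the text of several hs_err files to `tally_top_frames` shows at a glance whether the last Java frame varies between crashes.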

fnuecke commented 10 years ago

If this is reproducible for a certain world, would it be possible for you to send me either the <savedir>/opencomputers/state folder or the whole world? Maybe I can reproduce it with that and get some more info out of the crash. Thanks!

phoxmeh commented 10 years ago

Mine still crashes unless I force it to use LuaJ :/ which is strange since it was working perfectly fine with the native library for some time before I went to reboot the server.

I don't have any data under any of the .../opencomputers/state sub-directories. My current world is about 900+MB uncompressed. I could compress it to an archive you prefer (unless you don't mind a tar.7z archive since that's how I backup) and send you a copy.

fnuecke commented 10 years ago

> I don't have any data under any of the .../opencomputers/state sub-directories.

That's interesting, actually, since that points to something being broken in JNLua, not in Eris (which might still be my fault, ofc [/disclaimer]). The world should be pretty irrelevant then, but thanks for offering! I'll try rebooting my test server a bunch of times, will see if something pops up.

Otherwise I may have to generate some custom builds so you can test with those. We'll see. Thanks for your patience!

XDjackieXD commented 10 years ago

I will test it tomorrow using a clean installation of OpenComputers on Minecraft 1.6.4 (I am currently testing using the latest experimental snapshot of the Yogscast pack).

And btw: In my state folder there is an empty folder called "0" but nothing else...

Update: Can't reproduce it using a fresh install of OC 1.3.2.... (Also I don't get the crash in my old world using the Yogscast pack anymore.... strange....)

XDjackieXD commented 10 years ago

Karma hit me... :D The second I updated my previous comment the error hit me again (Yogscast Pack OC 1.3.1.516)... Disabling persistence fixed it again.

http://pastebin.com/Drg60xtG

fnuecke commented 10 years ago

Thanks for that log. Though the persistence setting shouldn't have any influence at that point, anything's possible with segfaults :-P While I'm kind of expecting that to be by chance, I'll see what I can find. I haven't been able to reproduce it, yet, though, so it's kinda purely going through whatever I can think of mentally right now... I'll try with the full pack when I have the time, just in case.

XDjackieXD commented 10 years ago

I cannot reproduce it using a clean install... Idk why, but it may be related to something in this pack. But disabling persistence definitely fixes the crash (reproduced it several times using the latest Yogscast pack).

fnuecke commented 10 years ago

All right. In the worst case it's something triggered by the garbage collector running more frequently because of the higher memory load from the pack... that would be fun to debug...

phoxmeh commented 10 years ago

I'm using The Crack Pack, not sure what mods are shared between it and the Yogscast pack. But I do know it's very inconsistent when it decides to crash for me, albeit now it crashes always. I did find that when someone started spamming world anchors around it was crashing more often (I promptly disabled them). Also it seems to crash more reliably when I'm running both the client and server at the same time. Right now it just seems to always crash when using the native lib, if the computers are on when using the native lib it will crash, just usually quicker or immediately when I'm running my client too. Haven't tried it in a day so it might decide to work for a couple days cause that's what it seems to do for me.

fnuecke commented 10 years ago

Eh, it's probably less the packs but their effects on the runtime (having tons of classes loaded, using lots of memory). That's my guess, anyway. Which Linux distros are you running, by the way?

phoxmeh commented 10 years ago

I'm running an arch64 install.

XDjackieXD commented 10 years ago

I'm running Ubuntu 14.04 64bit desktop.

In my case the crash occurs about 60% of the time when I turn on a computer, and every time a chunk with a turned-on computer gets loaded (until I disable persistence).

Kilobyte22 commented 10 years ago

Here's a small suggestion: enable core dumps by running ulimit -c unlimited in the terminal, as the log suggests. The next time it crashes you will get a file with "core" in its name in your ~/.minecraft (for servers, in the server dir). Send that file to @fnuecke. It contains debug information he can use to track down the bug. He will probably send you a debug build with the debug symbols left in shortly, so wait for that. We already talked about that on IRC.
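If the server is started from a script rather than an interactive shell, the same limit can be raised programmatically before launching it. A minimal sketch using Python's standard `resource` module (Unix only) follows; the `launch_server` helper and the command passed to it are placeholders, not anything from OpenComputers or this thread.

```python
import resource
import subprocess


def enable_core_dumps():
    """Raise the soft core-file size limit to the hard limit.

    This matches `ulimit -c unlimited` only when the hard limit itself is
    unlimited; an unprivileged process cannot go above the hard limit.
    Returns the (soft, hard) pair in effect afterwards.
    """
    _soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_CORE)


def launch_server(command):
    """Start the server as a child process; children inherit the raised limit.

    `command` is a placeholder argument list, e.g. whatever you normally use
    to start the Minecraft server.
    """
    enable_core_dumps()
    return subprocess.Popen(command)


print(enable_core_dumps())
```

On Linux, where the core file ends up, and whether a crash handler intercepts it first, depends on the system's core_pattern setting, so the dump may not land in the server directory on every distribution.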

XDjackieXD commented 10 years ago

https://mega.co.nz/#!OFcmASKY!XZUvz1j7jeVn8Z-p700LYMrqAwVNG51TDpDCrrsMfeY 275MB of compressed coredump goodness :D

phoxmeh commented 10 years ago

So my core dump is 2.9GB.... this might take a while XD edit: so it compressed better than i thought so hopefully i should get it uploaded... wish my upload speed was better

fnuecke commented 10 years ago

Awesome, thanks! I'll see what I can glean from that :-)

phoxmeh commented 10 years ago

https://copy.com/idJj28NApPvl sha1sum: 3b4864c45f01b3d11708b901aa4fcd884ce412dd

there is mine, hope that works XD
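The posted checksum lets the recipient confirm that a multi-gigabyte download arrived intact. Here is a small sketch of the same check that `sha1sum` performs, reading the file in chunks so a large core dump does not have to fit in memory. The function names are made up for illustration.

```python
import hashlib


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-1 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_hex):
    """Compare a file's SHA-1 against a posted hex string (case-insensitive)."""
    return sha1_of_file(path) == expected_hex.strip().lower()
```

Calling `matches_checksum` with the downloaded archive's path and the hex string from the comment returns True only if the two digests agree.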

fnuecke commented 10 years ago

Thanks! I'll probably need to set up a VM with arch to make use of @phoxmeh's coredumps, but at least I have the Ubuntu VM at a point where everything "environmental" fits the coredump, from what I can see. So: slow progress, but progress.

I've been running the Yogspack on the Ubuntu VM for ~12h now, with one computer being stopped and then started again every ten seconds, but no crash yet, so I'm afraid I'll have to rely on you for helping me solve this for a little longer.

I'll have to have you both use this custom built version with debug libs to get anything useful out of the dumps, though. Please switch out the OC JAR in the packs with this one and get me a coredump, that should tell me a little bit more then! This was built on the Ubuntu VM, so I'd primarily need @XDjackieXD to get me a coredump with this. @phoxmeh you could at least verify it still happens with that build, you can probably save yourself the time and bandwidth of uploading a coredump until I get an arch VM up and running :-P

Obligatory warning: the custom build is based on the dev branch, so make sure you have a backup of your world, just in case.

Thanks again for your patience and cooperation :-)

XDjackieXD commented 10 years ago

https://mega.co.nz/#!OVkVwRxZ!q_phLqlZ7pc__oPCtNIYTF6nAF3yaQTMaaErT5_bbao

Here you have the coredump and error log (which gives a little more detail this time :) ). At least the crash is reproducible across different versions of OC...

phoxmeh commented 10 years ago

I did it anyways :P https://copy.com/4IruU8pIGhgH sha1sum: d67280f8e2403b41856c2a544bec60af94ff63b1

XDjackieXD commented 10 years ago

No idea what I changed in my setup (could a kernel update have something to do with it?) but now the crash can only be prevented by enabling the LuaJ fallback...

fnuecke commented 10 years ago

Hmm, maybe? I had a look at the debug dump, and I'm afraid it's not that helpful :-( The crash itself, from what I can tell, happens inside some magic code that takes care of thread-local variable assignment (it happens on this line, which is really just this; all stack frames after that line are just question marks, even in the debug build).

The main hindrance for me is that I still can't reproduce it, so I can't try to catch the library screwing up red-handed. But I'll keep digging and let you know when I have some new thing to test out.

Actually, one thing you might try, since it seems to be related to thread-local storage at least in some capacity, is to set the number of worker threads to one (from the default four). Yes, that's just a wild guess...

Wuerfel21 commented 10 years ago

Speaking of many classes, could raising PermGen help?

fnuecke commented 10 years ago

Very unlikely.

XDjackieXD commented 10 years ago

Switching it to 1 worker thread seems to solve the problem after several server restarts and half an hour playing...

fnuecke commented 10 years ago

@XDjackieXD to clarify, it didn't crash at all across these restarts? Or it didn't crash after the most recent restart? @phoxmeh could you see if this also applies in your case (i.e. setting worker thread count to 1 avoids the crash)?

If this does indeed help, it means a bit of work, but at least I'd have something to try out.

XDjackieXD commented 10 years ago

It doesn't crash at all after setting it to 1 (I played for about 1h and made a few restarts with 2 computers turned on, and I switched them on and off, and nothing crashed).

phoxmeh commented 10 years ago

Yep, absolutely no crash so far. I'll keep it going for a while, some people should get on a bit more later and if it doesn't crash I'll let ya know but so far it's working fine.

fnuecke commented 10 years ago

Fingers crossed. At least it's a non-catastrophic workaround for now. I'll dig some more through the JNLua code over the weekend and see if I can maybe get rid of the thread-local stuff altogether in a hope of solving this.

fnuecke commented 10 years ago

All right, got rid of the last use of thread-local variables, had it running over the night with no issues to make sure I didn't mess up something else. @XDjackieXD, @phoxmeh please give this version a try and let me know how it goes. Don't forget to up the threads to 4 again. Thanks!

XDjackieXD commented 10 years ago

I'm not at home for 2 weeks, so I won't be able to test until Sunday in 2 weeks. Also, I got a new PC on Friday, so I have a second computer with Ubuntu 64-bit to test with (and a lot faster than my laptop ;D ).

phoxmeh commented 10 years ago

Well it's not crashing now (after remembering to turn off the computers before starting up with the native lib and threads set to 4) but now it's telling me oc:native libraries not available with http://pastie.org/private/qnvbcggte7whxavltmpw in console

edit: ignore that... after a little tinkering it works fine...

Kilobyte22 commented 10 years ago

@XDjackieXD well, on a different machine it might behave differently. @fnuecke wasn't able to reproduce it on Ubuntu himself afaik

fnuecke commented 10 years ago

@phoxmeh hmmm. Could you post the full log / some more context around it, just in case?

phoxmeh commented 10 years ago

The main thing I did was remove the computers before restarting the server again. Once I did that it worked just fine. Nothing else was reported in the logs as far as I know, but I'll be more than glad to upload that Forge log if you need it.

fnuecke commented 10 years ago

Oh, OK. Hmm, keep an eye on it, then, please. If after another re-start the above happens again, let me know! (And I'll have another look over the changes I made to see if anything could cause that.)

phoxmeh commented 10 years ago

Well it seems to be failing still @_@ it crashed with this http://pastie.org/private/0k5ipygblrafwghwdkplq and this http://pastie.org/private/xfytvaxg3bbb8j9thyvtxw so far

fnuecke commented 10 years ago

Darn. Well, it was an expected possibility. If you can, would you mind getting me a coredump for this latest build? I'll finish setting up the arch VM in the evening, hopefully I'll be able to get anywhere with that.

Also, just to make sure, the debug settings mentioned above (disable*) still don't have any influence on whether it's crashing or not?

phoxmeh commented 10 years ago

OK, gonna try to test it out a bit more. Sorry, I've been quite busy at work and unable to really do any Minecraft stuff the past couple of days. Currently I'm running the dev version with nothing disabled and threads at 4; no crashes yet except for the library failing to load when I first changed it to the dev version and a small crash when I was shutting it down. So far nothing, gonna try some restarts and play around with it. I'll let ya know how it's running tomorrow.

phoxmeh commented 10 years ago

So I've been testing the server a bit and it had been stable with no crashes, but when I went to restart it, it started crashing again like the last time I had issues (running the dev version you gave me). I tried disabling things and setting threads to 1, but that all failed. Only after restarting with LuaJ forced, removing the computers and restarting twice (because the first time it doesn't load the native lib properly) does it seem to work with everything enabled and threads at 4. The computers are left on when I shut down the server initially, but it doesn't always crash, just usually after the server has been up a couple of days. During the uptime it's perfectly fine, but when I try to start it again it crashes with the same segfault as http://pastie.org/private/0k5ipygblrafwghwdkplq I'll try to reproduce it again and get a core dump (I always forget to enable it before I do anything @w@). Lemme know if there's anything else you want me to try to figure this out. I'll keep tinkering how I have been to see if I can get that core dump for you.

fnuecke commented 10 years ago

Thanks so much for taking the time to test all this! I'm afraid I didn't have the time to test as much over the weekend as I hoped I'd be able to, but someone brought up an interesting issue that might be related (garbage collection, which might explain why it only happens for existing / resumed computers). I'll investigate this as soon as possible, probably tomorrow.

Side note: if we can't find a robust solution/workaround for this in the next couple of days I'll probably still push out 1.3.3 with this as a pending known issue, due to all the other fixes that have accumulated...

fnuecke commented 10 years ago

@phoxmeh when you have the time, could you please give the latest dev build a try? I'm disabling the Lua GC while persisting / unpersisting now, since that reportedly helped with the other issue I mentioned, so that may help with this, but may just as well be unrelated.

phoxmeh commented 10 years ago

Sorry it's taking me so long to work on things, it's been quite the busy week. I got the latest build today, 601 I believe, and it's crashing regardless of settings unless I force LuaJ. I'd removed the computers before doing anything to make sure they were off and not saved in the world, so they couldn't cause any problems. But it's crashing the same as it was in the beginning and at the same address (0x0...99a6). That last dev build you sent me was working flawlessly until I had to do reboots. I'll keep trying to see if I can get it to launch the computers with the current build though. Just lemme know what else you need me to do.

here is the hs_err: http://pastie.org/private/a727keh23unsuwswxnxhpa

fnuecke commented 10 years ago

No problem, thanks for helping out with this! All right, so it's most likely not related to that other issue. I guess that's kind of good. But it doesn't really get us closer to a solution, either. Hmm. Which version of libc do you have installed? (ldd --version)
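Besides `ldd --version`, the C library version can also be read from a script, which is handy when collecting environment details from several machines. This is a sketch using Python's standard `platform.libc_ver()`. Note that it reports the libc the Python interpreter itself is linked against, which is normally but not necessarily the system one, and it returns empty strings where it cannot tell.

```python
import platform


def libc_version():
    """Return (library, version) for the libc Python is linked against.

    Falls back to "unknown" where platform.libc_ver() cannot determine it,
    for example on non-glibc systems.
    """
    lib, version = platform.libc_ver()
    return lib or "unknown", version or "unknown"


print("libc:", *libc_version())
```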

phoxmeh commented 10 years ago

GNU libc 2.19