KidsDontPlay / Storage-Network

12 stars 55 forks source link

Hard crash on login after server idle #55

Closed xannor closed 8 years ago

xannor commented 8 years ago

I get a hard crash occasionally on login when my server have been idle for an hour or so.

Crash Report

After a restart this does not occur though I do occasionally see server could not keep up messages after initial login.

KidsDontPlay commented 8 years ago

I really don't know what causes this error. But I guess it's not my fault

xannor commented 8 years ago

Since the stack flows through your code (and a whole bunch of recursive calls) and this did not ever occur until I started actively using storage networks, I would say this mod is the cause. My thoughts are either the recursion is too taxing on the stack space, or the code your calling into at the end of the recursion is either receiving bad data or being called in a bad sequence. Since I am not familiar with Minecraft innards I cannot guess what the code is doing after tile tickable or abstract filter tile, I cannot give many suggestions for what to fix there. I do think however, after looking at your code, that a state machine loop would be better than recursion for the wire up loop your init currently runs.

KidsDontPlay commented 8 years ago

You're right. It may be a problem with memory cause it's just an 'Error'. An 'Error' could be anything. So it's hard to find out what was going wrong. I guess I can remove the recursion.

xannor commented 8 years ago

I thought it was in the crash report, but the error is that a tick was taking too long and the watchdog killed the instance. unfortunately it seems that the crash log is not that obvious about that and the error message is only logged to the console as the log writer seems to get killed before it can commit it, the logs just end with no sign of shutdown or error. I've had this happen three times so far, and I posted the second crash report, I lost the first. After looking at the second crash report, I am beginning to think it is not memory but an ordering issue or something similar, so removing the recursion may not fix it, though that should be kept as an optimization goal. I think it is a chunk loading issue. I play on a private server running in my home. It is just me and my base is near spawn. I wonder if the code is trying to walk into unloaded, or lazy loaded chunks and calling into something that freezes because the chunk is not loaded properly? This always happens after a log/out login, never after a restart of the server and I can log I, but either the chunk I am in is missing or the chunks around me are, and then I get kicked out because of the instance halt.

While typing this I was loggin in and got a fourth crash and in both the third and fourth the recursion was not visible but I think I may have an ideal of the "vector".

Crash Report #2 Crash Report #3

It looks like enderio is sending a block update due to redstone state change to a block of yours and that propagation path through your mod is where the error is occurring. In my build I know exactly where that is as I have only once space that has enderio redstone conduit and your mod interacting. I have an item indiciator for set to emit when 800 oak logs are in the system to stop a tree farm, and the signal is passed via enderio redstone conduit. I removed it and did a logout, wait a few minutes, and log back in and I did not crash, though that may not have been enough time to unload chunks. I will play a little and then log out for a couple of hours and report back if I crash or not.

xannor commented 8 years ago

Ok, I am fairly certain that the Item indicator, ender io conduit pair is the culprit. I have done a couple of logins and log outs, afked, and been offline for a couple of hours and no crashes with the farm disabled and the block removed. I might try to put a vanilla redstone in between but I don't if I'll have time before I'm gone for a week.

mindforger commented 8 years ago

@MrRiegel care to explain how you prevent an infinite loop with the last call ath the innermost if ?

private void setAllMastersNull(World world, BlockPos pos) {
        ((IConnectable) world.getTileEntity(pos)).setMaster(null);
        for (BlockPos bl : Util.getSides(pos)) {
            if (world.getTileEntity(bl) instanceof IConnectable && world.getChunkFromBlockCoords(bl).isLoaded() && ((IConnectable) world.getTileEntity(bl)).getMaster() != null) {
                ((IConnectable) world.getTileEntity(bl)).setMaster(null);
                Util.updateTile(world, bl);
                setAllMastersNull(world, bl);
            }
        }
}

ah wait you set the Master to null now i get it, misread the condition as == null the whole time

KidsDontPlay commented 8 years ago

I check if the master is null and if not I make it null

KidsDontPlay commented 8 years ago

@xannor the author of wrote that the conduits may cause the crash. So we have to wait

KidsDontPlay commented 8 years ago

@xannor crazypants released a new version. Could you try this and give me feedback?

xannor commented 8 years ago

I got back and upgraded, logged in and redid the setup. I got a different crash this time. Crash Report

Though this was on server restart, and after a second restart everything seemed fine. However, the whole issue has been slightly rendered moot because it looks like Ender IO upgraded to the new forge inventory handlers, but progressive automation (my tree farm) has not so now they cannot interact so I cannot fully test.

KidsDontPlay commented 8 years ago

fixed in next version

xannor commented 8 years ago

Well first login this morning I got this, same issues as before chunk load fail and crashlog:

Crash Log

This time however the ItemIndicator was removed so I am not sure where the interaction that caused the failure was.

xannor commented 8 years ago

Ok, ignore the statement about the inventory handlers, noticed that the item conduit had a disconnect in it, not sure if it was from the crashes or troubleshooting so I put the system back in place and as soon as the farm disables I will try again so we can see if I get the same results with the same setup.

xannor commented 8 years ago

Looks like the problem was not fixed by the recent ender io update.

Crash Report

Server was idle for about 2 hours and when I logged in, there it was.

Also after restart it crashed, upon login, with the same ticking entity error you said will be fixed in the next release, Second restart and login worked.

KidsDontPlay commented 8 years ago

Mmh.. I don't know what to do now. So the problem isn't the indicator but it could be a problem with my blocks in general

xannor commented 8 years ago

At this point I feel it requires tracing to really see what is wrong. I can fairly consistently reproduce the error, but without knowing what is being called in that chain it would be very difficult to debug. From what I understand of the error, the problem is that the tick is taking to long, the console usually has an entry about it taking 64+ seconds. So, some operation in that chain is locking the task, either waiting on a lock or an event state, at least that what it feels like. It may even not be something you are directly calling, but perhaps your code is initiating something that is taking too long due to locks or events.

My best guess is that your Util.updateTile call is possibly calling into code for an unloaded or not fully loaded chunk and locking or waiting for the chunk to finish, but since it is a tick, the chunk is frozen until after the tick? I am not familiar with minecraft's internals so I am just guessing from observation.

xannor commented 8 years ago

If you want/can/are willing to build, I would run a dev version to see what the error is. I think if you log around this line: https://github.com/MrRiegel/Storage-Network/blob/1.9/src/main/java/mrriegel/storagenetwork/helper/Util.java#L159 With the xyz and entity info (if possible) and a log when returned (for timing), I can give you the relevent entries and we can sus out what is failing.

KidsDontPlay commented 8 years ago

ok, I added some logs and check if the chunk is loaded. you can clone and build it. But I'm not sure if it's the correct piece of code. Not every crashlog contains the 'updateTile()' method. However every log contains 'setConnections()' method

xannor commented 8 years ago

I dont do minecraft or java development, so I dont have a setup ready to build the jar. I can possibly build it on my server (it is a i5 ubuntu box) but I would need a list of steps on the prerequisites to prepare the build environment.

KidsDontPlay commented 8 years ago

https://drive.google.com/open?id=0Bxk8bebSTRk-OHRyMml3a0lHSkU

xannor commented 8 years ago

Ok, this was some interesting, unexpected, and possibly not incredibly useful results.

I installed the mod in the server restarted, and then discovered I had to install the exact same version to the client in order to connect (I was thinking I could just drop in on the server.) I then had to re-create the setup as I had changed the farm and storage and was just using a lever to disable it. When I added the itemindicator the logs blew up, in fact I could not configure it as I started getting kicked from the server. It was not a timeout or a crash I would just get disconnected. This would only occur if a redstone update happened. It first happened when I tried to turn off the lever and would happen every time I flipped it. I stopped everything, reverted back to the last release, and could log in.

Then I rebuilt the system and did some tests to make sure I could reproduce the problem, once I got to that point, I had a serious hard crash, I had to kill the java instance as my system was pegged. I then, re-installed the dev you provided, and reloaded, here is where it got interesting.

When I would log in I would get the pause, but then everything would load and the server would report being overloaded and skipping a ton of ticks. I did this a couple of times, including fully existing my client and relaunching it. Always the same results, no hard crash, just a delay then everything loading.

This unfortunately kinda puts us back to square 1 as other than the itemindicator, I cannot see any definite cause or bug. My only thoughts are that maybe you are being to aggressive with item updates, or that another method to walk your network checking for changes might be needed.

Log Files (6.5Mb zipped, 200Mb unzipped)

fml-server-2 is the first run with your dev version, fml-server-3 was run without that I reproduced the error in fml-server-latest was the last run with your dev version.

KidsDontPlay commented 8 years ago

ok... I have to reduce the updates. How many cables do you have in your network? the lags could be caused by the logging. (printing 3000 lines per seconds needs much CPU)

I will setup my own server to check how I can fix it. It's very laborious to work when I get the answer after 24h. Can you show me a screenshot of your setup?

xannor commented 8 years ago

I seem to be having trouble uploading screenshots but I did grab the chat that is generated when you click on the controller.

[19:22:23] [Client thread/INFO]: [CHAT] §3Connectables: 205 [19:22:23] [Client thread/INFO]: [CHAT] §b Link Cable: 176 [19:22:23] [Client thread/INFO]: [CHAT] §b Item Storage Cable: 13 [19:22:23] [Client thread/INFO]: [CHAT] §b Item Export Cable: 4 [19:22:23] [Client thread/INFO]: [CHAT] §b Fluid Storage Cable: 4 [19:22:23] [Client thread/INFO]: [CHAT] §b Fluid Export Cable: 3 [19:22:23] [Client thread/INFO]: [CHAT] §b Item Request Table: 2 [19:22:23] [Client thread/INFO]: [CHAT] §b Fluid Liquidator: 1 [19:22:23] [Client thread/INFO]: [CHAT] §b Fluid Request Table: 1 [19:22:23] [Client thread/INFO]: [CHAT] §b Item Import Cable: 1

and the step spans a 3x3 chunk grid and about 15 blocks on the Y. It is a very large network and I am probably exceeding your design.

xannor commented 8 years ago

Ok, I cannot upload screenshots, but I loaded your dev module and went single player and got some interesting results. I created a tiny network with a controller, a link cable out of each end, an item indicator at one end and a storage cable at the other attached to a chest. I set the indicator to emit when more than one item is in the chest, and I put two items in. I then rand a redstone line next to the cable, from the chest past the item connector and next to 1 peice of cable. When I powered the redstone, I checked the log, there were 99 entries which means your updateTile log code was hit 33 times.

Nothing the redstone is next to should be effected by a signal but the entire network is rescanning multiple times. and 33 updates for a 4 entity network is alot, no wonder my setup crashes and lags.

Screenshot of test rig

mindforger commented 8 years ago

Vanilla redstone does not simply toggle between powered and not powered, it drops level by lvl and spams redstone updates for each level, no idea how it behaves with switching it on

guessing by your picture with 2 cables in contact, 16levels X 2 cables + 1 event on switch on = 33 updates

fail ... 2 switch on updates + 15 on the far cable and 16 on the near cable

xannor commented 8 years ago

If the logging he was doing was logging received updates, you would be right, no indication, but he is logging sent updates and to what block, and one of the block that is getting updates is the item indicator block, which is more than two blocks away from the redstone that I changed state on, and the redstone next to the indicator should not be changing state since it is being set by the indicator (which is based on the inventory of the chest which is static the entire time). Also with that tiny setup there is a noticeable pause when the lever is switched off, but nothing unusual when it is switched on. In fact, as far as I can tell, they appears to only fire the updates when the redstone is turned off.

The point I was aiming for was that I was trying to find the smallest example of the underlying problem so MyRiegel has something to work with to debug the issue.

edit: it took me a bit to see what you were getting at, and you seem to be right as far as redstone goes, and it seems that when turned on redstone just "jumps" to the level but turned off it "fades".

KidsDontPlay commented 8 years ago

thanks for your comments. Yeah, I definitely need to decrease the updates. Every cable (and other network blocks) checks if the neighbor changed and refreshes the network. The problem is that I don't know what the new neighbor block is. So I need to refresh even it's just a redstone wire update.

xannor commented 8 years ago

Maybe you should talk to the guys who maintain tech reborn. They implement a cable network as well and I think they have a way to detect block placement and then have the network rebuild when a valid connection is added. Another possibility is to just have the cables set a flag on the master to rebuild the network and let the master be deterministic about how often it walks the network.

In the meantime I have been trying to isolate the cables from redstone to see if it helps. the last thing I have to try is to again remove the item indicator as that is currently the only thing within 1 block of redstone and I still crash on chunk load. I tried separating it using a repeater into a block and rf tools wireless transmitters/receivers but I still crash on chunk load. I'm going to remove it to see if the crashes stop.

mindforger commented 8 years ago

@MrRiegel can't you add an filter for what kind of changes you toggle or some sort of input-filtering?

as long as there are Updates that do not change the actual state of the network block you won't need to update the whole network, except for changes that trigger a part (Redstone going from 1 > 0 on some sort of RS senstive part) and then update the whole network?!

KidsDontPlay commented 8 years ago

next one https://drive.google.com/open?id=0Bxk8bebSTRk-UFJxZ3hwWUY4WWM

I removed some client syncing that wasn't necessary anyway. And the network will be updated if the neighbor block is a tile entity (inventory, fluidtank, etc) or air. It doesn't trigger an update if a redstone signal is applied.

xannor commented 8 years ago

Ok, I ran it, however the test on the server is no good because I think the logging it does keeps the thread alive and causes the server to skip ticks, which seemed to prevent the crash with the last dev build.

I did however test the single player instance and i notice no delay now on redstone change, and the log entries are much smaller, about 1/5th what they were before. Could you comment out the logging you added and give me another build? I feel this will work but I would like to see for sure.

KidsDontPlay commented 8 years ago

removed logs https://drive.google.com/open?id=0Bxk8bebSTRk-QkVmYXhDNl9kSFU

xannor commented 8 years ago

Ok, so far, so good. Logged in loggoed out waited a few and logged back in ok. I then reconnected the item indicator, logged out waited a few minutes and logged back in and no crash. I want to give it more time, like logged out around dinner time in a few hours, but this looks very promising.

update: went to the nether and when I came back it froze on chunkload like its done in the past and I thought it was going to crash, but after a few seconds (around 15-20ish) everything resumed, so no crash so far.

xannor commented 8 years ago

Had to take a break for a bit but just logged back in and all seems good. Still get the chunkload pause, but no crash.

mindforger commented 8 years ago

@xannor this sounds a bit like a combined issue, chunk loading fro a probably slow hard disk and maybe unrefined memory configuration resulting in a ginormous garbage collect

you could probably have a look at my command line, maybe it helps, but be aware you just trade off the issue, it may get laggy in other situations then, always try out everything https://github.com/mindforger/ImmersiveEngineering/issues/1

xannor commented 8 years ago

I agree it is a combined issue, in fact on @MrRiegel s' issue he submitted to EnderIO someone reported a similar problem, not involving @MrRiegel s' mod so there is an EnderIO/redstone issue as well.

As for the system i'm running, it is a dual core i5 with 16gb ram, a m.2 SSD (around 100G) running Ubuntu Xenail headless and not much else other than mineos to manage the instance, though it will check out your cmdline as the only options I am running is reserving 6G memory for the instance (usually 4 is enough but during this testing I had to bump it.)

I also wonder if some of it is 1.9.4 issues, I did a test on 1.10.2 and I did not crash, and had less lag spikes, and this was with the release version.

Either way I think we have managed to help @MrRiegel improve his mod.