Hidendra / LWC

Lightweight Protection for Chests, Furnaces, Dispensers, and more for the Minecraft server mod Bukkit

LWC often starts lagging server severely with InventoryMoveItemEvent after 12-18 hours running #618

Open OriginalMadman opened 11 years ago

OriginalMadman commented 11 years ago

Symptom: in the evenings, after the server has been running all day (auto-restart in the morning), TPS drops to 5, recovers, then drops again. LWC is eating far more resources than it should (we've had problems for months).

It seems to have no correlation with whether we have 5 or 30 players online.

Here are the top 10 timings from the first 18 hours of operation. LWC is clearly on top, by a factor of 8-9:

./timing_checker.py timings77.txt --sort time
4651972639562 8175 568996023 LWC v4.3.2 (b865-git-MANUAL) (April 09, 2013) InventoryMoveItemEvent
556714619840 48385 11505762 ChunkSpawnerLimiter v2.10.0 CreatureSpawnEvent
90657031174 5484 16530674 CraftBook v2123-024220d BlockRedstoneEvent
74279020584 165 447521973 ScrollingMenuSign v2.1.1 BlockPhysicsEvent
66309799132 148 447521973 ManySmallTweaks v1.0.13 BlockPhysicsEvent
66107252843 5745 11505762 SimpleSpleef v4.2 CreatureSpawnEvent
56962560681 13800 4127523 WorldGuard v5.7.3 with custom flags plugin. PlayerMoveEvent
54193195517 121 447521973 WorldGuard v5.7.3 with custom flags plugin. BlockPhysicsEvent
52004746646 116 447521973 Nethrar v2.3.1-dev BlockPhysicsEvent
51207451850 114 447521973 CreeperHeal v6.3.2 BlockPhysicsEvent

When it really starts getting bad, even over short timing runs it really is bad (a factor of 70-80):

./timing_checker.py timings78.txt --sort time
83256746715 43120 1930809 LWC v4.3.2 (b865-git-MANUAL) (April 09, 2013) InventoryMoveItemEvent
1303425917 55771 23371 ChunkSpawnerLimiter v2.10.0 CreatureSpawnEvent
475691192 5019 94772 CraftBook v2123-024220d BlockRedstoneEvent
398162875 171 2316827 ScrollingMenuSign v2.1.1 BlockPhysicsEvent
353363548 152 2316827 ManySmallTweaks v1.0.13 BlockPhysicsEvent
283418055 122 2316827 WorldGuard v5.7.3 with custom flags plugin. BlockPhysicsEvent
279564200 120 2316827 Nethrar v2.3.1-dev BlockPhysicsEvent
279484904 120 2316827 CreeperHeal v6.3.2 BlockPhysicsEvent
264595771 13239 19986 WorldGuard v5.7.3 with custom flags plugin. PlayerMoveEvent
232948536 100 2316827 Multiverse-Portals v2.5-b684 BlockPhysicsEvent

And the last one before reloading the plugin has a factor of 100:

./timing_checker.py timings79.txt --sort time
78079348723 51322 1521347 LWC v4.3.2 (b865-git-MANUAL) (April 09, 2013) InventoryMoveItemEvent
764021117 60368 12656 ChunkSpawnerLimiter v2.10.0 CreatureSpawnEvent
531097559 5426 97868 CraftBook v2123-024220d BlockRedstoneEvent
465562108 167 2787339 ScrollingMenuSign v2.1.1 BlockPhysicsEvent
419781853 150 2787339 ManySmallTweaks v1.0.13 BlockPhysicsEvent
332453803 119 2787339 WorldGuard v5.7.3 with custom flags plugin. BlockPhysicsEvent
327041547 117 2787339 Nethrar v2.3.1-dev BlockPhysicsEvent
318693636 114 2787339 CreeperHeal v6.3.2 BlockPhysicsEvent
277382993 99 2787339 Multiverse-Portals v2.5-b684 BlockPhysicsEvent
276157413 99 2787339 Tables v0.6 BlockPhysicsEvent

Using pluginmanager to reload the plugin live works fine and doesn't lag the server, but LWC still uses twice the resources of the next plugin:

./timing_checker.py timings80.txt
17705736528 4133 4283001 LWC v4.4.0 (b867-git-MANUAL) (April 23, 2013) InventoryMoveItemEvent
7424073669 53223 139489 ChunkSpawnerLimiter v2.10.0 CreatureSpawnEvent
3592361845 5503 652794 CraftBook v2123-024220d BlockRedstoneEvent
2781908232 174 15984287 ScrollingMenuSign v2.1.1 BlockPhysicsEvent
2442242798 152 15984287 ManySmallTweaks v1.0.13 BlockPhysicsEvent
2085577417 1042788708 2 PermissionsEx v1.19.5 PluginEnableEvent
1941009274 121 15984287 WorldGuard v5.7.3 with custom flags plugin. BlockPhysicsEvent
1934979797 121 15984287 Nethrar v2.3.1-dev BlockPhysicsEvent
1911382156 119 15984287 CreeperHeal v6.3.2 BlockPhysicsEvent
1593552559 99 15984287 Tables v0.6 BlockPhysicsEvent

Note: earlier in the evening, with 30 players on, we had 20 TPS. For the 2nd and 3rd dumps there were fewer than 10 players online.

OriginalMadman commented 11 years ago

And yeah, we're on 1.5.2 beta 1, but the problem has been there for quite a while. And yeah, I upgraded on the fly for the last one... same behaviour whether upgraded or not.

ChrisLane commented 11 years ago

Same issue here, and I'd really appreciate a fix - we're losing players over this. Here's a link to the nolagg examine file showing InventoryMoveItemEvent taking a hell of a long time: http://wikisend.com/download/138946/2013_05_15-23_25_22.exam

ChrisLane commented 11 years ago

If you don't want to download the file, here's my timings analysed: http://aikar.co/timings.php?url=5669040

OriginalMadman commented 11 years ago

[screenshot: 2013-05-18_21 47 48]

These insane amounts of queries keep popping up when the lag is active (probably what is causing the lag). The server had 4-5 TPS for hours when I logged on (a pluginmanager reload of LWC fixed it).

Pretty please with sugar on top, please track down these problems in LWC that have been plaguing the community for too long! There are obvious bugs or errors here that have serious implications for performance. Thanks!

Hidendra commented 11 years ago

Your cache is full, so when LWC sees that all protections are cached it never queries the DB for normal protection selects; getting that many queries should be impossible. I double checked and yeah, I'm not seeing any queries at all in similar scenarios. The only thing that uses InventoryMoveItemEvent is hoppers.
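
For illustration, the lookup pattern is roughly this - a minimal sketch of the cache-first idea, not LWC's actual code (the cache key format, the simplified cancel rule and loadProtectionFromDatabase are placeholders):

import java.util.HashMap;
import java.util.Map;

import org.bukkit.block.Block;
import org.bukkit.block.BlockState;
import org.bukkit.event.EventHandler;
import org.bukkit.event.Listener;
import org.bukkit.event.inventory.InventoryMoveItemEvent;
import org.bukkit.inventory.InventoryHolder;

public class HopperProtectionListener implements Listener {

    // In-memory cache keyed by "world:x:y:z". Once every protection is in
    // here (a "full" cache), lookups never need to touch the database.
    private final Map<String, Object> protectionCache = new HashMap<String, Object>();

    @EventHandler(ignoreCancelled = true)
    public void onInventoryMoveItem(InventoryMoveItemEvent event) {
        // Only block-backed inventories (hoppers, chests, ...) are relevant.
        InventoryHolder source = event.getSource().getHolder();
        InventoryHolder dest = event.getDestination().getHolder();
        if (!(source instanceof BlockState) || !(dest instanceof BlockState)) {
            return;
        }
        Block from = ((BlockState) source).getBlock();
        Block to = ((BlockState) dest).getBlock();

        // Simplified rule: block pushes from unprotected blocks into protected ones.
        if (isProtected(to) && !isProtected(from)) {
            event.setCancelled(true);
        }
    }

    private boolean isProtected(Block block) {
        String key = block.getWorld().getName() + ":" + block.getX() + ":"
                + block.getY() + ":" + block.getZ();
        // Cache-first: only a miss falls through to the database. A "not
        // protected" result is cached too, otherwise unprotected hoppers
        // would hit the database on every single transfer.
        if (protectionCache.containsKey(key)) {
            return protectionCache.get(key) != null;
        }
        Object protection = loadProtectionFromDatabase(key); // placeholder for a real SELECT
        protectionCache.put(key, protection);
        return protection != null;
    }

    private Object loadProtectionFromDatabase(String key) {
        return null; // placeholder
    }
}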

When this happens, could you please run SHOW PROCESSLIST to see some of the queries LWC is running at the time? (It only maintains one connection to the database, so you may need to run it a few times.)

ChrisLane commented 11 years ago

SHOW PROCESSLIST being a MySQL command ;)

OriginalMadman commented 11 years ago

No actual DB calls happening. I think the high query count keeps happening because of a problem/error state in the plugin. It isn't hitting MySQL at the time (or MySQL would go bonkers, I think), just querying the cache so hard the server lags.

The cache is not full, just half full. It was tuned to 50k since I had just above 40k protections before cleanup. Reducing the number of protections really seems to have helped keep the error state from happening at all. It has only happened once AFAIK (the screenshot) since I deleted those protections (which I'd rather have kept - but getting kinda desperate here, not having a reliable server at all does that to you).

If/when it happens again, I will make sure to verify again in MySQL.

I see no other explanation than a code bug somewhere - perhaps latent and only triggered under certain conditions. It may even be something just outside LWC, but my guess is it's on the inside.

basicmark commented 11 years ago

I've done some black box testing on a test server, and if I create a hopper loop (22 hoppers in the loop) I can see that over time more and more requests get made per second (I realize the operations-per-second figure never gets reset, so I used a timer ;)). After leaving the server running for 30-40 minutes there were about 6500 operations per second.

Breaking and remaking the loop gave me 4000 operations per second. Having the same number of hoppers not in a loop also yields roughly the same figures.

With two hoppers, one feeding into another which then feeds into nothing, I get about 19 operations per second, which is still much higher than the 5 I would expect. A hopper feeding into a chest also yields the same figures.

So there are a couple of odd things here:

1) Why are so many accesses made? It almost looks like it is done every 2nd rather than every 8th tick.
2) There seems to be some feedback loop where more and more protection checks happen for the same number of move events.

Edit: in this testing the cache was not full (about 25% used), so once a protection was in the cache it would stay there.
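
Roughly how I counted: a throwaway listener plus a repeating task that logs and resets a counter every 20 ticks (a minimal sketch of my test approach, not LWC code - the plugin class is just a placeholder):

import java.util.concurrent.atomic.AtomicInteger;

import org.bukkit.event.EventHandler;
import org.bukkit.event.Listener;
import org.bukkit.event.inventory.InventoryMoveItemEvent;
import org.bukkit.plugin.java.JavaPlugin;

public class MoveEventCounter extends JavaPlugin implements Listener {

    private final AtomicInteger counter = new AtomicInteger();

    @Override
    public void onEnable() {
        getServer().getPluginManager().registerEvents(this, this);

        // Every 20 ticks (~1 second at 20 TPS) log how many events fired
        // and reset, instead of relying on LWC's never-resetting average.
        getServer().getScheduler().runTaskTimer(this, new Runnable() {
            public void run() {
                getLogger().info("InventoryMoveItemEvent/s: " + counter.getAndSet(0));
            }
        }, 20L, 20L);
    }

    @EventHandler
    public void onInventoryMoveItem(InventoryMoveItemEvent event) {
        counter.incrementAndGet();
    }
}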

dddeeefff commented 11 years ago

My testing on some hopper-caused lag showed me that unprotected hoppers (only about 20 of them) could cause lag that dropped tps from 20 to as low as 5. Once I privated all the hoppers, the lag went away. I've noticed this repeatedly.

basicmark commented 11 years ago

I'll test again but I don't remember seeing the same result in my testing.

Do you have auto-protect enabled, and thus had to unprotect them? I do, which might make a difference compared to the hoppers never having had protection.

How are the hoppers set up?

Do you see more and more protection requests made over time (/lwc admin report), and if so, is it the cache or the database they're hitting?

I'm hoping by getting more data we might be able to track down the issue :).

OriginalMadman commented 11 years ago

On my side, I have hopper protection turned off in config due to earlier lag reports (that might be it? need it on and private?).

Chests, furnaces, dispensers are autoprotected.

It usually looks like this, then increases. Huge amounts of queries - moving to insane later. This is from right now (server has been up 18-19 hours, only 7 players logged in - one of the admins probably reloaded LWC due to lag earlier so it shows 3h 35m).

00:47:27 [INFO] LWC Report
00:47:27 [INFO] Version: 4.4.0 (b867)
00:47:27 [INFO] Running time: 3 hours 35 minutes 10 seconds
00:47:27 [INFO] Players: 6/66
00:47:27 [INFO] Item entities: 5/1333
00:47:27 [INFO] Permissions API: VaultPermissions
00:47:27 [INFO] Currency API: VaultCurrency
00:47:27 [INFO]
00:47:27 [INFO] ==== Modules ====
00:47:27 [INFO] LWC v4.4.0 (b867-git-MANUAL) (April 23, 2013) -> 47 registered modules
00:47:27 [INFO]
00:47:27 [INFO] ==== Database ====
00:47:27 [INFO] Engine: MySQL
00:47:27 [INFO] Protections: 24,593
00:47:27 [INFO] Queries: 41,988,288 | 3252.38 / second
00:47:27 [INFO]
00:47:27 [INFO] ==== Cache ====
00:47:27 [INFO] Usage: 40.99% ( 24593/60000 ) [ Reads: 1,435,736,992 <=> 111211.23 r/s ] [ Writes: 50,161 <=> 3.89
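
(Sanity check on those rates: 3 hours 35 minutes 10 seconds is roughly 12,910 seconds, so 41,988,288 queries works out to the ~3,252/s shown and 1,435,736,992 cache reads to ~111,211 r/s - i.e. these are averages over the whole run since the reload, not just a momentary spike.)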

Note, this does not (yet) lag the server noticeably. But we are running the main thread at 80% CPU on a 3.4 GHz i7 2600 with only a few players on. As you can see here, LWC is EATING us alive in the timings, by several factors, for today (= every day):

./timing_checker.py timings82.txt --sort time
1534751927864 5713 268616353 LWC v4.4.0 (b867-git-MANUAL) (April 23, 2013) InventoryMoveItemEvent
470840390896 58775 8010818 ChunkSpawnerLimiter v2.10.0 CreatureSpawnEvent
91297208807 11396 8010818 SimpleSpleef v4.2 CreatureSpawnEvent
73047698144 231 316067683 ScrollingMenuSign v2.1.1 BlockPhysicsEvent
67943762552 8481 8010818 WorldGuard v5.7.4 with custom flags plugin. CreatureSpawnEvent
62332012083 197 316067683 ManySmallTweaks v1.0.13 BlockPhysicsEvent
53272138830 168 316067683 CreeperHeal v6.3.6 BlockPhysicsEvent
52641537945 27900 1886732 Towny v0.83.0.0 PlayerMoveEvent
48904489610 154 316067683 WorldGuard v5.7.4 with custom flags plugin. BlockPhysicsEvent
47800663232 4973 9610493 CraftBook v2153-90c8f3c BlockRedstoneEvent

There is nothing really happening in MySQL, it's all in the cache. SHOW PROCESSLIST shows:

lwc 127.0.0.1:58512 lwc Sleep 72 NULL

LWC has its own database - and none of the other 11 connections to other databases eat anything big either.

That clearing out of protections from 40k to 23k is kinda what is keeping us above water until the server reload in the mornings. The higher the number of protections, the worse I think it gets. Seriously considering two restarts per day just to stay afloat.

Here is the config for clarity. All shit that can be off is off (seemed to have helped a bit):

core:
    locale: en
    flushInterval: 300
    cacheSize: 60000
    precache: -1
    showNotices: true
    showMyNotices: true
    opIsLWCAdmin: true
    disableHistory: true
    townyBorders: false
    groupPrefix: group.
    updateNotifier: false
optional:
    onlyProtectWhenOwnerIsOnline: false
    onlyProtectWhenOwnerIsOffline: false
    exemptBlocks: []
    optOut: true
    blacklistedBlocks:

mobilephone2003 commented 10 years ago

Been dropping ticks thanks to this too, cache hits 100%, any ideas on a solution?

dddeeefff commented 10 years ago

Protecting hoppers "removed" this issue for me, though that of course isn't an ideal solution. I've no idea how you'd go about finding unprotected hoppers. In my case, I knew which hoppers were unprotected because they were all mine (thanks to WorldEdit).
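
If someone wants to hunt them down, something along these lines might work - an untested sketch that assumes LWC exposes LWC.getInstance().findProtection(Block) (the exact API may differ between versions), and it only covers currently loaded chunks:

import org.bukkit.Bukkit;
import org.bukkit.Chunk;
import org.bukkit.Material;
import org.bukkit.World;
import org.bukkit.block.Block;
import org.bukkit.block.BlockState;

import com.griefcraft.lwc.LWC;

public class UnprotectedHopperScanner {

    // Logs every hopper in currently loaded chunks that has no LWC protection.
    public static void scan() {
        LWC lwc = LWC.getInstance(); // assumes this accessor exists in your LWC build
        for (World world : Bukkit.getWorlds()) {
            for (Chunk chunk : world.getLoadedChunks()) {
                for (BlockState state : chunk.getTileEntities()) {
                    Block block = state.getBlock();
                    if (block.getType() != Material.HOPPER) {
                        continue;
                    }
                    if (lwc.findProtection(block) == null) {
                        Bukkit.getLogger().info("Unprotected hopper at " + world.getName()
                                + " " + block.getX() + "," + block.getY() + "," + block.getZ());
                    }
                }
            }
        }
    }
}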

I can't comment on whether this issue still affects us. We have a different server setup now, one with far fewer I/O issues.

mobilephone2003 commented 10 years ago

We've removed all hoppers and we're smooth, but the cache is still super high

http://gyazo.com/718c1c8b1aa05b57edb46fa46bb2d5a0

Timings when hoppers were on:

http://gyazo.com/7229d61540167dd308a275daa194aacc

This is on a fairly beastly dedicated setup with SSDs

dddeeefff commented 10 years ago

Off-topic, but you can increase the cache size. I don't know how well those fancy SSDs of yours handle it, so it might be worth doing if you've got the spare RAM.

mobilephone2003 commented 10 years ago

Thanks, I've doubled it, will see how it pans out