anegostudios / VintageStory-Issues

Vintage Story's public issue tracker for reporting bugs, crashes and the like
46 stars 16 forks source link

Server side lag problems in Linux #578

Open Fluffkin opened 4 years ago

Fluffkin commented 4 years ago

Game Version: 1.13.2 Platform: Linux Modded: No

Description

Server fails to keep up with playing the game, giving multiple lag spikes. Game is basically unplayable. Even standing still I can see the game pauses as grass movement freezes periodically. Placing blocks becomes "Minecraft lag" problematic (server lags when placing, resumes showing 2 or 3 blocks places when only 1 intended). Standing still for a long time doesn't help much, any kind of subsequent movement / block breaking / placing can cause a lag spike again.

How to reproduce

  1. Start a new game in Linux
  2. Walk around, try and break things, wait for lag spikes to occur

Logs

6.9.2020 00:11:30 [Warning] Server overloaded. A tick took 697ms to complete.
6.9.2020 00:11:55 [Warning] Server overloaded. A tick took 751ms to complete.
6.9.2020 00:12:22 [Warning] Server overloaded. A tick took 600ms to complete.
6.9.2020 00:12:26 [Warning] Server overloaded. A tick took 810ms to complete.
6.9.2020 00:12:27 [Warning] Server overloaded. A tick took 655ms to complete.
6.9.2020 00:12:29 [Warning] Server overloaded. A tick took 1394ms to complete.
6.9.2020 00:12:53 [Warning] Server overloaded. A tick took 606ms to complete.
6.9.2020 00:13:16 [Warning] Server overloaded. A tick took 805ms to complete.

Are the only errors I can see in the "server-main" logs. No errors seen in any other logs at the time. Using mono 6.10.0.104 in Arch Linux (Artix actually), but have seen the same problem using Void. CPU is not overheating and not overloaded, memory isn't constrained.

tyronx commented 4 years ago

Thanks for the report, please provide us with additional information as described here: https://wiki.vintagestory.at/index.php?title=Debugging

Fluffkin commented 4 years ago

Thanks for the link to the debug page, didn't know that existed. Attached relevant bit of logs:

server-main.txt

Eloraam commented 4 years ago

I see similar long lag spikes on linux when it does an "off-thread save". I assume it's a contended lock or something.

I also get the same in single player, where it also causes major framerate loss, so I've taken to always using a dedicated server to at least keep the client running smoothly.

tyronx commented 4 years ago

Thanks for the feedback!

@Fluffkin could you also run .debug logticks 100 and send me the client-main.txt, as you seem to get client side freezes too. Also how is your CPU usage while the game runs?

@Eloraam that does not sound ideal - can I ask you what hardware are you running on? (HDD/SSD? CPU specs? Dedicated or integrated graphics?) Does a dedicated server entirely solve your framerate loss in that case?

Eloraam commented 4 years ago

Threadripper 2990WX, 128GB RAM, RTX 2080. Regular HDDs in RAID-6, but that should only matter if you're making sync calls - enough RAM in Linux means you only ever make a read call once.

The dedicated server completely solves the framerate loss. When the server goes out to lunch, things get a little glitchy for a second or two, but the render loop keeps running.

Fluffkin commented 4 years ago

@tyronx Attached client-main.txt from the same session (no errors) EDIT: Oops ... debug fail resending below

CPU usage is in single digit %, actually seems to drop when the lag occurs (looks like the server process is sleeping / waiting?).

Running a dedicated server (on the same machine) as Eloraam reports does not fix the issue for me. I still get frame rate drop outs, the game is still unplayable.

Fluffkin commented 4 years ago

@Eloraam Having to run the server as a dedicated process on a Threadripper with 128GB of RAM to achieve periodically "things get a little glitchy for a second or two" doesn't sound like a great result to me. It's unplayable here on Ryzen 5 3600X with 64GB RAM.

tyronx commented 4 years ago

That is strange, I do not experience such lag issues on a much slower machine on windows 10. (i5 6300 HQ, 16 GB Ram)

@Fluffkin I think you missed to enable the tick profiler with above command

tyronx commented 4 years ago

@copygirl hows the performance for you on Arch linux? And what mono version do you use?

copygirl commented 4 years ago

I haven't been playing in a while so I can only go by what I can test real quick in a new world: I notice micro-stutters (FPS varying wildly) and bigger spikes. I had a feeling things got smoother over time. My Mono version is 6.7.0.356.

I also ran .debug logticks 100 and /debug logticks 100: client-main.txt server-main.txt

Fluffkin commented 4 years ago

@tyronx Yup you're right, here's a short burst from client-main.txt with debugging actually enabled. client-main.txt

Unfortunately things don't get better with time. I've spent about 3 days in the VS world so far in 4 attempts to get it working well enough to play. After 3 in game days I'm usually in one area with a clay source, a shelter, basket / starting to get set up, so world generation in that area shouldn't be a problem, but I still get bad enough lag spikes that I just give up playing it. It's particularly annoying when trying to make clay items, feels like it takes twice as long because the game keeps glitching and missing where I thought I'd placed clay.

I've also tried running it with "mono -O=all Vintagestory.exe" (turns on all optimisations), and that doesn't help either.

Eloraam commented 4 years ago

I noticed things tend to get a little worse over time if I try to play in single player (plus one LAN player). It's always been a bit of a problem but this most recent version was nearly unplayable. Noticed that another machine on my network was extremely playable when I was hosting, so I switched to dedicated and things have been more or less okay since.

The server does still lag out from time to time, but it doesn't effect the client much. Does VS have an actual single-player mode or does it always run a server?

tyronx commented 4 years ago

Thanks for all the information.

The frame profiler results seem all over the place, judging by your logs. This could be indeed an indicator that there is a lock contention between threads or OS resources.

@copygirl and @Fluffkin: Are you on a HDD or SSD? @Fluffkin: Could you also show me a screenshot of the extended (via .edi command) debug screen (CTRL+F3) when its particularly laggy? @Eloraam The singleplayer mode runs indeed an internal server in a separate thread

copygirl commented 4 years ago

I believe my home directory is on the 2TB HDD I have.

Fluffkin commented 4 years ago

@tyronx Running on HDD here. I'm probably being dumb, but Ctrl+F3 doesn't change anything on screen at all for me. .edi does give me a message that extended debugging is enabled, but nothing to screenshot.

EDIT: Had to rebind the key, guess maybe my window manager has ctrl+F3 bound to something else. Anyway, couple of screen grabs, from starting a new game. The really big FPS jump is when it stalls... and I can't attach jpg or png to github entries .. hmmm

See: https://gyazo.com/28d72add44624bf63306f8a75a4aa95f and https://gyazo.com/74f49dd07d5330ca6a68c3ab48f3a99c

Eloraam commented 4 years ago

Oh dear. If the server can slow down the rendering thread, it points to something common between the two. I assume the only locks they share are on the communications channel, so that makes me look GC-ward.

tyronx commented 3 years ago

I believe the issue here is that perhaps the savegame system does too many random read/writes that are hard to handle for an HDD. Not sure why it would lag the server though. Would need to do a comprehensive test on that

Vam-Jam commented 3 years ago

Same thing is happening on my server linux server. Overloads seemingly at random, which spikes over 2 seconds.

The server is also sat on a HDD as a note.

Logs: server-main-snipped.txt

Vam-Jam commented 3 years ago

As an update: I've placed vs onto a ramdisk (along with the data folder) with plenty of room, this has fixed the issue. There's no more sudden overloading issues

For now this is fine since the world file isn't too big, and has backups every so often with a bash script.

Netdata before: image Purple shows when disk is being used, when ever it spiked, it should cause said lag

After: image No more spikes, server isnt complaining about x overloading any more.

Eloraam commented 3 years ago

Any chance you're manually syncing the disk? Sync can take a very long time to come back on a Linux machine.

I do a fair bit of random writing in my own game engine (probably with a very similar pattern), and I don't see this kind of behavior.

Vam-Jam commented 3 years ago

Any chance you're manually syncing the disk? Sync can take a very long time to come back on a Linux machine.

I do a fair bit of random writing in my own game engine (probably with a very similar pattern), and I don't see this kind of behavior.

Nope, not seen something like this else where either.

Eloraam commented 3 years ago

Sorry, that was supposed to be directed at Tyron. Just curious if the engine executes an fsync() when it writes to disk. That would absolutely cause this.

tyronx commented 3 years ago

@Eloraam I'm not doing any savegame file i/o myself. It's all sqlite3 under the hood that takes care of that. It also runs in a separate thread so not sure yet why its locking up the main thread

Eloraam commented 3 years ago

That's a start. sqlite does use fsync() to maintain ACID, since it's a database engine.

The question, then: what happens to your engine if that separate thread freezes for a second or two in an sqlite call, and can't respond to any IPC from your main thread?

The sqlite docs mention a synchronous pragma that you could set to 'off' to turn off the flushing behavior entirely. It's a little less resistant to data loss during a system crash, but won't pause for a full disk sync.

tyronx commented 3 years ago

Yea I probably just need to sit down and investigate carefully. I guess some lock contention somewhere.

tyronx commented 3 years ago

So I had a closer look and simulated laggy disk write with some Thread.Sleep() and an autosave every 30 seconds, as my HDD did not seem to lag.

Not sure how close that is to the reported lag issues here, but I did find several lock contentions that should be fixed or at least mitigated in v1.14.3

Vam-Jam commented 3 years ago

Just letting you know, this issue does not seem to be fully fixed.

I recently moved the world save back to my HDD, and the issue still remains. Although it does seem to spike less then it did pre 1.14.3

Vam-Jam commented 3 years ago

The thing that's causing the lag spikes seems to be chunk unloading. Setting chunkunload to off stop's the random lag spikes (Turned it off for 30 mins, only got an overloading when world was being saved), turning it back on causes them to appear (normally with in a 30 seconds).