SirPlease / L4D2-Competitive-Rework

Just refreshing and optimizing the core files a bit, eh?
GNU General Public License v3.0

100% CPU usage and low tickrate after hours of server usage #665

Closed altair-sossai closed 1 year ago

altair-sossai commented 1 year ago

I have a VM on Azure using Ubuntu 20.04. After a few hours of server usage (somewhere between 4h and 5h) CPU usage is between 90% and 100%, causing a tickrate drop for all players.

Watch the video below (at minute 3:10 you can see the game crashing): https://www.youtube.com/watch?v=18fWPjI1J2A

The problem is only solved by restarting the virtual machine. The image below shows the CPU usage right after the restart:

image

I rule out a DDoS attack: I have already configured policies in Azure that allow only specific IPs, and the problem still occurs after a certain time, even with the ports closed to the internet.

Azure hosted VM settings:

Server file repository https://github.com/altair-sossai/l4d2-zone-server

Commands used to configure the server

cd /home
sudo mkdir steam
cd /home/steam
sudo dpkg --add-architecture i386
sudo apt-get update
sudo apt-get upgrade
sudo apt-get install ufw -y
sudo apt-get install libc6:i386
sudo apt-get install lib32z1
sudo apt-get install screen
sudo apt-get install wget -y
sudo wget http://media.steampowered.com/installer/steamcmd_linux.tar.gz
sudo tar -xvzf steamcmd_linux.tar.gz
sudo ./steamcmd.sh +force_install_dir ./l4d2/ +login anonymous +app_update 222860 validate +exit

Command used to start a server

sudo screen -d -m -S "37000" /home/steam/l4d2/srcds_run -game left4dead2 -port 37000 +sv_clockcorrection_msecs 25 -timeout 10 -tickrate 100 +map c1m1_hotel -maxplayers 32 +servercfgfile server.cfg

Does anyone have any idea what might be going on? Have you been through this?

altair-sossai commented 1 year ago

I made this post right after the problem occurred on my server, around 11 pm Brasilia time (Brazil). Around 3 am the same problem occurred again: low tickrate and CPU at 100% usage.

PaaNChaN commented 1 year ago

hello altair-sossai. l4d2 servers are very CPU dependent. they require a high frequency CPU to run at 100 tickrate. i believe a minimum clock speed of 3.2 GHz is required.

i have tested various hosts and there are servers that lag even at high clock speeds. the best remedy is to rent a dedicated server.

altair-sossai commented 1 year ago

I understand. However, what worries me is that it occurs after the server has been in use for a certain time. For example, yesterday I used the server with my friends and for 6 hours everything ran very well. At a certain point everyone had a low tickrate, and when I checked the CPU usage it was at 100%. I restarted the server, everyone connected again, and we continued for a few more hours with no problems. I could be completely wrong, but this sounds like a "memory leak" type of problem that builds up over time and causes excessive CPU usage. Anyway, I'm going to upgrade the server's hardware a bit to get a better processor, and I'll come back here in the next few days if the problem occurs again.
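One way to check the "memory leak" hypothesis is to log the game process's memory and CPU time at intervals and see which one actually grows before the slowdown. The following is a minimal sketch, assuming a Linux host with /proc; the function name `sample_proc` and the log file name are made up for illustration, and the process name `srcds_linux` in the usage note is an assumption about the binary that srcds_run launches.

```shell
# Print one sample for a process: timestamp, cumulative CPU ticks, resident memory.
# Args: pid
sample_proc() {
    pid="$1"
    [ -r "/proc/$pid/stat" ] || return 1
    # Drop the leading "pid (comm) " so a process name containing spaces cannot
    # shift the fields; utime and stime are then the 12th and 13th fields.
    ticks="$(sed 's/^.*) //' "/proc/$pid/stat" | awk '{ printf "%.0f\n", $12 + $13 }')"
    # VmRSS is the resident set size, reported in KiB.
    rss_kib="$(awk '/^VmRSS:/ { print $2 }' "/proc/$pid/status")"
    printf '%s cpu_ticks=%s rss_kib=%s\n' "$(date '+%Y-%m-%dT%H:%M:%S')" "$ticks" "$rss_kib"
}
```

A loop such as `while sample_proc "$(pgrep -o -f srcds_linux)"; do sleep 60; done >> srcds_usage.log` would record one sample per minute. If rss_kib stays flat while the tickrate drops, a memory leak becomes a less likely explanation.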

A1mDev commented 1 year ago

Restart the game server after each game to fix the problem

altair-sossai commented 1 year ago

When you say restart, is it just the srcds_run process or the operating system as a whole?

fantasylidong commented 1 year ago

You can install a plugin like linux_auto_restart.

theletterjwithadot commented 1 year ago

do you use !map to change map after a campaign ends? I recall seeing somewhere that changing maps via !map would cause such issues.

https://forums.alliedmods.net/showthread.php?p=2669850

altair-sossai commented 1 year ago

I always change the map through the admin menu. I did a quick search in the code and it seems to use sm_map, so I'm going to change that and run some tests to see if it solves the problem.

image

PaaNChaN commented 1 year ago

this plugin forces a restart when the server becomes empty. https://github.com/fbef0102/L4D1_2-Plugins/tree/master/linux_auto_restart

A1mDev commented 1 year ago

When you say restart, is it just the srcds_run process or the operating system as a whole?

I mean ending the srcds_run process; you can also reboot the whole server every morning. I use this method because you can't take care of everything. I'm sure the game developers themselves could have made a mistake somewhere that causes a memory leak, and it would be quite difficult to detect. I also don't think anyone here will be able to help you find the problem; it may take a long time. I'd advise you to pay attention to the 'Actions' extension :D

A1mDev commented 1 year ago

do you use !map to change map after a campaign ends? I recall seeing somewhere that changing maps via !map would cause such issues.

https://forums.alliedmods.net/showthread.php?p=2669850

I don't know how true this is; no one can confirm this memory leak. The game does have problems when the map is changed with 'changelevel' or with SourceMod's 'ForceChangeLevel' function, because some properties of the 'director' are not reset, but this is only noticeable in the scavenge game mode. By the way, this plugin has a bug: sometimes all survivors are dead at the start of the round and do not respawn.

altair-sossai commented 1 year ago

Thanks everyone for the help, I'll try two things.

  1. I'm going to avoid using the !map command to switch between campaigns, instead I'm going to use the game's native voting.
  2. After 2 or 3 games I will restart only the srcds_run process and not the machine
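Restarting only the game server, as in step 2, can be sketched as a small script around the screen session from the start command quoted earlier in the thread. This is a sketch under assumptions, not a tested procedure: the helper names `run` and `restart_server` and the `DRY_RUN` switch are made up for illustration, and it has to run as the same user that started the session (root, if it was started with sudo as above).

```shell
# Restart one srcds_run screen session without rebooting the machine.
# SESSION and START_CMD mirror the start command quoted earlier in the thread.
SESSION="37000"
START_CMD='/home/steam/l4d2/srcds_run -game left4dead2 -port 37000 +sv_clockcorrection_msecs 25 -timeout 10 -tickrate 100 +map c1m1_hotel -maxplayers 32 +servercfgfile server.cfg'

run() {
    # With DRY_RUN=1, print the command instead of executing it.
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

restart_server() {
    # Ask screen to terminate the old session; ignore the error if none exists.
    run screen -S "$SESSION" -X quit || true
    # Start a fresh detached session. START_CMD is intentionally left unquoted
    # so that it is split into separate arguments.
    run screen -d -m -S "$SESSION" $START_CMD
}
```

Calling `DRY_RUN=1 restart_server` prints the two screen commands without running them. After a real restart it is worth confirming that the old game process is gone before relying on this, since the script only asks screen to end the session.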

I'll do the test for a few days and come back here with news.

SirPlease commented 1 year ago

Ah, in our convo I didn't even think of this because I've had this set up for such a long time myself that I totally forgot about it. In my experience the servers would also get worse over time. The "auto restarter" linked by @PaaNChaN is pretty much the same implementation I use, and it works great. This would also prevent the need for you to intervene manually.

altair-sossai commented 1 year ago

I made some adaptations in the @PaaNChaN plugin to force the server to crash after N games. I'll test it for a few days and let you know, thank you very much.

https://github.com/altair-sossai/l4d2-zone-server/blob/d7d7e2248cad164cd1e51480ffdf7c42c1c6f9f8/addons/sourcemod/scripting/l4d2_crash_server.sp

draxios commented 1 year ago

Agreed with everyone else: it's good to restart the server daily, and the game server after every completed game, to avoid tickrate drops.

There's another implementation of a fix from Dragokas that has solved this issue for me. https://forums.alliedmods.net/showthread.php?p=2646280 https://github.com/draxios/bizzymod/blob/main/addons/sourcemod/scripting/sm_RestartEmpty.sp

SirPlease commented 1 year ago

Any updates? @altair-sossai

altair-sossai commented 1 year ago

Apparently the problem has been resolved, I'm crashing the server every 3 games

SirPlease commented 1 year ago

Glad to hear it! 😄 I'll go ahead and close the issue.

altair-sossai commented 1 year ago

After some thorough investigation into the problem I was experiencing with tickrate drops, I've managed to identify the cause and resolve the issue. I wanted to share my solution in case others encounter a similar issue in the future.

The problem arose when running on the Azure B2s machine. I noticed that, after some time, the server would consume 100% of the CPU resources, leading to a drop in tickrate. This occurred because the B2s machine, a burstable size, ran out of "CPU Credits" and was throttled.

Solution: I upgraded the machine from B2s to F2s v2 in Azure. After the upgrade, the server was able to function properly, without hitting 100% CPU usage or experiencing tickrate drops. The F2s v2 machine, unlike the B2s, provides dedicated CPU performance, which allowed the server to utilize more resources without running out of CPU credits.

Therefore, if you are experiencing a similar issue, I recommend considering an upgrade to a machine that offers dedicated CPU performance, rather than a CPU credit system.
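The "fine for a few hours, then throttled" pattern follows from simple credit arithmetic. The sketch below assumes the definition that burstable offerings commonly publish, one credit being one vCPU at 100% for one minute; the function name and the numbers in the usage note are hypothetical, so check the provider's documentation for the real balance and earn rate of a given size.

```shell
# Estimate hours until a burstable VM's CPU credit balance reaches zero.
# Assumption: one credit = one vCPU at 100% utilization for one minute.
# Args: balance  credits_earned_per_hour  vcpus  avg_utilization_percent_per_vcpu
hours_until_throttled() {
    awk -v bal="$1" -v earn="$2" -v n="$3" -v util="$4" 'BEGIN {
        spend = n * (util / 100) * 60   # credits consumed per hour
        net = spend - earn              # net drain per hour
        if (net <= 0) { print "never"; exit }
        printf "%.1f\n", bal / net
    }'
}
```

For example, a hypothetical balance of 60 credits, earning 24 credits per hour, with 2 vCPUs averaging 50% each, gives `hours_until_throttled 60 24 2 50`, which prints 1.7. A size with dedicated CPU has no such balance, which is why the upgrade removes the time-dependent slowdown.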

l31rb4g commented 11 months ago

Hello,

I'm writing to report that I encountered the same issue that Altair described. I set up the server on an AWS t2 instance. It ran smoothly for approximately one hour, and then I experienced a huge drop in tickrate, with a corresponding rise in ping, similar to a DDoS attack. I was certain it wasn't an attack, so I began investigating other potential causes. Upon examining the charts, I made a significant discovery:

AWS t2 instances come with a "CPU Credit Balance," which was quickly depleted due to the server's high CPU usage. After about an hour, the balance had nearly reached zero, and the instance started experiencing a CPU limit. To resolve this issue, I switched the instance type to c6a, which doesn't have this credit balance, and that effectively solved the problem.
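Besides the provider's charts, throttling of this kind can sometimes be seen from inside the guest as "steal" time in /proc/stat. Whether a given hypervisor reports it is an assumption to verify; the thread itself only used the cloud console graphs. A minimal sketch, with made-up function names:

```shell
# Print "<steal> <total>" CPU ticks from the aggregate "cpu" line of /proc/stat.
# Fields after the label: user nice system idle iowait irq softirq steal ...
# Args: optional path to a stat file (defaults to /proc/stat)
read_cpu_ticks() {
    awk '$1 == "cpu" { printf "%.0f %.0f\n", $9, $2 + $3 + $4 + $5 + $6 + $7 + $8 + $9 }' "${1:-/proc/stat}"
}

# Percentage of CPU time taken by the hypervisor between two samples.
# Args: steal_before total_before steal_after total_after
steal_percent() {
    awk -v s0="$1" -v t0="$2" -v s1="$3" -v t1="$4" 'BEGIN {
        dt = t1 - t0
        if (dt <= 0) { print "0.0"; exit }
        printf "%.1f\n", 100 * (s1 - s0) / dt
    }'
}

# Usage: take two samples a few seconds apart and compare them.
#   before="$(read_cpu_ticks)"; sleep 5; after="$(read_cpu_ticks)"
#   steal_percent $before $after
```

A steal percentage that jumps from near zero to a large value at the moment the tickrate drops would point to the instance being throttled rather than to the game server itself.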

Hope this helps someone.