CntoDev / central-plaza

Repository for the documentation that isn't related to a specific project and organization of new projects/ideas.

General analysis of our game server performance & maintenance #6

Closed milivojm closed 3 years ago

milivojm commented 3 years ago

General analysis of our game server performance. The hardware is from 2016, and a lot of stuff has been installed, deleted and rebuilt over time, which could have some performance impact if there are any superfluous left-overs hogging resources. We could also look into setting up multiple Headless Clients. During the Vietnam finale op I somehow managed to crash the HC in the thick of the fighting. I assume all the units were assigned to a single HC when I ticked the "move to HC" box in EDEN.

enrico-ghidoni commented 3 years ago

I can take this on, if you can assign the task to me.

It is currently blocked until we sort out VPN access (or a secondary means of access) to the server.

JamesTheClarke commented 3 years ago

Talked to @Shiny-CNTO yesterday. He can give limited VPN access, and the access can be bound to a specific date, i.e. define a project scope and then get access. That way we can limit the number of server admins all using the same account and potentially kicking each other out via Remote Access all the time.

Please contact Shiny directly via PM and he'll sort you out after he's back from vacation 20th of January onwards.

enrico-ghidoni commented 3 years ago

Alright. To get things started, I'd like to get a general idea on what exactly we want to monitor: do we care only about the raw hardware resources and their usage in different scenarios (i.e. in order to understand if there's room for more headless clients) or do we also want to get specific data out of the Arma server instance?

Cc @JamesTheClarke. I'll also tag @freghar, since AFAIK he has better knowledge of the latter scenario.

JamesTheClarke commented 3 years ago

Here's my personal priority list for this project:

1.) Are there any glaring red flags? Old unused apps that somehow hog resources, security issues etc.
2.) Is there quick-win optimisation potential? Bat file configs, improved HC setup (afaik we only use a single HC and I managed to crash it once last year) etc.
3.) Are there any larger overhaul projects that could improve our game server performance?

freghar commented 3 years ago

I think we should split analysis of the server itself and a human look at RPT / other factors ... and actual analysis/benchmarking of mods, which I thought was covered by https://github.com/CntoDev/cnto-assets/issues/9, which needs to be moved to this "ideas" repo.

edit: For the server side, maybe we could set up logging via the built-in Windows resource monitor. It would be useful to see if we're exceeding max RAM and the system has to swap a lot.

Shiny-CNTO commented 3 years ago

We can set up a notification issue that sends a message to email/app when cpu or ram or hardware disk space is above a certain threshold. The other thing we can do is move certain things to another VM if it hogs too much. However I never saw issues with CPU or RAM usage. We never had any lags due to a process taking up too much (outside of ARMA).
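
A minimal sketch of the threshold check described here, for disk space only. It assumes Python is available on whatever machine runs the check, which nothing in this thread confirms, and the function names and the 90% default are made up for illustration. CPU and RAM would need a different data source.

```python
import shutil
from typing import Optional


def disk_usage_percent(path: str) -> float:
    """Return how full the volume containing `path` is, as a percentage."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def exceeds_threshold(value: float, threshold: float) -> bool:
    """True when a metric has reached or passed its alert threshold."""
    return value >= threshold


def check_disk(path: str, threshold_percent: float = 90.0) -> Optional[str]:
    """Return an alert message if the disk is too full, otherwise None.

    The 90% default is an arbitrary placeholder, not an agreed value.
    """
    percent = disk_usage_percent(path)
    if exceeds_threshold(percent, threshold_percent):
        return (
            f"Disk usage on {path} is at {percent:.1f}% "
            f"(threshold {threshold_percent:.0f}%)"
        )
    return None
```

`check_disk` only produces the message. How it gets delivered (email, app, Discord) is a separate decision.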

I managed to consistently crash our server during the finale of the False Gods campaign btw. This was due to the amount of enemies that scaled up with players. It worked fine with 4 players but with 14 it was too much.

milivojm commented 3 years ago

Think our server is just fine regarding all things, we have trouble with Arma, not so much with hardware (if at all). But, you can setup windows performance counters or go completely wild with PRTG (which is not free, maybe some free alternative exists) - @Didr

Didr commented 3 years ago

"Think our server is just fine regarding all things, we have trouble with Arma, not so much with hardware (if at all). But, you can setup windows performance counters or go completely wild with PRTG (which is not free, maybe some free alternative exists) - @Didr"

We could use Grafana with Telegraf to collect the data. Especially if we think there's any gain in plotting CPU usage and shots fired on the same graph. 😁 It could also be used to post a message on Discord if any metric becomes too large (or too small), like Shiny is talking about.

Will the performance monitoring just be temporary? If it is, then I'd go for the simpler Windows Performance Monitor.

milivojm commented 3 years ago

Well, there is no reason for it to be temporary, honestly. I'd go for longer-term server monitoring. Even two years from now, we want to know in advance that the server is about to die.

Grafana is nice, probably not necessary. But still nice. Alerts are what we need, custom thresholds and all that jazz... Let's not go overboard with this. :)

AFAIK you deployed it on your private server? That's a component dependency that we need to avoid. We don't want private infrastructure to be used for CNTO because you might wanna do something else next week with your server and we're left without this.

Didr commented 3 years ago

"AFAIK you deployed it on your private server? That's a component dependency that we need to avoid. We don't want private infrastructure to be used for CNTO because you might wanna do something else next week with your server and we're left without this."

Yupp, I was thinking about this. It'd be a blocker until Grafana is moved to CNTO infrastructure.

enrico-ghidoni commented 3 years ago

I'd go for the threshold alerts first, but I think collecting the data would be interesting and helpful both for troubleshooting crashes and for deciding whether to add more HCs. The more data the merrier, but a dashboard can be set up at a later time, so that we can both monitor the server status at any time and analyse the data together with what the Arma instance generates.

enrico-ghidoni commented 3 years ago

Solution A.

I looked a bit into Telegraf+InfluxDB as a solution. It would fit both the short-term and long-term requirements perfectly. The only cost is maintenance, since we'd need to run the DB on our own infrastructure (going with the self-hosted, open source solution).

Features:

What we would need:

Pros:

Cons:

All in all this could be a complete solution, although a bit overkill and time-consuming to set up and perhaps maintain. It would certainly allow for expansion if we plan more monitoring during different Arma missions/campaigns/mods testing.

enrico-ghidoni commented 3 years ago

Solution B.

Use Windows Performance Monitor to aggregate basic hardware data (CPU, memory, storage) and activate email alerts when a threshold is exceeded. I couldn't find any direct integration with other services, but it is possible to run PowerShell scripts instead of sending an e-mail alert, and those could post to a more direct means of communication for the community (e.g. Discord).

As far as I understand, this would not give us much room to expand the analysis to other data sources, but it would definitely be the quickest way to implement the first step as discussed.
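
For the "script that talks to Discord" part, here is a rough sketch of posting an alert through a Discord webhook. It is written in Python purely for illustration (the thread only mentions PowerShell on the server), and the function names, the User-Agent string and the webhook URL are placeholders. The one thing it relies on is that a Discord webhook accepts a JSON POST with a `content` field of up to 2000 characters.

```python
import json
import urllib.request

# Discord rejects messages whose content is longer than 2000 characters.
DISCORD_CONTENT_LIMIT = 2000


def build_alert_request(webhook_url: str, message: str) -> urllib.request.Request:
    """Build (but do not send) the HTTP request for a Discord webhook alert."""
    payload = json.dumps({"content": message[:DISCORD_CONTENT_LIMIT]})
    return urllib.request.Request(
        webhook_url,
        data=payload.encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Hypothetical client name; set explicitly rather than relying
            # on the default urllib User-Agent.
            "User-Agent": "cnto-alert-sketch/0.1",
        },
        method="POST",
    )


def send_alert(webhook_url: str, message: str) -> int:
    """Send the alert and return the HTTP status code."""
    request = build_alert_request(webhook_url, message)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

The webhook URL would come from the channel's integration settings and should be kept out of any public repo.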

Shiny-CNTO commented 3 years ago

Neither of these does what this issue was created for, though.

"1.) Are there any glaring red flags? Old unused apps that somehow hog resources, security issues etc. 2.) Is there quick-win optimisation potential? Bat file configs, improved HC setup (afaik we only use a single HC and I managed to crash it once last year) etc. 3.) Are there any larger overhaul projects that could improve our game server performance?"

We fixed 1. There are none. We looked at 2. I believe that is still under investigation? And 3 is a no.

So I doubt we need any monitoring software. We never run into issues. CPU and RAM are never full. Only the HDD is. Yes, we can set up monitoring on it, but instead we can also just do this: https://docs.microsoft.com/en-us/troubleshoot/windows-server/backup-and-storage/configure-low-disk-space-alert-performance-logs. It's built into Windows.

enrico-ghidoni commented 3 years ago

"We can set up a notification issue that sends a message to email/app when cpu or ram or hardware disk space is above a certain threshold."

"So I doubt we need any monitoring software."

Pick one then ;)

milivojm commented 3 years ago

@enrico-ghidoni - if you know how to do solution A, I'd be happy with that. Shiny, I agree this might not be needed. Let's give it a shot, and if the data backs up what we presume, we can turn it off.

PowerShell on the server is 2.0, I believe. Tried to upgrade but it's 2008 R2 and Microsoft...

JamesTheClarke commented 3 years ago

@enrico-ghidoni @milivojm I think a good lightweight solution could be a Discord notification that posts a message to #staff whenever either the tools or the game server goes down/is no longer pingable.

This could then be expanded to other services as need be in the future.
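
A rough sketch of what such a "is it still reachable" check could look like, using a plain TCP connection attempt. The function names and the service map are hypothetical. Note the limitation: this only works for services that listen on TCP (e.g. the tools/web side), and checking the game server itself would likely need a different kind of probe.

```python
import socket
from typing import Dict, List, Tuple


def is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures.
        return False


def down_services(services: Dict[str, Tuple[str, int]]) -> List[str]:
    """Return the names of all services that could not be reached."""
    return [
        name
        for name, (host, port) in services.items()
        if not is_reachable(host, port)
    ]
```

A scheduled job could call `down_services` with a map like `{"tools": ("tools.example.invalid", 443)}` (placeholder host) and post to #staff only when the returned list is non-empty.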

enrico-ghidoni commented 3 years ago

@JamesTheClarke for that purpose Lastmikoi and I had already worked on https://github.com/CntoDev/monitoring-scripts. It has a different purpose than monitoring the resources on the server, as it simply pings every community service (we can pick which though) and updates some sort of status dashboard. That, combined with https://github.com/CntoDev/cachspeak then notifies community members (on TeamSpeak but something different can be arranged).

So, if we want to re-introduce that kind of monitoring (i.e. detecting downed services) I would go for restoring the Cachet status page and the scripts Lastmikoi and I worked on to have push notifications.

Do note that it's something different from what we discussed about data collection (even regarding hw monitoring), as it only checks whether the service (our Arma server in this case) is reachable.

I'd take some more time to precisely define what we want to achieve and possible long term features, so we don't get ourselves locked in something that will not work.

EDIT: never mind, we've already got a status-page-to-Discord script: https://github.com/CntoDev/cachcord

Didr commented 3 years ago

"All in all this could be a complete solution, although a bit overkill and time-consuming to set up and perhaps maintain. It would certainly allow for expansion if we plan more monitoring during different Arma missions/campaigns/mods testing."

All good points. I'd like to note that the CNTR Stats and TS3 stats are already using InfluxDB with Grafana as the dashboard, so all of that already exists.

enrico-ghidoni commented 3 years ago

Alright, had a bit of a chat with @Didr as I was not aware that we already need to move the InfluxDB that the stats feature uses to our own infrastructure.

Regarding overall server performance monitoring and alerts on their usage, I would go with the Telegraf + InfluxDB since most of the cons I've highlighted in previous comments are basically non-existent considering our scenarios.

  1. We need to set up InfluxDB anyway
  2. Storage size required for the db can be kept low using retention policies

Also, there are alerts to Discord already that we can use out of the box. We can at a later time attach custom plugins for other stuff.
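
To put a rough number on the retention point above: with a fixed retention window, the amount of stored data is bounded by how many series we collect, how often we sample, and how long we keep the samples. The sketch below is only that arithmetic. The series count, interval and bytes-per-point figures in the usage note are placeholder assumptions, not measurements from our setup, and real on-disk size depends on how the database compresses the data.

```python
def retained_points(series_count: int, interval_seconds: int, retention_days: int) -> int:
    """Upper bound on stored points once the retention window is full."""
    samples_per_series = (retention_days * 86_400) // interval_seconds
    return series_count * samples_per_series


def estimated_megabytes(points: int, bytes_per_point: float) -> float:
    """Convert a point count into megabytes for an assumed per-point size."""
    return points * bytes_per_point / 1_000_000
```

For example, 20 series sampled every 10 seconds and kept for 30 days gives `retained_points(20, 10, 30) == 5_184_000` points. At an assumed 3 bytes per point that is about 15.6 MB, and halving the retention window halves the bound.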

milivojm commented 3 years ago

Ok, can we move this into To Do now? It seems it's clear enough.

enrico-ghidoni commented 3 years ago

Yes. I'll come up with a basic roadmap in the next few days. Also, I don't think we need a specific repo for this, except maybe for the various configuration files. The first thing to get started on is the migration, as previously mentioned.

Since I presume we agreed on the plan, I can assign the issue back to myself.

enrico-ghidoni commented 3 years ago

Okay, to recap the steps and people/teams involved I'll briefly list them here. Will move everything into a proper repo as soon as it's available.

Shiny-CNTO commented 3 years ago

Isn't it easier to move it to the main server so it doesn't have to have the connection and stuff set up?

enrico-ghidoni commented 3 years ago

I think it would be better to keep the main server for what's strictly necessary for Arma, i.e. the server instance itself and the repo.

enrico-ghidoni commented 3 years ago

Moving the implementation discussion here. Though I'd keep this one open for further doubts like what @Shiny-CNTO pointed out

enrico-ghidoni commented 3 years ago

Not needed since we have the dedicated repo. Closing.