colouring-cities / colouring-britain

Developed out of the Colouring London prototype. Collecting data on Britain's buildings and testing new core features
https://colouringbritain.org/
GNU General Public License v3.0

Investigate issue with Production Server Crashing #328

Open mdsimpson42 opened 5 months ago

mdsimpson42 commented 5 months ago

The production server is intermittently crashing. Rebooting the machine is the easiest fix for the problem, but it doesn't seem to be a long-term solution.

Immediate suspects are:

@matkoniecz, can you use this issue to keep track of the things you've been doing to investigate the problem, please?


Virtual machines

The main issue here is that increasing the size of the machine doubles the price!

It might be a good idea to evaluate how much memory we actually need and select a different model of VM that will give us the performance that we need (and keep an eye on it as we add more of Britain to the application).

This is arguably a separate issue, but it will depend on whether there is a bug in the application or whether the app has simply become too big to run on that B2 VM.

matkoniecz commented 5 months ago

journalctl -o short-precise -k -b -1 seems to show that there are memory issues; see, for example:

Mar 27 09:59:59.183338 cl-production kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/pm2-nodeapp.service,task=node,p>
Mar 27 09:59:59.184149 cl-production kernel: Out of memory: Killed process 181979 (node) total-vm:3489228kB, anon-rss:2315820kB, file-rss:0kB, shmem-rss:0kB, UID:997 pgtables:8344kB oom_sc>
Mar 27 10:00:00.951168 cl-production kernel: loop10: detected capacity change from 0 to 8
Mar 27 11:02:11.849124 cl-production kernel: hv_utils: Shutdown request received - graceful shutdown initiated
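The OOM-killer entries can be pulled out of the kernel log mechanically. A minimal sketch, assuming the "Out of memory: Killed process PID (name) ... anon-rss:NkB" line format shown above and a process name without spaces; the function name summarise_oom_kills is hypothetical:

```shell
#!/bin/bash
# Summarise OOM-killer events from kernel log lines read on stdin.
# Prints one line per kill: timestamp, pid, process name, anon-rss in MiB.
# Assumes the "Out of memory: Killed process ..." format seen in the log above.
summarise_oom_kills() {
    sed -n -E 's/^(.*) [^ ]+ kernel: Out of memory: Killed process ([0-9]+) \(([^)]*)\).*anon-rss:([0-9]+)kB.*$/\1 \2 \3 \4/p' |
        while read -r month day time pid name rss_kb; do
            # Integer division: the result is rounded down to whole MiB.
            echo "$month $day $time pid=$pid name=$name anon_rss_mib=$((rss_kb / 1024))"
        done
}
```

It could be fed with journalctl -o short-precise -k -b -1 --no-pager | summarise_oom_kills. Using --no-pager also stops the pager from cutting long lines, which is probably what produced the trailing > in the output above.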

Initial suspicions are:

I was on a call with Stuart and there was a suggestion to

Right now the production server has crude memory-use logging in its crontab:

* * * * * /bin/echo $(/usr/bin/date +\%s),$(vmstat -snt|grep "free memory") >> /home/cladmin/memory.log 2> /home/cladmin/memory.log.error

/usr/sbin/sendmail now contains

#!/bin/bash
cat >>/home/cladmin/cron.log

so that cron errors can be read. This way I (re)learned that an unescaped % in a crontab command is treated as a newline, so it needs to be escaped as \%.
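One way to avoid % escaping altogether is to keep the command in a script and have the crontab entry call only that script. A minimal sketch, assuming a Linux /proc/meminfo; it reads the kernel's figures directly instead of parsing vmstat (a swap made here for simplicity), and the function names meminfo_kb and memory_sample are hypothetical:

```shell
#!/bin/bash
# Print the value (in kB) of one /proc/meminfo field, e.g. MemFree.
# $1: field name; $2: meminfo-format file (defaults to /proc/meminfo).
meminfo_kb() {
    awk -v field="$1:" '$1 == field { print $2 }' "${2:-/proc/meminfo}"
}

# Print one CSV sample: "<epoch seconds>,<value in kB>".
# The % in the date format needs no escaping here, because this is
# an ordinary shell script rather than a crontab line.
memory_sample() {
    echo "$(date +%s),$(meminfo_kb "$@")"
}
```

Saved as an executable script that ends with a call such as memory_sample MemFree, this could be run from a crontab entry that contains only the script path and the output redirections, so no % is left in the crontab to escape.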

Also, the crontab is now backed up at https://github.com/colouring-cities/colouring-london-config/blob/main/prod_crontab (previously, AFAIK, this config was not recorded anywhere).

Right now I am running a manually triggered backup script to find out whether it is the trigger.

matkoniecz commented 5 months ago

I switched the memory logger: logging free memory was obviously pointless, because memory that has been used once is counted as buff/cache (and so not as free) even though it can be reused.

Now it is

* * * * * /bin/echo $(/usr/bin/date +\%s),$(vmstat -snt|grep "K active memory") >> /home/cladmin/active_memory.log 2> /home/cladmin/active_memory.log.error
* * * * * /bin/echo $(/usr/bin/date +\%s),$(vmstat -snt|grep "K used memory") >> /home/cladmin/used_memory.log 2> /home/cladmin/used_memory.log.error
* * * * * /bin/echo $(/usr/bin/date +\%s),$(vmstat -snt|grep "K total memory") >> /home/cladmin/total_memory.log 2> /home/cladmin/total_memory.log.error
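To judge how close the machine gets to its limit, the peak can be read back out of used_memory.log. A minimal sketch, assuming each log line has the form "<epoch seconds>, <N> K used memory" (which is what the echo above should produce); the function name peak_memory_sample is hypothetical:

```shell
#!/bin/bash
# Read memory log lines on stdin and print the sample with the
# highest value as "<epoch seconds>,<kB>". Prints nothing for empty input.
peak_memory_sample() {
    awk -F',' '
        {
            kb = $2 + 0            # numeric prefix of e.g. " 2315820 K used memory"
            if (kb > max) { max = kb; ts = $1 }
        }
        END { if (ts != "") print ts "," max }
    '
}
```

It could be run as peak_memory_sample < /home/cladmin/used_memory.log, and with GNU date the timestamp can be made readable via date -d @<epoch seconds>. Comparing the peak with the value in total_memory.log gives a rough idea of the headroom left.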

mdsimpson42 commented 5 months ago

The only log of memory use I've found on Azure so far is the basic metrics: on the VM, under Metrics -> "Available Memory Bytes".

image

mdsimpson42 commented 5 months ago

I think there is a way to get access to more data, but it requires enabling "Insights". Currently, these aren't enabled (I believe it's a new feature that was in Beta until recently). I'll double-check whether it costs anything and, if not, I'll enable them now, so we should have more detailed data going forward.

mdsimpson42 commented 5 months ago

I don't think the Insights are free; they will add an (allegedly) small cost to the subscription. I can't find any quotes or estimates for how much it will cost.

I'll enable them for now, as they could be very useful, and see how much the cost is. We can always shut them down if they're too expensive (or once we've fixed this problem).