bitraf / infrastructure

Infrastructure stuff for Bitraf's sysadmins

p2k16 - sometimes the disk fills up, and p2k16 stops working #138

Open tingox opened 4 years ago

tingox commented 4 years ago

Sometimes (not very often) the disk drive of the p2k16 server fills up. This is bad, because then the p2k16 web app stops working.

The server p2k16 runs two services related to the PostgreSQL database, postgresql@10-main.service and postgresql-base-backup@10-main.service, plus the timer postgresql-base-backup@10-main.timer. We also have monitoring (via riemann), but nobody watches that on a regular basis (not sure if anyone gets alarm notifications).
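
For reference, the state of these units and the timer can be checked with plain systemctl (nothing here is specific to our setup beyond the unit names above):

systemctl status postgresql@10-main.service postgresql-base-backup@10-main.service
systemctl list-timers 'postgresql-base-backup@*'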

tingox commented 4 years ago

The directory /var/lib/postgresql/backups/ was filling up with database backups, causing the disk to fill. I cleaned out a few files, and the disk is in better shape now:

root@p2k16:~# df -h /
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda1        20G  8.9G  9.8G  48% /

and then I restarted postgres via systemctl restart postgresql@10-main.service.
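
For next time, a rough sketch of that cleanup as commands; the 14-day cutoff is only an example, and whether the entries under backups/ are files or directories (and what we actually want to keep) should be checked before running it:

# remove backup entries older than 14 days, then restart postgres
sudo find /var/lib/postgresql/backups/ -mindepth 1 -maxdepth 1 -mtime +14 -exec rm -rf -- {} +
sudo systemctl restart postgresql@10-main.service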

tingox commented 4 years ago

There is a backup service for postgres; I haven't restarted it:

root@p2k16:~# systemctl status postgresql-base-backup@10-main.service
● postgresql-base-backup@10-main.service - PostgreSQL base backup
   Loaded: loaded (/etc/systemd/system/postgresql-base-backup@.service; static; vendor preset: enabled)
   Active: failed (Result: exit-code) since Sun 2020-04-26 04:00:51 CEST; 7h ago
  Process: 19137 ExecStart=/usr/bin/env bash -c i="10-main"; i=${i/-//}; bin/envdir /etc/wal-e/10-main-env.
 Main PID: 19137 (code=exited, status=1/FAILURE)

Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.

Should this service be running, or should we stop it?

tingox commented 4 years ago

p2k16-staging had the same problem, so I did the same there: cleaned out most of the files from the db backup directory, then restarted postgresql.

rkarlsba commented 4 years ago

Perhaps it would be nice to have some monitoring on that, with email alerts to those who run the system - something like zabbix? I have a zabbix VM running…

tingox commented 4 years ago

Monitoring is in place; what we are missing is somewhere good to send the alerts. Our "IT operations group" runs on a volunteer basis...

rkarlsba commented 4 years ago

Where can I see this monitoring status?

tingox commented 4 years ago

Monitoring is at riemann.bitraf.no.

tingox commented 4 years ago

It happened again: the disk on p2k16 filled up with postgres database backups, the postgres service failed, and that took p2k16 down with it. The disk-full error was dutifully recorded by riemann.bitraf.no, but nobody looked at it. The fix was the usual one: clean out database backups from /var/lib/postgresql/backups/, then restart postgres with sudo systemctl restart postgresql@10-main.service.

tingox commented 4 years ago

Perhaps we should add a separate (virtual) disk drive for database backups to the p2k16 server. Or better: make sure that the backups go to another server instead. Hmm.
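
If we go the separate-disk route, a rough sketch of the steps; the device name /dev/vdb and the ext4 filesystem are assumptions, and the backup units should be paused while data is moved:

# stop the backup timer while moving things around
sudo systemctl stop postgresql-base-backup@10-main.timer
sudo mkfs.ext4 /dev/vdb
sudo mount /dev/vdb /mnt
sudo mv /var/lib/postgresql/backups/* /mnt/
sudo umount /mnt
echo '/dev/vdb /var/lib/postgresql/backups ext4 defaults 0 2' | sudo tee -a /etc/fstab
sudo mount /var/lib/postgresql/backups
sudo chown postgres:postgres /var/lib/postgresql/backups
sudo systemctl start postgresql-base-backup@10-main.timer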

tingox commented 4 years ago

Cleaned the backup directory on p2k16-staging; it's in better shape now:

tingo@p2k16-staging:~$ df -h /
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda1        20G  8.9G  9.8G  48% /

then I restarted postgres with sudo systemctl restart postgresql@10-main.service. That's all.

tingox commented 4 years ago

Cleaned the backup directory on p2k16-staging again:

tingo@p2k16-staging:~$ df -h /
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda1        20G   11G  7.9G  58% /

and restarted postgres.

tingox commented 4 years ago

Another cleaning of the postgres backup directory on p2k16-staging today:

tingo@p2k16-staging:~$ df -h /
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda1        20G  9.9G  8.7G  54% /

plus a restart of postgres.

tingox commented 3 years ago

p2k16 had a full disk again. As usual, I cleaned out /var/lib/postgresql/backups/, then restarted postgres with sudo systemctl restart postgresql@10-main.service. Better now:

tingo@p2k16:~$ df -h /
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda1        20G  8.3G   11G  45% /

omega commented 3 years ago

We could probably extend the backup service to retain only a set number of base backups, or have a different timer that keeps only N base backups.

https://github.com/bitraf/infrastructure/blob/3dabb24622ab3cedaa7a1fbb93f1325a6941e69c/shared-roles/postgresql-wal-e/tasks/main.yml#L85

That seems to be the code that installs the wal-e backup service; doing something similar for delete might be good enough:

ExecStart=/usr/bin/env bash -c 'i="%i"; i=${i/-//}; bin/envdir /etc/wal-e/%i-env.d bin/wal-e delete --retain 5 --confirm'
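
As a manual test of that idea, something along these lines could be run first, from whatever WorkingDirectory the existing postgresql-base-backup@.service uses (bin/envdir and bin/wal-e are relative paths there). This is only a sketch: as far as I remember, wal-e's delete is a dry run unless --confirm is passed, and its docs describe retain as a positional operator rather than a --retain flag, so the exact syntax should be checked against the installed version.

# dry-run prune keeping the 5 newest base backups; add --confirm once the output looks right
sudo -u postgres bash -c 'bin/envdir /etc/wal-e/10-main-env.d bin/wal-e delete retain 5'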

flexd commented 3 years ago

How about we just send notifications to Slack? https://riemann.io/api/riemann.slack.html Easy enough, and lots of people around to see it.

rkarlsba commented 3 years ago

It has been mentioned before. This won't change the fact that few people have access to fix the issues when they happen, but at least it will alert people earlier, so that whoever can fix it has time to do so. Just, please, make sure it won't trigger an alarm for everything. Too many false positives will ruin the whole project.

flexd commented 3 years ago

> It has been mentioned before. This won't change the fact that few people have access to fix the issues when they happen, but at least it will alert people earlier, so that whoever can fix it has time to do so. Just, please, make sure it won't trigger an alarm for everything. Too many false positives will ruin the whole project.

Better for everyone to be notified, so that someone will take action or mention it to someone who can, than to have it fail silently because nobody manually checked the monitoring.

tingox commented 3 years ago

p2k16 - full disk again today. Cleaned out /var/lib/postgresql/backups/, then restarted postgres with sudo systemctl restart postgresql@10-main.service as usual. Good to go for a while again:

tingo@p2k16:~$ df -h /
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda1        20G  8.6G   11G  46% /

tingox commented 3 years ago

Maintenance this evening, cleaned out /var/lib/postgresql/backups/ on p2k16 before it gets full again.

tingo@p2k16:~$ df -h /
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda1        20G   12G  6.7G  65% /

That should keep us going for a few weeks, I think.

haavares commented 3 years ago

Considering how critical p2k16 is for the whole operation, is it really acceptable to have only 6.7G free and for the disk to fill up every few weeks?

I would say no, and that we should invest in the equipment needed to keep this from happening.

haavares commented 3 years ago

Completely agree, but I don't feel competent to spec it out and set it up. I will, however, gladly approve purchases that reduce the risk of downtime.

Thomas

jenschr commented 3 years ago

But what is it we are actually logging so aggressively? This can't be about normal use of the system (check-in/check-out). Something more must be getting logged to reach many gigabytes in just a couple of weeks? I have never looked at the logs, but I suspect more hardware isn't needed here - rather an optimization of what gets logged, so that what ends up in the logs is useful.

rkarlsba commented 3 years ago

It should just be a matter of setting up log rotation, or possibly sending the logs to another server first. Or maybe even better - log in parallel to another server and keep a short rotation locally. @jenschr my guess is that it could be the web server log.

tingox commented 3 years ago

Before this goes completely off the rails: what is filling up the disk is database backups - it has nothing to do with logs. As far as I know, p2k16 uses the database in a perfectly ordinary way - not particularly intensively.

rkarlsba commented 3 years ago

Sorry, but then it should surely be possible to send that backup to another server and simply overwrite old backups locally instead?

tingox commented 3 years ago

> Sorry, but then it should surely be possible to send that backup to another server and simply overwrite old backups locally instead?

Of course it is possible - but it requires that people with the right skills (and spare time) sit down and actually do the job. Some of us have tried to build a solution to limit the number of local backups (see #144), without quite getting there. I am definitely no PostgreSQL expert, so I have nothing more to contribute on that front.

rkarlsba commented 3 years ago

Suggestion: take a regular dump to a directory and you have a single file, typically pg_dump -Fc dbname > dbname.dump. -Fc is --format=custom, which lets you restore individual tables and the like without much hassle; it also gzips the data for you. Running a dump without -Fc works too and makes no difference in this context, but I wanted to mention it. This typically runs once a day, so set up logrotate to simply rotate that file (without compressing it further) as if it were a log file. logrotate doesn't look at the contents anyway, and the configuration is simple.
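
As a concrete (untested) sketch of that suggestion; the database name, paths and retention count below are guesses and would need adjusting to the actual setup:

# daily dump, e.g. from a cron job or systemd timer running as the postgres user
pg_dump -Fc p2k16 > /var/lib/postgresql/dumps/p2k16.dump

# /etc/logrotate.d/p2k16-dbdump - rotate the dump like a log file, keep 7 generations
/var/lib/postgresql/dumps/p2k16.dump {
    daily
    rotate 7
    nocompress
    missingok
}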

tingox commented 3 years ago

It didn't last as long as I had hoped; today the disk was full again, so p2k16 stopped letting people open the door. Cleaned up and restarted postgresql. Disk space looks better now:

tingo@p2k16:~$ df -h /
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda1        20G  7.7G   11G  42% /

but it probably won't last two weeks.