Open tingox opened 4 years ago
The directory /var/lib/postgresql/backups/ was filling up with db backups, causing the disk to fill. I cleaned out a few files, and the disk looks better now:
root@p2k16:~# df -h /
Filesystem Size Used Avail Use% Mounted on
/dev/vda1 20G 8.9G 9.8G 48% /
and then I restarted postgres via systemctl restart postgresql@10-main.service.
There is a backup service for postgres; I haven't restarted it:
root@p2k16:~# systemctl status postgresql-base-backup@10-main.service
● postgresql-base-backup@10-main.service - PostgreSQL base backup
Loaded: loaded (/etc/systemd/system/postgresql-base-backup@.service; static; vendor preset: enabled)
Active: failed (Result: exit-code) since Sun 2020-04-26 04:00:51 CEST; 7h ago
Process: 19137 ExecStart=/usr/bin/env bash -c i="10-main"; i=${i/-//}; bin/envdir /etc/wal-e/10-main-env.
Main PID: 19137 (code=exited, status=1/FAILURE)
Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.
should this service be running, or should we stop it?
p2k16-staging also had the same problem, so I did the same there: cleaned out most files from the db backups, then restarted postgresql.
Perhaps it would be nice to have some monitoring on that, with email alerts to those who run the system - something like zabbix? I have a zabbix VM running…
monitoring is in place; what we lack is a good place to send the alerts. Our "IT operations group" runs on a volunteer basis...
Where can I see this monitoring status?
monitoring is at riemann.bitraf.no
It happened again; the disk of p2k16 filled up with postgres database backups, and the postgres service failed, causing p2k16 to fail. The disk full error was dutifully recorded by riemann.bitraf.no, but nobody looked at it.
Fix: the usual one - clean out database backups from /var/lib/postgresql/backups/, then restart postgres with sudo systemctl restart postgresql@10-main.service.
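For reference, the "usual fix" boils down to a couple of commands. A minimal sketch, assuming the backups are plain files or directories directly under /var/lib/postgresql/backups/ and that anything older than a week is safe to remove (the 7-day cut-off is an assumption, not something decided in this thread):
# dry run: list candidates older than 7 days before deleting anything
sudo find /var/lib/postgresql/backups/ -mindepth 1 -maxdepth 1 -mtime +7 -print
# actually remove them
sudo find /var/lib/postgresql/backups/ -mindepth 1 -maxdepth 1 -mtime +7 -exec rm -rf {} +
# restart postgres so it recovers from the full-disk condition
sudo systemctl restart postgresql@10-main.service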
perhaps we should add a separate (virtual) disk drive for database backups to the server p2k16. Or better: make sure that the backups go to another server instead. Hmm.
cleaned the backup directory on p2k16-staging, better now
tingo@p2k16-staging:~$ df -h /
Filesystem Size Used Avail Use% Mounted on
/dev/vda1 20G 8.9G 9.8G 48% /
then I restarted postgres with
sudo systemctl restart postgresql@10-main.service
that's all
cleaned backup directory on p2k16-staging again
tingo@p2k16-staging:~$ df -h /
Filesystem Size Used Avail Use% Mounted on
/dev/vda1 20G 11G 7.9G 58% /
and restarted postgres
Another cleaning of the postgres backup directory on p2k16-staging today:
tingo@p2k16-staging:~$ df -h /
Filesystem Size Used Avail Use% Mounted on
/dev/vda1 20G 9.9G 8.7G 54% /
plus a restart of postgres.
p2k16 had a full disk again. As usual, I cleaned out /var/lib/postgresql/backups/, then restarted postgres with sudo systemctl restart postgresql@10-main.service. Better now:
tingo@p2k16:~$ df -h /
Filesystem Size Used Avail Use% Mounted on
/dev/vda1 20G 8.3G 11G 45% /
We could probably extend the backup service to only retain a set number of base backups, or have a separate timer that retains only N base backups.
This seems to be the code that installs the wal-e backup service; doing something similar for delete might be good enough:
ExecStart=/usr/bin/env bash -c 'i="%i"; i=${i/-//}; bin/envdir /etc/wal-e/%i-env.d bin/wal-e delete --confirm retain 5'
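A minimal sketch of how that could look as a separate oneshot unit plus timer, mirroring the quoted ExecStart pattern. The unit name postgresql-backup-prune@, the daily schedule, and the retention count are made up for illustration, and this assumes the wal-e prefix configured in /etc/wal-e/%i-env.d points at wherever the base backups actually end up (wal-e's documented form is "delete [--confirm] retain N"):
# /etc/systemd/system/postgresql-backup-prune@.service (hypothetical unit name)
[Unit]
Description=Prune wal-e base backups for %i, keeping the newest 5

[Service]
Type=oneshot
# bin/envdir and bin/wal-e are relative paths, as in the existing unit;
# the real unit presumably sets WorkingDirectory accordingly
ExecStart=/usr/bin/env bash -c 'i="%i"; i=${i/-//}; bin/envdir /etc/wal-e/%i-env.d bin/wal-e delete --confirm retain 5'

# /etc/systemd/system/postgresql-backup-prune@.timer (hypothetical)
[Unit]
Description=Daily pruning of wal-e base backups for %i

[Timer]
OnCalendar=daily
Persistent=true

[Install]
WantedBy=timers.target
It would then be enabled with something like systemctl enable --now postgresql-backup-prune@10-main.timer.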
How about we just send notifications to Slack? https://riemann.io/api/riemann.slack.html Easy enough, and lots of people around to see it.
It has been mentioned before. This won't change the fact that few people have the access needed to solve the issues when they happen, but at least it will alert people earlier, so that whoever can fix it has time to do so. Just, please, make sure it won't trigger an alarm for everything. Too many false positives will ruin the whole project.
Better for everyone to be notified, so that someone will take action or mention it to someone who can, than for it to fail silently because nobody manually checked the monitoring.
p2k16 - full disk again today. Cleaned out /var/lib/postgresql/backups/, then restarted postgres with sudo systemctl restart postgresql@10-main.service as usual. Good to go for a while again:
tingo@p2k16:~$ df -h /
Filesystem Size Used Avail Use% Mounted on
/dev/vda1 20G 8.6G 11G 46% /
Maintenance this evening, cleaned out /var/lib/postgresql/backups/ on p2k16 before it gets full again.
tingo@p2k16:~$ df -h /
Filesystem Size Used Avail Use% Mounted on
/dev/vda1 20G 12G 6.7G 65% /
that should keep it going for a few weeks, I think.
Considering how critical p2k16 is for the entire operation, is it really acceptable to have only 6.7G free and for the disk to fill up every few weeks?
I would say no, and that we should invest in the equipment needed to keep this from happening.
Fully agree, but I don't feel competent to spec it out and set it up. I will, however, happily approve purchases that reduce the risk of downtime.
Thomas
But what are we actually logging so aggressively? This can't be ordinary use of the system (logins/check-outs). Something more must be getting logged to reach many gigabytes in just a couple of weeks? I have never looked at the logs, but I suspect more hardware isn't needed here - rather an optimization of what gets logged, so that what ends up in the logs is actually useful.
It should just be a matter of setting up log rotation, or possibly sending the logs to another server first. Or maybe even better - log in parallel to another server and keep a short rotation locally. @jenschr I'm guessing it could be the web server log.
Before this goes completely off track: what is filling up the disk is database backups - it has nothing to do with logs. As far as I know, p2k16 uses the database in a completely ordinary way - nothing particularly intensive.
Sorry, but then it should surely be possible to send that backup to another server and just overwrite the old backups locally instead?
Of course it's possible - it just requires that people with the right skills (and spare time) sit down and actually do the job. Some of us have tried to build a solution to limit the number of local backups (see #144), without quite getting it over the finish line. I am definitely no postgresql expert, so I have nothing more to contribute there.
Suggestion: take a dump regularly into a directory and you have a single file, typically pg_dump -Fc dbname > dbname.dump. -Fc is --format=custom, which lets you restore individual tables or similar without much hassle. It also gzips the data for you. Running a dump without -Fc works too and makes no difference in this context, but I wanted to mention it. This is typically run once a day, so set up logrotate to simply rotate this file (without compressing it further) as if it were a log file. logrotate doesn't look at the contents anyway, and the configuration is simple.
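A minimal sketch of that suggestion, with a cron entry and a matching logrotate config. The database name p2k16, the dump path, and the schedule are placeholders, not taken from the actual setup:
# /etc/cron.d/p2k16-pgdump (hypothetical): custom-format dump once a night, run as the postgres user
15 3 * * * postgres pg_dump -Fc p2k16 > /var/lib/postgresql/backups/p2k16.dump

# /etc/logrotate.d/p2k16-pgdump (hypothetical): keep the last 7 dumps, no extra compression
/var/lib/postgresql/backups/p2k16.dump {
    daily
    rotate 7
    nocompress
    missingok
    notifempty
}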
it didn't keep as long as I had hoped; today the disk was full again, so p2k16 stopped letting people open the door. Cleaned up, restarted postgresql. Disk space looks better now:
tingo@p2k16:~$ df -h /
Filesystem Size Used Avail Use% Mounted on
/dev/vda1 20G 7.7G 11G 42% /
but it probably won't hold for two weeks.
Sometimes (not very often) the disk drive of the p2k16 server fills up. This is bad, because then the p2k16 web app stops working.
The server p2k16 runs two services related to the PostgreSQL database: postgresql@10-main.service and postgresql-base-backup@10-main.service, plus the timer postgresql-base-backup@10-main.timer. We also have monitoring (via riemann), but nobody watches it on a regular basis (not sure if anyone gets alarm notifications).
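Since nobody watches riemann regularly, even a crude push notification would help. A minimal sketch of a cron-driven disk check, assuming a Slack incoming webhook as discussed above (the webhook URL, the 85% threshold, and the script name are placeholders):
#!/usr/bin/env bash
# check-root-disk.sh (hypothetical): warn in Slack when / is nearly full
set -eu
THRESHOLD=85
# df --output=pcent prints e.g. " 48%"; strip everything but the digits
USED=$(df --output=pcent / | tail -n 1 | tr -dc '0-9')
if [ "$USED" -ge "$THRESHOLD" ]; then
    curl -s -X POST -H 'Content-type: application/json' \
         --data "{\"text\": \"p2k16: / is ${USED}% full - postgres backups probably need cleaning\"}" \
         'https://hooks.slack.com/services/XXX/YYY/ZZZ'   # placeholder webhook URL
fi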