aminvakil opened this issue 2 years ago
cc @BYK @chadwhitacre @hubertdeng123
There are a few interesting and important things:

1. I would not recommend the VACUUM FULL public.nodestore_node; part in any kind of production setup, as it will lock the whole table for an indefinite time. It also requires additional space on disk, so it has a high potential to put the system in an inoperable and hard-to-recover state.
2. The sentry cleanup command that we trigger via the cron does run nodestore.cleanup(): https://github.com/getsentry/sentry/blob/1d24b160a74e10759d20fa323f5251bffee3f7b5/src/sentry/runner/commands/cleanup.py#L260-L268 which is defined at https://github.com/getsentry/sentry/blob/1d24b160a74e10759d20fa323f5251bffee3f7b5/src/sentry/nodestore/django/backend.py#L54-L62

So technically, this should work if you set SENTRY_RETENTION_DAYS to 90 or less, which is the default. The only missing thing I see here is an explicit VACUUM call, but autovacuum should handle that already: https://www.postgresql.org/docs/12/routine-vacuuming.html#AUTOVACUUM
Now we need to figure out which part of the puzzle is broken:

1. Does the cron run the nodestore.cleanup() call successfully?
2. Does that call issue the DELETE commands?
3. Does Postgres run a VACUUM command after a cleanup operation?

I have nothing to add to what BYK said. This should work out of the box assuming you run sentry cleanup.
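One quick way to check (1) and (2), assuming direct psql access to the Sentry database, is to look at the oldest row still in the table; if the cleanup DELETEs are running, nothing should be much older than the configured retention (90 days by default):

-- If cleanup is deleting as expected, the oldest remaining row should be
-- no older than the retention window (90 days is the Sentry default).
-- Note: this may scan the whole table if there is no index on "timestamp".
SELECT min("timestamp") AS oldest_row,
       NOW() - min("timestamp") AS oldest_row_age
FROM public.nodestore_node;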
I lost all event details.
Did you lose all event details because you ran the command without substituting a real value for SENTRY_RETENTION_DAYS, is that what you mean? Does Postgres silently accept INTERVAL 'SENTRY_RETENTION_DAYS' as equivalent to zero, something like that?
This issue has gone three weeks without activity. In another week, I will close it. But! If you comment or otherwise update it, I will reset the clock, and if you label it Status: Backlog or Status: In Progress, I will leave it alone ... forever!

"A weed is but an unloved flower." ― Ella Wheeler Wilcox 🥀
I lost all event details.

Did you lose all event details because you ran the command without substituting a real value for SENTRY_RETENTION_DAYS, is that what you mean? Does Postgres silently accept INTERVAL 'SENTRY_RETENTION_DAYS' as equivalent to zero, something like that?
Sorry for the late response. No, I ran exactly this command: DELETE FROM public.nodestore_node WHERE "timestamp" < NOW() - INTERVAL '1 day';
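For comparison, with a real retention value substituted (90 days being the Sentry default for SENTRY_RETENTION_DAYS), the statement would presumably look more like this:

-- Same cleanup DELETE, but keeping the default 90-day retention window
-- instead of removing everything older than one day.
DELETE FROM public.nodestore_node
WHERE "timestamp" < NOW() - INTERVAL '90 days';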
This issue has gone three weeks without activity. In another week, I will close it. But! If you comment or otherwise update it, I will reset the clock, and if you label it Status: Backlog or Status: In Progress, I will leave it alone ... forever!

"A weed is but an unloved flower." ― Ella Wheeler Wilcox 🥀
We hit this problem as well. nodestore_node was 42GB and the table had never been vacuumed.

We ran the sentry-cleanup service, which did its thing but did not reduce disk space. We ran vacuum analyze, which did not reduce disk space. We ran vacuum full nodestore_node, and that did reduce disk space to 3GB. Postgres also then reported that vacuum had been executed one time:
postgres=# select relname, vacuum_count, n_tup_del, n_live_tup, n_dead_tup
from pg_stat_sys_tables
where relname='pg_toast_20247';
relname | vacuum_count | n_tup_del | n_live_tup | n_dead_tup
----------------+--------------+-----------+------------+------------
pg_toast_20247 | 1 | 62889277 | 1840323 | 0
(1 row)
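As a side note, the number in a TOAST table's name is the OID of its parent table, so the relation above can be mapped back to the table that owns it (presumably nodestore_node here) with a catalog query along these lines:

-- Which table does pg_toast_20247 belong to?
SELECT relname
FROM pg_class
WHERE reltoastrelid = 'pg_toast.pg_toast_20247'::regclass;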
So it seems that (3) is the broken step: the autovacuum daemon inside the postgres container is not actually doing its job. However, this seems to indicate that it is in fact enabled:
postgres=# SHOW autovacuum;
autovacuum
------------
on
(1 row)
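Whether autovacuum has ever actually processed this particular table can also be checked from the statistics views; something like the following, run against the Sentry database, should show the last (auto)vacuum times and counters for nodestore_node:

-- Per-table vacuum activity; last_autovacuum stays NULL if autovacuum
-- has never run on the table.
SELECT relname, last_vacuum, last_autovacuum,
       vacuum_count, autovacuum_count,
       n_live_tup, n_dead_tup
FROM pg_stat_user_tables
WHERE relname = 'nodestore_node';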
Someone hinted on a forum that if Postgres is under load it may not get a chance to autovacuum. I don't think our installation is especially high load, but if it is the case that Postgres never runs autovacuum under nominal load then maybe sentry self-hosted does need a way to explicitly trigger vacuum on some kind of schedule.
While it is true that vacuum full locks the table and uses additional disk space, and temporarily doubling our 42GB table during the vacuum might have caused some issues, it only took a couple of minutes to complete and it significantly reduced disk usage.

If vacuum full were run regularly, the table size should also hover at a much lower level (e.g. 3GB instead of 42GB) and should never get large enough for the temporary disk usage during the vacuum to cause significant issues.
We would be happy to accept a periodic 2 minute complete sentry outage to avoid unbounded disk usage, but it seems that sentry did not suffer a complete outage during the vacuum anyway. We're also willing to accept a partial outage or some amount of events that are missing or missing data.
We could also schedule vacuum to occur during a specified maintenance window to further reduce the impact to production systems.
I also tried the pg_repack option mentioned in the docs, but the current command did not run at all (it failed to install pg_repack inside the container), and an older version of the command that I found in a GitHub issue, which matched our version of postgres, also failed to install inside the container.
So I think a setting to schedule a vacuum full nodestore_node in postgres is the best way forward. It could be disabled by default, but it should be easy to enable via a setting that allows a specific schedule to be set.
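For reference, the statement such a scheduled setting would boil down to is just the following. This is only a sketch of the idea, not an existing self-hosted option, and it takes an exclusive lock on the table for the duration, so it belongs inside a maintenance window:

-- Rewrites the whole table and returns the freed space to the operating
-- system; nodestore_node is locked exclusively while this runs.
VACUUM (FULL, VERBOSE, ANALYZE) public.nodestore_node;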
Reading more about vacuum and vacuum full: the latter rewrites the whole table and returns disk space to the operating system, while the former frees up deleted rows for re-use but does not return disk space to the operating system. So if the sentry cleanup could run vacuum immediately after deleting rows from nodestore_node, and the cleanup frequency were high enough (or configurable) for the volume of new writes vs. deletions of old data, then there should be no more unbounded growth, without requiring a full table lock or any downtime. This should be easier to implement and more reliable than trying to configure the autovacuum daemon to do it for us?
https://www.postgresql.org/docs/12/routine-vacuuming.html#VACUUM-FOR-SPACE-RECOVERY
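Concretely, the non-locking step that a cleanup run could fire right after its DELETEs would be something like the plain variant below; it only marks the dead rows as reusable, so the file does not shrink, but it should stop growing:

-- Plain VACUUM: no exclusive lock, safe while Sentry keeps serving traffic.
-- Dead tuples left behind by the cleanup DELETEs become reusable space.
VACUUM (VERBOSE, ANALYZE) public.nodestore_node;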
Thanks for the deep dive, @mrmachine.
In our environment we have not yet upgraded to 23.4.0, as disk space is almost 80% full and we cannot upgrade our postgres. Therefore we have not yet tested whether upgrading to 14.5 has any effect on this matter.
We also encountered this issue, but it was more severe for us. Our nodestore_node table expanded to 3.6TB, so we had to focus on and resolve this problem. We tried to pull a new instance from backup data to verify how long a VACUUM FULL nodestore_node would take and found that it would take at least 8 hours. This means that if we perform this operation in production, Sentry would be down for 8 hours, which is unacceptable.
However, we understand that nodestore_node is really just abstract KV (key-value) storage, using id as the key to store and query event JSONs, each approximately 6~70 KB (and already compressed). So I naturally thought of using an object storage product like S3, which solves the nodestore_node storage problem very well: use the id as the file path, with each event JSON stored as a file.

Now I have managed to use S3 to store the messages in nodestore_node and have submitted a PR: https://github.com/getsentry/sentry/pull/76250. I hope others facing similar issues can provide some suggestions.
(Screenshots attached: PostgreSQL table size; OSS object files.)
Problem Statement
Over time the nodestore_node table gets bigger and bigger and currently there is no procedure to clean it up.

A forum comment explaining what nodestore_node is, by @untitaker: https://forum.sentry.io/t/postgres-nodestore-node-table-124gb/12753/3
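For anyone checking whether they are affected, the total on-disk footprint of the table (including TOAST data and indexes) can be read with a query like this:

-- Total on-disk size of nodestore_node, including TOAST and indexes.
SELECT pg_size_pretty(pg_total_relation_size('public.nodestore_node'));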
Solution Brainstorm
There was an idea suggested on the forum which worked for me, but I lost all event details. Something like this would work:

DELETE FROM public.nodestore_node WHERE "timestamp" < NOW() - INTERVAL '1 day';

Change 1 day according to your needs.

Maybe put this in a cron container which gets run every night. We should think about its performance issues though; this took a long time to execute on our instance, maybe because it wasn't run before, but I'm not sure.