Closed gabe565 closed 1 year ago
The table is mainly relevant for debugging. There is no problem with pruning it earlier or clearing it entirely. Extra config options always result in more complexity, so I would strongly prefer to change the hardcoded pruning interval instead. Feel free to make a PR.
I was thinking of putting the function that logs the activity entries to the database behind a #[cfg(debug_assertions)] or an env var. Anything against that? If not, I'm comfortable setting up a PR.
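For illustration, the gating idea could be sketched roughly as below. This is only a sketch under assumptions: the variable name LEMMY_LOG_ACTIVITIES and both function names are hypothetical and do not exist in Lemmy.

```rust
use std::env;

/// Hypothetical environment variable name; not a real Lemmy setting.
const ACTIVITY_LOG_ENV: &str = "LEMMY_LOG_ACTIVITIES";

/// Interprets an optional flag value: only "1" or "true"
/// (case-insensitive, surrounding whitespace ignored) enable logging.
fn flag_enabled(value: Option<&str>) -> bool {
    matches!(
        value.map(|v| v.trim().to_ascii_lowercase()).as_deref(),
        Some("1") | Some("true")
    )
}

/// Debug builds always log; release builds log only when the env var opts in.
fn should_log_activity() -> bool {
    cfg!(debug_assertions) || flag_enabled(env::var(ACTIVITY_LOG_ENV).ok().as_deref())
}
```

The database write would then be wrapped in `if should_log_activity() { ... }`. Using `cfg!` rather than the `#[cfg(...)]` attribute keeps the env var path compiled into release builds as well.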
No, it's also useful in production if you want to see the activities that are getting sent. Logs often only contain the activity ID but not the data itself, which needs to be retrieved from this database table. For that purpose, keeping entries for a month would be completely sufficient.
Edit: turns out there was a bug in a previous version of the nginx.config (https://github.com/LemmyNet/lemmy-ansible/issues/106), that was the root cause of my issue. I'm hiding my comments about this as they are not relevant here.
Some consumers (mainly Mastodon AFAICT) poll activities directly. After cleaning up activities older than a week on lemm.ee, I am seeing millions of requests per day for specific activities (by uuid) that result in 404.
Most common user agents requesting missing activities on lemm.ee for the past 24h:
Just mentioning this as a heads up: it seems that clearing the activity table is currently breaking something for Mastodon (but I am not a Mastodon user and have no idea what the full ramifications of this are).
This is after about 3 weeks, with about 800 communities subscribed. This is definitely not sustainable IMO.
@sunaurus Can you find out which specific activities Mastodon is trying to fetch?
It's a mix of different ones. In the past 30 minutes I can see requests for /activities/follow/<uuid>, /activities/like/<uuid>, and /activities/create/<uuid>, each with a bunch of different uuids. It seems quite evenly distributed between different uuids.
Hmm, we could do some cleanup job with a filter, so that those activity types get stored for a longer time. But anyway, we can't keep them forever. I wonder what exactly Mastodon fetches these activities for, and what happens if it can't find them.
Another option would be to regenerate the activities on demand, but that would be quite complicated to implement, with no direct benefit for Lemmy.
Anyway, I still think that we can significantly lower the cleanup interval in Lemmy. We don't require the activities at all after initial send/receive, and if other platforms require them, they have to store them themselves or use a workaround.
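A cleanup job with a per-type filter could look roughly like the sketch below. Everything in it is an assumption for illustration: the ActivityKind enum, the function names, and the 30-day and 7-day windows are not taken from Lemmy's code.

```rust
use std::time::Duration;

const DAY: Duration = Duration::from_secs(24 * 60 * 60);

/// Activity kinds mentioned in this thread as being fetched by other platforms,
/// plus a catch-all for everything else. Hypothetical type, not Lemmy's.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ActivityKind {
    Follow,
    Like,
    Create,
    Other,
}

/// Assumed retention windows; the numbers are illustrative only.
fn retention(kind: ActivityKind) -> Duration {
    match kind {
        // Kinds that remote servers were seen requesting get kept longer.
        ActivityKind::Follow | ActivityKind::Like | ActivityKind::Create => 30 * DAY,
        ActivityKind::Other => 7 * DAY,
    }
}

/// An activity is pruned once its age strictly exceeds its retention window.
fn should_prune(kind: ActivityKind, age: Duration) -> bool {
    age > retention(kind)
}
```

In a real cleanup task the same rule would more likely be expressed as a filter in the delete query, so rows never have to be loaded into memory.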
I have dug a bit deeper into my logs, and I now see that it's not only Mastodon requesting old deleted activities, but also several others, like Friendica, Calckey, Misskey, and even a few such requests from Akkoma and Pleroma user agents.
I think I've been dealing with this issue https://github.com/LemmyNet/lemmy-ansible/issues/106 - it was already fixed in lemmy-ansible, and I have not applied this fix to my own nginx.conf. So most likely my comments are completely irrelevant, sorry. I will confirm it soon, if true, I will remove my comments here!
Edit: yep, confirmed, hiding my comments
Nutomic has stated that adding config options adds complexity, and I understand that, but I feel like there should be at least a little leeway here.
My small, single-user instance has been running for less than 1 month, and the activity table is nearly 10GB in size.
I'm running this on a modest VPS with not that much disk space, and this is chewing through that pretty fast. I didn't realize this is where all my data was going. I'm the only user on my instance, and all things considered I'm not subscribed to very much.
Larger instances might want to, and probably should, keep this table for auditing purposes, but single-user instances like mine have little to no use for this table. Even if the 6 month schedule is dropped down to 1 month, that will still leave me with 10GB of auxiliary data I will never have a use for. And maybe the 1 month being proposed here would be considered too short for a major public instance.
Since there are many different classes of servers running Lemmy instances here, I think a config option to change the cleanup schedule is warranted.
We could clear it manually, and maybe an alternative is to push people to do this manually in the documentation, but that feels a bit hacky to me 😅
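To make the request concrete, an optional setting with a hardcoded fallback might be sketched like this. The struct, the field name activity_retention_days, and the 180-day default (standing in for the 6 months mentioned in this thread) are hypothetical and not part of Lemmy's actual config.

```rust
use std::time::Duration;

/// Stand-in for the hardcoded 6 month interval discussed in this thread.
const DEFAULT_RETENTION_DAYS: u64 = 180;

/// Hypothetical config section; the field name is an assumption.
struct CleanupConfig {
    /// None means "use the built-in default".
    activity_retention_days: Option<u64>,
}

/// Resolves the effective retention window for the activity table.
fn activity_retention(config: &CleanupConfig) -> Duration {
    let days = config
        .activity_retention_days
        .unwrap_or(DEFAULT_RETENTION_DAYS);
    Duration::from_secs(days * 24 * 60 * 60)
}
```

Keeping the field optional means existing config files stay valid, and only admins who care about disk usage would need to set it.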
As of this reply, the table has grown to be 13 GiB. According to my internal metrics, I will run out of disk space on my SSD allocated for Lemmy in 2 days.
Thanks!
For those of us with limited storage, is there a recommended Postgres query we can run to safely clear this table while we wait for this change in the next version?
I clear it with:
TRUNCATE TABLE activity;
then run a
VACUUM;
just to finish clearing space.
I wasn't sure if it's safe to delete everything, so I've just been deleting things older than 30 days that aren't local:
delete from activity where local = false and published < NOW() - INTERVAL '30 days';
Hello, does anyone know if it is still safe to clear the received_activity table with:
TRUNCATE TABLE received_activity;
Mine is 9GB by now 😭
I've been running:
delete from received_activity where published < NOW() - INTERVAL '30 days';
with no issues.
We've lowered this to a week in a recent PR. It'll be in the next release.
Describe the feature request below
Hello! I recently started hosting my own Lemmy instance and love it. I noticed pretty consistent growth of the activity table and was curious how long that would continue to increase. I dug into the code and found that activities are pruned by a scheduled task after 6 months.
Would there be a downside to pruning more recent rows, like after 3 months or 1 month? If not, would it make sense for this interval to be configurable in config.hjson?
Thank you!