LemmyNet / lemmy

🐀 A link aggregator and forum for the fediverse
https://join-lemmy.org
GNU Affero General Public License v3.0

Configurable Activity Cleanup Duration #3103

Closed · gabe565 closed this 1 year ago

gabe565 commented 1 year ago


Hello! I recently started hosting my own Lemmy instance and love it. I noticed pretty consistent growth of the activity table and was curious how long it would keep growing. I dug into the code and found that activities are pruned by a scheduled task after 6 months (sketched below).

Would there be a downside to pruning more recent rows, like after 3 months or 1 month? If not, would it make sense for this interval to be configurable in config.hjson?

Thank you!
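For context, the scheduled cleanup described above amounts to a time-bounded delete on the activity table. A minimal sketch of that kind of query, assuming a `published` timestamp column (the actual Rust task in Lemmy may differ in detail):

```sql
-- Sketch of the scheduled prune: drop federation activities past the
-- retention cutoff. '6 months' is the hardcoded interval mentioned above;
-- a configurable version would substitute a value read from config.hjson.
DELETE FROM activity
WHERE published < NOW() - INTERVAL '6 months';
```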

Nutomic commented 1 year ago

The table is mainly relevant for debugging. There is no problem with pruning it earlier or clearing it entirely. Extra config options always result in more complexity, so I would strongly prefer to change the hardcoded pruning interval instead. Feel free to make a PR.

FerrahWolfeh commented 1 year ago

I was thinking of putting the function that logs activity entries to the database behind a #[cfg(debug_assertions)] or an env var. Anything against that? If not, I'm happy to set up a PR.

Nutomic commented 1 year ago

No, it's also useful in production if you want to see the activities that are being sent. Logs often contain only the activity ID, not the data itself, which has to be retrieved from this database table (an example lookup is sketched below). For that purpose, keeping entries for a month would be completely sufficient.
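To illustrate that debugging flow, a minimal sketch, assuming the table keeps the raw ActivityPub JSON in a `data` column keyed by the activity's `ap_id` URL (both column names are assumptions here, not confirmed by this thread):

```sql
-- Look up the raw ActivityPub JSON for an activity ID seen in the logs.
-- The ap_id and data column names are assumptions about Lemmy's schema.
SELECT data
FROM activity
WHERE ap_id = 'https://example.com/activities/like/some-uuid';
```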

sunaurus commented 1 year ago

Edit: it turns out there was a bug in a previous version of the nginx config (https://github.com/LemmyNet/lemmy-ansible/issues/106) that was the root cause of my issue. I'm hiding my comments about this as they are not relevant here.


Some consumers (mainly Mastodon AFAICT) poll activities directly. After cleaning up activities older than a week on lemm.ee, I am seeing millions of requests per day for specific activities (by uuid) that result in 404.

Most common user agents requesting missing activities on lemm.ee for the past 24h:

[image: user-agent request counts]

Just mentioning this as a heads-up: it seems that clearing the activity table currently breaks something for Mastodon (but I am not a Mastodon user, and have no idea what the full ramifications are).

lflare commented 1 year ago

[image]

This is from roughly three weeks of uptime, with around 800 communities subscribed. This is definitely not sustainable IMO.

Nutomic commented 1 year ago

@sunaurus Can you find out which specific activities Mastodon is trying to fetch?

sunaurus commented 1 year ago

It's a mix of different ones. In the past 30 minutes I can see requests for /activities/follow/<uuid>, /activities/like/<uuid>, and /activities/create/<uuid>, each with a bunch of different uuids. It seems quite evenly distributed between uuids.

Nutomic commented 1 year ago

Hmm, we could add a filter to the cleanup job so that those activity types are stored for a longer time (see the sketch below). But we can't keep them forever anyway. I wonder what exactly Mastodon fetches these activities for, and what happens when it can't find them.

Another option would be to regenerate the activities on demand, but that would be quite complicated to implement, with no direct benefit for Lemmy.

Anyway, I still think we can significantly lower the cleanup interval in Lemmy. We don't need the activities at all after the initial send/receive; if other platforms need them, they have to store them themselves or use a workaround.
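A filtered cleanup of the kind suggested above could look roughly like this sketch. It assumes the activity JSON lives in a jsonb `data` column with a top-level ActivityPub `type` field, and the retention windows here are placeholders:

```sql
-- Keep the activity types that remote servers poll (Follow/Like/Create)
-- for 6 months, and everything else for 1 month. The jsonb `data` column
-- and its top-level `type` field are assumptions about the schema.
DELETE FROM activity
WHERE CASE
        WHEN data->>'type' IN ('Follow', 'Like', 'Create')
          THEN published < NOW() - INTERVAL '6 months'
        ELSE published < NOW() - INTERVAL '1 month'
      END;
```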

sunaurus commented 1 year ago

I have dug a bit deeper into my logs, and I now see that it's not only Mastodon requesting old deleted activities, but several others as well: Friendica, Calckey, and Misskey, plus a few such requests from Akkoma and Pleroma user agents.

sunaurus commented 1 year ago

I think I've been dealing with this issue: https://github.com/LemmyNet/lemmy-ansible/issues/106. It was already fixed in lemmy-ansible, but I had not applied the fix to my own nginx.conf, so most likely my comments are completely irrelevant, sorry. I will confirm soon; if true, I will remove my comments here!

Edit: yep, confirmed, hiding my comments

ubergeek77 commented 1 year ago

Nutomic has stated that adding config options adds complexity, and I understand that, but I feel like there should be at least a little leeway here.

My small, single-user instance has been running for less than 1 month, and the activity table is nearly 10GB in size:

[image: activity table size]

I'm running this on a modest VPS without much disk space, and this table is chewing through it pretty fast. I didn't realize this was where all my storage was going. I'm the only user on my instance, and all things considered, I'm not subscribed to very much.

Larger instances might want to, and probably should, keep this table for auditing purposes, but single-user instances like mine have little to no use for it. Even if the 6-month schedule is dropped to 1 month, that still leaves me with 10GB of auxiliary data I will never have a use for. And maybe the 1 month proposed here would be considered too short for a major public instance.

Since there are many different classes of servers running Lemmy instances, I think a config option to change the cleanup schedule is warranted.

We could clear the table manually, and maybe an alternative is to document that as the recommended approach, but that feels a bit hacky to me 😅
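As an aside, anyone wanting to measure this on their own instance can check the table's footprint with a standard Postgres query (plain Postgres, nothing Lemmy-specific):

```sql
-- On-disk size of the activity table, including indexes and TOAST data.
SELECT pg_size_pretty(pg_total_relation_size('activity'));
```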

lflare commented 1 year ago

> [image]
>
> This is from roughly three weeks of uptime, with around 800 communities subscribed. This is definitely not sustainable IMO.

[image: updated table size]

As of this reply, the table has grown to 13 GiB. According to my internal metrics, I will run out of disk space on the SSD allocated for Lemmy in 2 days.

ubergeek77 commented 1 year ago

Thanks!

For those of us with limited storage, is there a recommended Postgres query we can run to safely clear this table while we wait for this change in the next version?

FerrahWolfeh commented 1 year ago

> Thanks!
>
> For those of us with limited storage, is there a recommended Postgres query we can run to safely clear this table while we wait for this change in the next version?

I clear it with: TRUNCATE TABLE activity;

then run a VACUUM; just to finish clearing space
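For reference, TRUNCATE already returns the table's disk space to the operating system on its own, so the follow-up VACUUM is mostly a belt-and-braces step. Put together:

```sql
-- Remove every row; TRUNCATE releases the table's disk space immediately.
TRUNCATE TABLE activity;
-- Optional tidy-up of the rest of the database; not strictly needed here.
VACUUM;
```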

justyns commented 1 year ago

> Thanks!
>
> For those of us with limited storage, is there a recommended Postgres query we can run to safely clear this table while we wait for this change in the next version?

I wasn't sure if it's safe to delete everything, so I've just been deleting things older than 30 days that aren't local:

delete from activity where local = false and published < NOW() - INTERVAL '30 days';
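One caveat with a plain DELETE: unlike TRUNCATE, it does not shrink the table's files by itself. An ordinary VACUUM only marks the freed space as reusable inside Postgres; handing it back to the operating system takes a VACUUM FULL, which rewrites the table and holds an exclusive lock while it runs:

```sql
-- Delete remote activities older than 30 days, as in the query above ...
DELETE FROM activity
WHERE local = false
  AND published < NOW() - INTERVAL '30 days';
-- ... then rewrite the table to return the freed space to the OS.
-- Caution: VACUUM FULL takes an ACCESS EXCLUSIVE lock on the table.
VACUUM FULL activity;
```
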
skariko commented 10 months ago

> > Thanks! For those of us with limited storage, is there a recommended Postgres query we can run to safely clear this table while we wait for this change in the next version?
>
> I clear it with: TRUNCATE TABLE activity;
>
> then run a VACUUM; just to finish clearing space

Hello, does anyone know if it is still safe to clear the received_activity table with:

TRUNCATE TABLE received_activity; ?

Mine is 9GB by now 😭

HorseJump commented 10 months ago

> > > Thanks! For those of us with limited storage, is there a recommended Postgres query we can run to safely clear this table while we wait for this change in the next version?
> >
> > I clear it with: TRUNCATE TABLE activity; then run a VACUUM; just to finish clearing space
>
> Hello, does anyone know if it is still safe to clear the received_activity table with:
>
> TRUNCATE TABLE received_activity; ?
>
> Mine is 9GB by now 😭

I've been running delete from received_activity where published < NOW() - INTERVAL '30 days'; with no issues.

dessalines commented 10 months ago

We've lowered this to a week in a recent PR. It'll be in the next release.