philmichel opened 10 months ago
Database seems completely fine and does not show high usage at all:
2023-12-27 03:27:26.173 UTC [1] LOG: starting PostgreSQL 15.5 on x86_64-pc-linux-musl, compiled by gcc (Alpine 13.2.1_git20231014) 13.2.1 20231014, 64-bit
2023-12-27 03:27:26.173 UTC [1] LOG: listening on IPv4 address "0.0.0.0", port 5432
2023-12-27 03:27:26.173 UTC [1] LOG: listening on IPv6 address "::", port 5432
2023-12-27 03:27:26.177 UTC [1] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2023-12-27 03:27:26.182 UTC [24] LOG: database system was shut down at 2023-12-27 03:25:53 UTC
2023-12-27 03:27:26.200 UTC [1] LOG: database system is ready to accept connections
I have been seeing roughly the same thing; I know there are a couple of posts around CPU usage. My instance is just myself and a handful of subs. I only have one lemmy_server, but it's consuming 367% CPU. I tuned my pool_size as well, but the load doesn't seem to be coming from Postgres either.
I'm having the same issue. Seeing 50-97% CPU usage.
Seeing the same here. I have 1 CPU allocated and can visually see when I updated to 0.19.0.
@eric7456: I think you may need more than 1 CPU now in Lemmy 0.19.
I have 3 CPUs and they are all pretty busy most of the time. Before 0.19 the lemmy software used very little CPU, and most of it was on the database side. But now the lemmy process is quite CPU-hungry.
It would of course be great if it were possible to reduce the CPU usage in 0.19, but so much was rewritten that I'm not sure it's possible. Guess we will see.
@linux-cultist It will peg as many cores as you throw at it to 100%, no matter how little activity is taking place on the server. Don’t think that’s normal.
It would be helpful if you could split Lemmy into multiple processes for different tasks and see which one is causing the high CPU usage. Use the following in docker-compose.yml; what's important are the entrypoints:
  lemmy:
    image: dessalines/lemmy:0.19.1
    hostname: lemmy
    restart: always
    logging: *default-logging
    environment:
      - RUST_LOG="warn"
    volumes:
      - ./lemmy.hjson:/config/config.hjson:Z
    depends_on:
      - postgres
      - pictrs
    entrypoint: lemmy_server --disable-activity-sending --disable-scheduled-tasks

  lemmy-federate:
    image: dessalines/lemmy:0.19.1
    hostname: lemmy
    restart: always
    logging: *default-logging
    environment:
      - RUST_LOG="warn"
    volumes:
      - ./lemmy.hjson:/config/config.hjson:Z
    depends_on:
      - postgres
      - pictrs
    entrypoint: lemmy_server --disable-http-server --disable-scheduled-tasks

  lemmy-tasks:
    image: dessalines/lemmy:0.19.1
    hostname: lemmy
    restart: always
    logging: *default-logging
    environment:
      - RUST_LOG="warn"
    volumes:
      - ./lemmy.hjson:/config/config.hjson:Z
    depends_on:
      - postgres
      - pictrs
    entrypoint: lemmy_server --disable-activity-sending --disable-http-server
@Nutomic: I don't think I have the bug myself, only around 10% usage sometimes on one of the lemmy processes:
The rest of the CPU usage is postgres. Maybe @philmichel gets a better picture.
@Nutomic
Here's my current config
Bumping to 4 cores + 4GB ram
Seeing the same thing as Eric, in that the federation is what is spiking:
Indeed, I'll turn off the server for now until there's an update.
@phiresky Do you have any idea how to reduce the CPU usage for activity sending?
It would be good to get some info on the state of the federation queue while this is happening. I don't think anyone has linked an instance here yet? Check https://phiresky.github.io/lemmy-federation-state/.
It's expected right now that when the sending is down for a while, it will take up a lot of resources while the queue is catching up (since it will send out all pending events since the queue went down). That's what happened on lemm.ee. If it's consistently that high even when the queue is caught up, it might be some bug.
Please enable info logging on the lemmy_federate process with RUST_LOG="warn,lemmy_federate=info" and provide the output (it should contain lines like Federating to X instances (x dead, x lagging)).
Please provide a link to a public instance with the issue or attach screenshots of the https://phiresky.github.io/lemmy-federation-state/site to see what state the queues are in.
Please also follow these steps to get info about database performance:

enable pg_stat_statements and auto_explain by making sure these lines exist in the postgresql config (customPostgresql.conf):

shared_preload_libraries=pg_stat_statements,auto_explain
pg_stat_statements.track = all
auto_explain.log_min_duration=5000ms

open a psql repl by running docker compose exec -it db -u postgres psql -d lemmy and reset the stats by calling create extension pg_stat_statements; select pg_stat_statements_reset()

wait an hour

post the outputs of

docker compose exec -it db -u postgres psql -qtAX -d lemmy -c 'select json_agg(a) from (select * from pg_stat_statements order by total_exec_time desc limit 10) a;' > total_exec_time.json

and

docker compose exec -it db -u postgres psql -qtAX -d lemmy -c 'select json_agg(a) from (select * from pg_stat_statements order by mean_exec_time desc limit 10) a;' > mean_exec_time.json

And finally, please provide statistics about the db pool by sending the output of the prometheus endpoint with curl lemmy-container:10002/metrics
Also, have any of you cleared your sent_activities tables after upgrading? That might also cause this to happen.
@phiresky what do you think about adding some (maybe configurable) throttling to workers, so for example each worker would pause for X milliseconds between handling each activity? Just to prevent self-DoSing when catching up with a backlog.
@sunaurus It's an idea, but what would you use as a heuristic? You do want it to catch up as quickly as possible, and you don't want it to start lagging when you generally have many activities per second and your server can handle it.
Maybe it would be possible to only do it when "catching up" is detected (last_successful_activity_send < 1 hour ago) or something, but I don't know.
I think having it as a tuneable parameter with some conservative default would ensure that smaller servers don't get killed by this, and bigger servers have a chance to tune it for their needs. I am assuming that if the trade-off is "catching up as quick as possible and making the instance super slow" vs "catching up a bit slower, but keeping the instance usable", most admins will prefer the latter.
On lemm.ee, working through the backlog at full speed had the Lemmy process completely using up all the vCPUs I threw at it, until I ran into DB bottlenecks. Under normal conditions (after it caught up), it's now averaging around 5% of 1 vCPU. I would probably aim for something like 50% CPU max, so assuming CPU use scales linearly with amount of activities processed per second, I would just multiply my average activities per second by 10 and use that as a limit for how many activities each worker should process in a second, under all circumstances (so even when already caught up). That would still ensure real-time federation under average conditions.
I THINK a maximum of "1 activity per second" (per worker) might be a nice default for most instances, and should work even on the smallest 1vCPU servers, but this would need some testing.
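For illustration, a minimal sketch of what such a fixed per-worker cap could look like, assuming a tokio-based worker loop. The names (run_worker, send_next_activity) and the max_per_second parameter are hypothetical, not Lemmy's actual code:

use std::time::Duration;
use tokio::time::{interval, MissedTickBehavior};

// Process at most `max_per_second` activities per second in this worker.
async fn run_worker(max_per_second: u32) {
    // One tick per allowed activity; skipping missed ticks means a stalled worker
    // does not burst when it resumes, so the cap holds even while catching up.
    let mut ticker = interval(Duration::from_secs(1) / max_per_second);
    ticker.set_missed_tick_behavior(MissedTickBehavior::Skip);

    loop {
        ticker.tick().await;
        if let Err(e) = send_next_activity().await {
            eprintln!("sending activity failed: {e}");
        }
    }
}

// Placeholder for the actual delivery of the next queued activity.
async fn send_next_activity() -> Result<(), Box<dyn std::error::Error>> {
    Ok(())
}

With max_per_second = 1 this would match the proposed default; larger instances could raise it.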
"1 activity per second"
That seems far too low, tbh. When I implemented the queue, my reference was 50-100 activities/second (with horizontal scaling though). Can you say how many activities you have in your db in the past 24h?
and should work even on the smallest 1vCPU servers
I mean, sure, it will "work" for tiny servers, but it also means any instance with anywhere close to 1 activity per second will never catch up and will even start lagging behind by days automatically.
What you say seems reasonable though, maybe as a dynamic limit: set the speed to 10x the actual activity speed in the DB. The question is just where to get that activity-in-the-DB stat from, because if the instance was just down it's going to be wrong.
I just checked, I have ~40k rows in sent_activity for most 24h periods in the past week. This is roughly 0.5 per second, so actually, even limiting to 1 per second, lemm.ee would catch up with a backlog at the moment (but not super fast).
But making it dynamic seems like a good idea. I see your point about the instance being down making the average difficult to calculate, but maybe we can instead approximate the average by tracking changes in the size of the backlog.
We could use 1 activity per second as a baseline limit (after each activity, if a full second has not passed from the start of the iteration, the worker will sleep until the second has passed before starting the next iteration), and then dynamically change this limit in response to the size of the backlog from minute to minute. As long as the backlog for the worker is shrinking "enough" each minute, we can keep the throttling level constant, and if the backlog is not shrinking enough, we can incrementally reduce the throttling until it starts shrinking again. If we respond to the shrinking/growing gradually (maybe a maximum change of +- 50ms per minute to the delay between activities), then we can quite smoothly handle spikes in activity as well.
I think this might actually even be relatively simple to implement; it seems like the only complicated part is deciding on what "shrinking enough" means. Maybe something like 0.1% of the backlog per minute?
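A rough sketch of that adaptive scheme, under the assumptions above (1s baseline delay, at most 50ms adjustment per minute, 0.1% of the backlog per minute as "shrinking enough"). The type and field names are made up for illustration and not part of Lemmy:

use std::time::Duration;
use tokio::time::{sleep, Instant};

struct AdaptiveThrottle {
    delay: Duration,      // pause inserted after each activity
    last_adjust: Instant, // when the delay was last re-evaluated
    last_backlog: u64,    // backlog size at that point
}

impl AdaptiveThrottle {
    fn new(initial_backlog: u64) -> Self {
        Self {
            delay: Duration::from_secs(1), // baseline: at most ~1 activity per second
            last_adjust: Instant::now(),
            last_backlog: initial_backlog,
        }
    }

    /// Call after each processed activity with the current backlog size.
    async fn pause(&mut self, backlog: u64) {
        if self.last_adjust.elapsed() >= Duration::from_secs(60) {
            // "Shrinking enough" = at least 0.1% of the previous backlog per minute.
            let required_drop = (self.last_backlog / 1000).max(1);
            let shrinking_enough = backlog + required_drop <= self.last_backlog;

            if backlog == 0 {
                // Caught up: drift back toward the 1s baseline.
                self.delay =
                    (self.delay + Duration::from_millis(50)).min(Duration::from_secs(1));
            } else if !shrinking_enough {
                // Backlog not shrinking enough: throttle less (max 50ms change per minute).
                self.delay = self.delay.saturating_sub(Duration::from_millis(50));
            }
            // Otherwise keep the current throttling level.

            self.last_adjust = Instant::now();
            self.last_backlog = backlog;
        }
        sleep(self.delay).await;
    }
}

A worker would then call something like throttle.pause(current_backlog).await after each activity it sends.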
I think a better solution would be to assign priorities to tasks, so that incoming HTTP requests are always handled before sending out activities. Unfortunately it seems that tokio doesn't have any such functionality.
Alternatively we could use the RuntimeMetrics from tokio and check the number of available workers. Something like: if active_tasks_count() < num_workers(), send out another activity. Though I don't see a way to account for idle tasks.
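Roughly what that check could look like; active_tasks_count() and num_workers() are the tokio RuntimeMetrics methods mentioned above, and depending on the tokio version some of these metrics may require the tokio_unstable cfg, so treat this as an assumption rather than a drop-in:

use tokio::runtime::Handle;
use tokio::time::{sleep, Duration};

// Wait until the runtime has fewer live tasks than worker threads before
// sending out another activity.
async fn wait_for_free_worker() {
    let metrics = Handle::current().metrics();
    while metrics.active_tasks_count() >= metrics.num_workers() {
        sleep(Duration::from_millis(100)).await;
    }
}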
check the number of available workers
Interesting idea. What might also work is to use the DB pool stats and start throttling if the pool has no available connections (https://docs.rs/deadpool/latest/deadpool/struct.Status.html), since sunaurus's primary reason for this was that the database was overloaded?
Either will have a weird interaction with horizontal scaling though, since then the pools (both tokio and pg) are separate.
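A sketch of the pool-pressure variant, assuming access to the deadpool pool used for federation; the function name and the 50ms back-off are illustrative, and Status is the struct from the docs linked above:

use deadpool::managed::{Manager, Pool};
use tokio::time::{sleep, Duration};

// Back off while the federation DB pool has no idle connections, i.e. while the
// database side looks saturated. Status also exposes size and max_size.
async fn throttle_on_db_pressure<M: Manager>(pool: &Pool<M>) {
    while pool.status().available == 0 {
        sleep(Duration::from_millis(50)).await;
    }
}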
Is prioritizing straightforward to add? It's probably a good idea for those running everything in a single process, but I still think throttling is necessary in general as well
Can you maybe add detail of what you think exactly the issue was for you? Was it that the rest of the site wasn't able to do any DB queries? (Don't know how that can happen since you have separate pools for the federation process and the other one)
Or that your CPU was at 100% and the normal process didn't get any CPU time?
Or that your network was fully used?
Please enable info logging on the lemmy_federate process with RUST_LOG="warn,lemmy_federate=info" and provide output (it should contain lines like Federating to X instances (x dead, x lagging)
lemmy-federate_1 | Lemmy v0.19.1
lemmy-federate_1 | Federation enabled, host is 014450.xyz
lemmy-federate_1 | Starting HTTP server at 0.0.0.0:8536
lemmy-federate_1 | 2024-01-06T17:58:09.959279Z INFO lemmy_federate: Federating to 1553/2231 instances (678 dead, 0 disallowed)
lemmy-federate_1 | 2024-01-06T17:58:24.978823Z WARN lemmy_server: Received terminate, shutting down gracefully...
lemmy-federate_1 | 2024-01-06T17:58:24.980630Z WARN lemmy_federate: Waiting for 1553 workers (30.00s max)
lemmy-federate_1 | 2024-01-06T17:58:25.833808Z INFO lemmy_federate: Federation state as of 2024-01-06T17:58:25+00:00:
lemmy-federate_1 | 2024-01-06T17:58:25.833933Z INFO lemmy_federate: 1553 others up to date. 0 instances behind.
lemmy-federate_1 | Lemmy v0.19.1
lemmy-federate_1 | Federation enabled, host is 014450.xyz
lemmy-federate_1 | Starting HTTP server at 0.0.0.0:8536
lemmy-federate_1 | 2024-01-06T18:00:19.196179Z INFO lemmy_federate: Federating to 1553/2231 instances (678 dead, 0 disallowed)
Please provide a link to a public instance with the issue or attach screenshots of the https://phiresky.github.io/lemmy-federation-state/site to see what state the queues are in.
Please also follow these steps to get info about database performance:
enable pg_stat_statements and auto_explain by making sure these lines exist in the postgresql config (customPostgresql.conf):

shared_preload_libraries=pg_stat_statements,auto_explain
pg_stat_statements.track = all
auto_explain.log_min_duration=5000ms

open a psql repl by running docker compose exec -it db -u postgres psql -d lemmy and reset the stats by calling create extension pg_stat_statements; select pg_stat_statements_reset()

wait an hour

post the outputs of docker compose exec -it db -u postgres psql -qtAX -d lemmy -c 'select json_agg(a) from (select * from pg_stat_statements order by total_exec_time desc limit 10) a;' > total_exec_time.json

and

docker compose exec -it db -u postgres psql -qtAX -d lemmy -c 'select json_agg(a) from (select * from pg_stat_statements order by mean_exec_time desc limit 10) a;' > mean_exec_time.json
And finally, please provide statistics about the db pool by sending the output of the prometheus endpoint with curl lemmy-container:10002/metrics
Can you maybe add detail of what you think exactly the issue was for you? Was it that the rest of the site wasn't able to do any DB queries? (Don't know how that can happen since you have separate pools for the federation process and the other one)
Or that your CPU was at 100% and the normal process didn't get any CPU time?
Or that your network was fully used?
Only issue is the CPU stuck at 100%. The rest of the site works fine, maybe a bit slower than typical. No extra network usage that I could tell.
Can you maybe add detail of what you think exactly the issue was for you? Was it that the rest of the site wasn't able to do any DB queries?
This was the major problem on lemm.ee for sure - the CPU use of the worker was not a big deal, as I just moved the worker(s) onto separate hardware. Having said that, I can definitely see it being a problem for other instances that run on a single server, ESPECIALLY if the DB is also on that same server, as it is for many (most?) instances.
On lemm.ee, our database was just not ready for the extra load that came from full speed catch-up federation (as it resulted in tens of thousands of extra DB queries per second, which were quite cheap individually, but combined did have a significant effect).
I've now upgraded the lemm.ee database with a lot of extra headroom, so it's probably not a big issue going forward for lemm.ee, but I can definitely see some other instances running into the same issue.
database was just not ready for the extra load that came from full speed catch-up federation
That's interesting, because it also means that just reducing the pool size of the fed queue should fix it, since that way you can limit it to e.g. 10 concurrent queries per second.
Here are postgres stats from lemmy.pt:
mean_exec_time.json total_exec_time.json
One problem is with get_community_follower_inboxes(). Also updating community aggregates is extremely slow for some reason. Lots of time spent on inserting received activities and votes which is probably unavoidable.
The total_exec_time.json by @eric7456 also shows different community aggregate updates using up an extreme amount of time.
One problem is with get_community_follower_inboxes().
Do you mean get_instance_followed_community_inboxes? I don't think get_community_follower_inboxes is used anymore.
@Nutomic: Also updating community aggregates is extremely slow for some reason.
I experienced this on lemm.ee as well. The same is also true for updating site aggregates. This was the same issue: https://github.com/LemmyNet/lemmy/issues/4306#issuecomment-1878086069
For now I solved it on lemm.ee by just throwing more hardware at the problem, but one thing I mentioned in the above thread:
It might also be a good idea to try and optimize the calculation on Lemmy side, for example through some denormalization (like storing last_activity timestamps for each user directly, rather than calculating it every 15 minutes for all users).
I actually did a quick hacky experiment with such denormalization on lemm.ee on my old hardware, and saw significant improvement. I added a last_activity_timestamp column directly to the person table, and changed the site aggregates calculation to not have any joins, but simply count all rows in the person table where the last_activity_timestamp was within the time window we were aggregating for. The speed-up I got from that was 1-2 orders of magnitude, depending on the time window used.
I assume we could get a similar speed-up for community aggregates as well by creating some new community_last_activity table, which just stores community_id, person_id, last_activity_timestamp.
Basically the trade-off here would be that we need to start updating the last activity timestamps whenever somebody adds a vote/comment/post, but this seems relatively cheap compared to the huge joins that we are doing right now, which postgres seems to struggle with.
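To make the idea concrete, a hypothetical sketch of what the denormalized site-aggregates count could look like using diesel's raw SQL interface. The person.last_activity_timestamp column is the assumed addition from the experiment above (not part of Lemmy's actual schema), the function name is made up, and a synchronous PgConnection is used here for simplicity even though Lemmy itself uses async connections:

use diesel::prelude::*;
use diesel::sql_types::BigInt;

#[derive(QueryableByName)]
struct ActiveCount {
    #[diesel(sql_type = BigInt)]
    count: i64,
}

// Count users active in the last month via the denormalized timestamp,
// instead of joining posts/comments/likes on every aggregation run.
fn active_users_last_month(conn: &mut PgConnection) -> QueryResult<i64> {
    let row = diesel::sql_query(
        "SELECT count(*) AS count FROM person \
         WHERE last_activity_timestamp > now() - interval '1 month'",
    )
    .get_result::<ActiveCount>(conn)?;
    Ok(row.count)
}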
@philmichel You're right, it's get_instance_followed_community_inboxes(). However, get_community_follower_inboxes() is also in use, but not listed as a slow query.
@sunaurus Would be good if you can make a PR for that.
Requirements
Summary
Since upgrading to Lemmy 0.19 (and subsequently 0.19.1 - the issue is the same across those versions) via docker-compose, my small single-user instance shows very high load. Many lemmy_server processes are spawned, each consuming one full core, slowing the system to a crawl.

Steps to Reproduce
htop output:
Technical Details
Lemmy v0.19.1
Version
BE 0.19.1
Lemmy Instance URL
No response