philmichel opened 10 months ago
Database seems completely fine and does not show high usage at all:
2023-12-27 03:27:26.173 UTC [1] LOG: starting PostgreSQL 15.5 on x86_64-pc-linux-musl, compiled by gcc (Alpine 13.2.1_git20231014) 13.2.1 20231014, 64-bit
2023-12-27 03:27:26.173 UTC [1] LOG: listening on IPv4 address "0.0.0.0", port 5432
2023-12-27 03:27:26.173 UTC [1] LOG: listening on IPv6 address "::", port 5432
2023-12-27 03:27:26.177 UTC [1] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2023-12-27 03:27:26.182 UTC [24] LOG: database system was shut down at 2023-12-27 03:25:53 UTC
2023-12-27 03:27:26.200 UTC [1] LOG: database system is ready to accept connections
I have been seeing roughly the same thing; I know there are a couple of posts around CPU usage. My instance is just myself and a handful of subs. I only have one lemmy_server, but it's consuming 367% CPU. I tuned my pool_size as well, but the load doesn't seem to be coming from Postgres either.
I'm having the same issue. Seeing 50-97% CPU usage.
Seeing the same here. I have 1 CPU allocated and can visually see when I updated to 0.19.0.
@eric7456: I think you may need more than 1 CPU now in Lemmy 0.19.
I have 3 CPUs and they are all pretty busy most of the time. Before 0.19 the lemmy software used very little CPU, and most of it was on the database side. But now the lemmy process is quite CPU-hungry.
It would of course be great if it were possible to reduce the CPU usage in 0.19, but so much was rewritten that I'm not sure it's possible. Guess we will see.
@linux-cultist It will peg as many cores as you throw at it to 100%, no matter how little activity is taking place on the server. Don’t think that’s normal.
It would be helpful if you could split Lemmy into multiple processes for different tasks and see which one is causing the high CPU usage. Use the following in docker-compose.yml; what's important are the entrypoints:
  lemmy:
    image: dessalines/lemmy:0.19.1
    hostname: lemmy
    restart: always
    logging: *default-logging
    environment:
      - RUST_LOG="warn"
    volumes:
      - ./lemmy.hjson:/config/config.hjson:Z
    depends_on:
      - postgres
      - pictrs
    entrypoint: lemmy_server --disable-activity-sending --disable-scheduled-tasks

  lemmy-federate:
    image: dessalines/lemmy:0.19.1
    hostname: lemmy
    restart: always
    logging: *default-logging
    environment:
      - RUST_LOG="warn"
    volumes:
      - ./lemmy.hjson:/config/config.hjson:Z
    depends_on:
      - postgres
      - pictrs
    entrypoint: lemmy_server --disable-http-server --disable-scheduled-tasks

  lemmy-tasks:
    image: dessalines/lemmy:0.19.1
    hostname: lemmy
    restart: always
    logging: *default-logging
    environment:
      - RUST_LOG="warn"
    volumes:
      - ./lemmy.hjson:/config/config.hjson:Z
    depends_on:
      - postgres
      - pictrs
    entrypoint: lemmy_server --disable-activity-sending --disable-http-server
@Nutomic: I don't think I have the bug myself, only around 10% usage sometimes on one of the lemmy processes:
The rest of the CPU usage is postgres. Maybe @philmichel gets a better picture.
@Nutomic
Here's my current config
Bumping to 4 cores + 4GB ram
Seeing the same thing as Eric, in that the federation is what is spiking:
Indeed, I'll turn off the server for now until there's an update.
@phiresky Do you have any idea how to reduce the CPU usage for activity sending?
It would be good to get some info on the state of the federation queue while this is happening. I don't think anyone has linked an instance here yet? Check https://phiresky.github.io/lemmy-federation-state/.
It's expected right now that when the sending is down for a while, it will take up a lot of resources while the queue is catching up (since it will send out all pending events since the queue went down). That's what happened on lemm.ee. If it's consistently that high even when the queue is caught up, it might be some bug.
Please enable info logging on the lemmy_federate process with RUST_LOG="warn,lemmy_federate=info" and provide the output (it should contain lines like Federating to X instances (x dead, x lagging)).
Please provide a link to a public instance with the issue or attach screenshots of the https://phiresky.github.io/lemmy-federation-state/site to see what state the queues are in.
Please also follow these steps to get info about database performance:

enable pg_stat_statements and auto_explain by making sure these lines exist in the postgresql config (customPostgresql.conf):

shared_preload_libraries=pg_stat_statements,auto_explain
pg_stat_statements.track = all
auto_explain.log_min_duration=5000ms

open a psql repl by running docker compose exec -it db -u postgres psql -d lemmy and reset the stats by calling create extension pg_stat_statements; select pg_stat_statements_reset()

wait an hour

post the outputs of

docker compose exec -it db -u postgres psql -qtAX -d lemmy -c 'select json_agg(a) from (select * from pg_stat_statements order by total_exec_time desc limit 10) a;' > total_exec_time.json

and

docker compose exec -it db -u postgres psql -qtAX -d lemmy -c 'select json_agg(a) from (select * from pg_stat_statements order by mean_exec_time desc limit 10) a;' > mean_exec_time.json

And finally, please provide statistics about the db pool by sending the output of the prometheus endpoint with curl lemmy-container:10002/metrics
Also, have any of you cleared your sent_activities tables after upgrading? That might also cause this to happen.
@phiresky what do you think about adding some (maybe configurable) throttling to workers, so for example each worker would pause for X milliseconds between handling each activity? Just to prevent self-DoSing when catching up with a backlog.
@sunaurus It's an idea, but what would you use as a heuristic? You do want it to catch up as quickly as possible, and you don't want it to start lagging when you generally have many activities per second and your server can handle it.
Maybe it would be possible to only do it when "catching up" is detected (last_successful_activity_send < 1 hour ago) or something, but I don't know.
I think having it as a tuneable parameter with some conservative default would ensure that smaller servers don't get killed by this, and bigger servers have a chance to tune it for their needs. I am assuming that if the trade-off is "catching up as quick as possible and making the instance super slow" vs "catching up a bit slower, but keeping the instance usable", most admins will prefer the latter.
On lemm.ee, working through the backlog at full speed had the Lemmy process completely using up all the vCPUs I threw at it, until I ran into DB bottlenecks. Under normal conditions (after it caught up), it's now averaging around 5% of 1 vCPU. I would probably aim for something like 50% CPU max, so assuming CPU use scales linearly with amount of activities processed per second, I would just multiply my average activities per second by 10 and use that as a limit for how many activities each worker should process in a second, under all circumstances (so even when already caught up). That would still ensure real-time federation under average conditions.
I THINK a maximum of "1 activity per second" (per worker) might be a nice default for most instances, and should work even on the smallest 1vCPU servers, but this would need some testing.
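For illustration, a minimal sketch of what such a fixed per-worker cap could look like, assuming a tokio-based worker loop. The names (run_worker, send_next_activity) and the max_per_second parameter are hypothetical, not Lemmy's actual code:

use std::time::Duration;
use tokio::time::{interval, MissedTickBehavior};

// Process at most `max_per_second` activities per second in this worker.
async fn run_worker(max_per_second: u32) {
    // One tick per allowed activity; skipping missed ticks means a stalled worker
    // does not burst when it resumes, so the cap holds even while catching up.
    let mut ticker = interval(Duration::from_secs(1) / max_per_second);
    ticker.set_missed_tick_behavior(MissedTickBehavior::Skip);

    loop {
        ticker.tick().await;
        if let Err(e) = send_next_activity().await {
            eprintln!("sending activity failed: {e}");
        }
    }
}

// Placeholder for the actual delivery of the next queued activity.
async fn send_next_activity() -> Result<(), Box<dyn std::error::Error>> {
    Ok(())
}

With max_per_second = 1 this would match the proposed default; larger instances could raise it.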
"1 activity per second"
That seems far too low, tbh. When I implemented the queue, my reference was 50-100 activities/second (with horizontal scaling though). Can you say how many activities you have in your db in the past 24h?
and should work even on the smallest 1vCPU servers
I mean, sure, it will "work" for tiny servers, but it also means any instance with anywhere close to 1 activity per second will never catch up and will even start lagging behind by days automatically.
What you say seems reasonable though, maybe as a dynamic limit: set the speed to 10x the actual activity speed in the DB. The question is just where to get that activity-in-the-DB stat from, because if the instance was just down it's going to be wrong.
I just checked, I have ~40k rows in sent_activity for most 24h periods in the past week. This is roughly 0.5 per second, so actually, even limiting to 1 per second, lemm.ee would catch up with a backlog at the moment (but not super fast).
But making it dynamic seems like a good idea. I see your point about the instance being down making the average difficult to calculate, but maybe we can instead approximate the average by tracking changes in the size of the backlog.
We could use 1 activity per second as a baseline limit (after each activity, if a full second has not passed from the start of the iteration, the worker will sleep until the second has passed before starting the next iteration), and then dynamically change this limit in response to the size of the backlog from minute to minute. As long as the backlog for the worker is shrinking "enough" each minute, we can keep the throttling level constant, and if the backlog is not shrinking enough, we can incrementally reduce the throttling until it starts shrinking again. If we respond to the shrinking/growing gradually (maybe a maximum change of +- 50ms per minute to the delay between activities), then we can quite smoothly handle spikes in activity as well.
I think this might actually even be relatively simple to implement; it seems like the only complicated part is deciding on what "shrinking enough" means. Maybe something like 0.1% of the backlog per minute?
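A rough sketch of that adaptive scheme, under the assumptions above (1s baseline delay, at most 50ms adjustment per minute, 0.1% of the backlog per minute as "shrinking enough"). The type and field names are made up for illustration and not part of Lemmy:

use std::time::Duration;
use tokio::time::{sleep, Instant};

struct AdaptiveThrottle {
    delay: Duration,      // pause inserted after each activity
    last_adjust: Instant, // when the delay was last re-evaluated
    last_backlog: u64,    // backlog size at that point
}

impl AdaptiveThrottle {
    fn new(initial_backlog: u64) -> Self {
        Self {
            delay: Duration::from_secs(1), // baseline: at most ~1 activity per second
            last_adjust: Instant::now(),
            last_backlog: initial_backlog,
        }
    }

    /// Call after each processed activity with the current backlog size.
    async fn pause(&mut self, backlog: u64) {
        if self.last_adjust.elapsed() >= Duration::from_secs(60) {
            // "Shrinking enough" = at least 0.1% of the previous backlog per minute.
            let required_drop = (self.last_backlog / 1000).max(1);
            let shrinking_enough = backlog + required_drop <= self.last_backlog;

            if backlog == 0 {
                // Caught up: drift back toward the 1s baseline.
                self.delay =
                    (self.delay + Duration::from_millis(50)).min(Duration::from_secs(1));
            } else if !shrinking_enough {
                // Backlog not shrinking enough: throttle less (max 50ms change per minute).
                self.delay = self.delay.saturating_sub(Duration::from_millis(50));
            }
            // Otherwise keep the current throttling level.

            self.last_adjust = Instant::now();
            self.last_backlog = backlog;
        }
        sleep(self.delay).await;
    }
}

A worker would then call something like throttle.pause(current_backlog).await after each activity it sends.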
I think a better solution would be to assign priorities to tasks, so that incoming HTTP requests are always handled before sending out activities. Unfortunately it seems that tokio doesn't have any such functionality.
Alternatively we could use the RuntimeMetrics from tokio and check the number of available workers. Something like: if active_tasks_count() < num_workers(), send out another activity. Though I don't see a way to account for idle tasks.
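Roughly what that check could look like; active_tasks_count() and num_workers() are the tokio RuntimeMetrics methods mentioned above, and depending on the tokio version some of these metrics may require the tokio_unstable cfg, so treat this as an assumption rather than a drop-in:

use tokio::runtime::Handle;
use tokio::time::{sleep, Duration};

// Wait until the runtime has fewer live tasks than worker threads before
// sending out another activity.
async fn wait_for_free_worker() {
    let metrics = Handle::current().metrics();
    while metrics.active_tasks_count() >= metrics.num_workers() {
        sleep(Duration::from_millis(100)).await;
    }
}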
check the number of available workers
Interesting idea. What might also work is to use the DB pool stats and start throttling if the pool has no available connections (https://docs.rs/deadpool/latest/deadpool/struct.Status.html), since sunaurus's primary reason for this was that the database was overloaded?
Either will have a weird interaction with horizontal scaling though, since then the pools (both tokio and pg) are separate.
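A sketch of the pool-pressure variant, assuming access to the deadpool pool used for federation; the function name and the 50ms back-off are illustrative, and Status is the struct from the docs linked above:

use deadpool::managed::{Manager, Pool};
use tokio::time::{sleep, Duration};

// Back off while the federation DB pool has no idle connections, i.e. while the
// database side looks saturated. Status also exposes size and max_size.
async fn throttle_on_db_pressure<M: Manager>(pool: &Pool<M>) {
    while pool.status().available == 0 {
        sleep(Duration::from_millis(50)).await;
    }
}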
Is prioritizing straightforward to add? It's probably a good idea for those running everything in a single process, but I still think throttling is necessary in general as well
Can you maybe add detail of what you think exactly the issue was for you? Was it that the rest of the site wasn't able to do any DB queries? (Don't know how that can happen since you have separate pools for the federation process and the other one)
Or that your CPU was at 100% and the normal process didn't get any CPU time?
Or that your network was fully used?
Please enable info logging on the lemmy_federate process with RUST_LOG="warn,lemmy_federate=info" and provide output (it should contain lines like Federating to X instances (x dead, x lagging)
lemmy-federate_1 | Lemmy v0.19.1
lemmy-federate_1 | Federation enabled, host is 014450.xyz
lemmy-federate_1 | Starting HTTP server at 0.0.0.0:8536
lemmy-federate_1 | 2024-01-06T17:58:09.959279Z INFO lemmy_federate: Federating to 1553/2231 instances (678 dead, 0 disallowed)
lemmy-federate_1 | 2024-01-06T17:58:24.978823Z WARN lemmy_server: Received terminate, shutting down gracefully...
lemmy-federate_1 | 2024-01-06T17:58:24.980630Z WARN lemmy_federate: Waiting for 1553 workers (30.00s max)
lemmy-federate_1 | 2024-01-06T17:58:25.833808Z INFO lemmy_federate: Federation state as of 2024-01-06T17:58:25+00:00:
lemmy-federate_1 | 2024-01-06T17:58:25.833933Z INFO lemmy_federate: 1553 others up to date. 0 instances behind.
lemmy-federate_1 | Lemmy v0.19.1
lemmy-federate_1 | Federation enabled, host is 014450.xyz
lemmy-federate_1 | Starting HTTP server at 0.0.0.0:8536
lemmy-federate_1 | 2024-01-06T18:00:19.196179Z INFO lemmy_federate: Federating to 1553/2231 instances (678 dead, 0 disallowed)
Please provide a link to a public instance with the issue or attach screenshots of the https://phiresky.github.io/lemmy-federation-state/site to see what state the queues are in.
Please also follow these steps to get info about database performance:
enable pg_stat_statements and auto_explain by making sure these lines exist in the postgresql config (customPostgresql.conf):

shared_preload_libraries=pg_stat_statements,auto_explain
pg_stat_statements.track = all
auto_explain.log_min_duration=5000ms

open a psql repl by running docker compose exec -it db -u postgres psql -d lemmy and reset the stats by calling create extension pg_stat_statements; select pg_stat_statements_reset()

wait an hour

post the outputs of docker compose exec -it db -u postgres psql -qtAX -d lemmy -c 'select json_agg(a) from (select * from pg_stat_statements order by total_exec_time desc limit 10) a;' > total_exec_time.json

and

docker compose exec -it db -u postgres psql -qtAX -d lemmy -c 'select json_agg(a) from (select * from pg_stat_statements order by mean_exec_time desc limit 10) a;' > mean_exec_time.json
And finally, please provide statistics about the db pool by sending the output of the prometheus endpoint with curl lemmy-container:10002/metrics
Can you maybe add detail of what you think exactly the issue was for you? Was it that the rest of the site wasn't able to do any DB queries? (Don't know how that can happen since you have separate pools for the federation process and the other one)
Or that your CPU was at 100% and the normal process didn't get any CPU time?
Or that your network was fully used?
Only issue is the CPU stuck at 100%. The rest of the site works fine, maybe a bit slower than typical. No extra network usage that I could tell.
Can you maybe add detail of what you think exactly the issue was for you? Was it that the rest of the site wasn't able to do any DB queries?
This was the major problem on lemm.ee for sure - the CPU use of the worker was not a big deal, as I just moved the worker(s) onto separate hardware. Having said that, I can definitely see it being a problem for other instances that run on a single server, ESPECIALLY if the DB is also on that same server, as it is for many (most?) instances.
On lemm.ee, our database was just not ready for the extra load that came from full speed catch-up federation (as it resulted in tens of thousands of extra DB queries per second, which were quite cheap individually, but combined did have a significant effect).
I've now upgraded the lemm.ee database with a lot of extra headroom, so it's probably not a big issue going forward for lemm.ee, but I can definitely see some other instances running into the same issue.
database was just not ready for the extra load that came from full speed catch-up federation
That's interesting, because it also means that just reducing the pool size of the fed queue should fix it, since that way you can limit it to e.g. 10 concurrent queries per second.
Here are postgres stats from lemmy.pt:
mean_exec_time.json total_exec_time.json
One problem is with get_community_follower_inboxes(). Also updating community aggregates is extremely slow for some reason. Lots of time spent on inserting received activities and votes which is probably unavoidable.
The total_exec_time.json by @eric7456 also shows different community aggregate updates using up an extreme amount of time.
One problem is with get_community_follower_inboxes().
Do you mean get_instance_followed_community_inboxes? I don't think get_community_follower_inboxes is used anymore.
@Nutomic: Also updating community aggregates is extremely slow for some reason.
I experienced this on lemm.ee as well. The same is also true for updating site aggregates. This was the same issue: https://github.com/LemmyNet/lemmy/issues/4306#issuecomment-1878086069
For now I solved it on lemm.ee by just throwing more hardware at the problem, but one thing I mentioned in the above thread:
It might also be a good idea to try and optimize the calculation on Lemmy side, for example through some denormalization (like storing last_activity timestamps for each user directly, rather than calculating it every 15 minutes for all users).
I actually did a quick hacky experiment with such denormalization on lemm.ee on my old hardware, and saw significant improvement. I added a last_activity_timestamp column directly to the person table, and changed the site aggregates calculation to not have any joins, but simply count all rows in the person table where the last_activity_timestamp was within the time window we were aggregating for. The speed-up I got from that was 1-2 orders of magnitude, depending on the time window used.
I assume we could get a similar speed-up for community aggregates as well by creating some new community_last_activity table, which just stores community_id, person_id, last_activity_timestamp.
Basically the trade-off here would be that we need to start updating the last activity timestamps whenever somebody adds a vote/comment/post, but this seems relatively cheap compared to the huge joins that we are doing right now, which postgres seems to struggle with.
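To make the idea concrete, a hypothetical sketch of what the denormalized site-aggregates count could look like using diesel's raw SQL interface. The person.last_activity_timestamp column is the assumed addition from the experiment above (not part of Lemmy's actual schema), the function name is made up, and a synchronous PgConnection is used here for simplicity even though Lemmy itself uses async connections:

use diesel::prelude::*;
use diesel::sql_types::BigInt;

#[derive(QueryableByName)]
struct ActiveCount {
    #[diesel(sql_type = BigInt)]
    count: i64,
}

// Count users active in the last month via the denormalized timestamp,
// instead of joining posts/comments/likes on every aggregation run.
fn active_users_last_month(conn: &mut PgConnection) -> QueryResult<i64> {
    let row = diesel::sql_query(
        "SELECT count(*) AS count FROM person \
         WHERE last_activity_timestamp > now() - interval '1 month'",
    )
    .get_result::<ActiveCount>(conn)?;
    Ok(row.count)
}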
@philmichel You're right, it's get_instance_followed_community_inboxes(). However, get_community_follower_inboxes() is also in use, but not listed as a slow query.
@sunaurus Would be good if you can make a PR for that.
Requirements
Summary
Since upgrading to Lemmy 0.19 (and subsequently 0.19.1 - the issue is the same across those versions) via docker-compose, my small single-user instance shows very high load. Many lemmy_server processes are spawned, each consuming one full core, slowing the system to a crawl.

Steps to Reproduce
htop output:
Technical Details
Lemmy v0.19.1
Version
BE 0.19.1
Lemmy Instance URL
No response