LemmyNet / lemmy

🐀 A link aggregator and forum for the fediverse
https://join-lemmy.org
GNU Affero General Public License v3.0
12.95k stars 859 forks source link

[Bug]: 0.19.5 (and 0.19.4) InboxTimeout: InboxTimeout #4850

Open AgentScrubbles opened 2 weeks ago

AgentScrubbles commented 2 weeks ago

Requirements

Summary

After upgrading to 0.19.4 (or 0.19.5) from 0.19.3, I cannot access the UI, it times out. Checking Lemmy's logs, the following message repeats infinitely:

lemmy-1     | 2024-06-19T16:33:34.325525Z  WARN lemmy_server::root_span_builder: InboxTimeout: InboxTimeout
lemmy-1     |    0: lemmy_server::root_span_builder::HTTP request
lemmy-1     |            with http.method=POST http.scheme="https" http.host=<<instancename>> http.target=/inbox otel.kind="server" request_id=f86bfc9e-393c-44ad-adee-3c3f43cdb0e2
lemmy-1     |              at src/root_span_builder.rs:16

Steps to Reproduce

Unfortunately the only steps I have are from the upgrade itself, bumping the versions of the docker images. I'll add the docker versions below.

The only changes for the upgrade was pictrs from 0.5 to 0.5.16 and both Lemmy and Lemmy-UI to 0.19.4 (and then 5)

networks:
  # communication to web and clients
  externalproxy:
  localconnector:
  # communication between lemmy services
  lemmyinternal:
    driver: bridge
      # internal: true

services:
  certbot:
    image: certbot/certbot:latest
    volumes:
      - <<redacted>>:/var/www/certbot/:rw
      - <<redacted>>:/etc/letsencrypt/:rw

  proxy:
    image: nginx:1-alpine
    networks:
      - lemmyinternal
      - externalproxy
    ports:
      # only ports facing any connection from outside
      - 80:80 
      - 443:443
    volumes:
      - <<redacted>>:/etc/nginx/nginx.conf:ro
      - <<redacted>>:/etc/nginx/proxy.conf:ro
      - <<redacted>>:/var/www/certbot/:ro
      - <<redacted>>:/etc/nginx/ssl/:ro
      - <<redacted>>:/cert:ro
    restart: always
    logging:
      driver: "json-file"
      options:
        max-size: "50m"
    depends_on:
      - pictrs
      - lemmy-ui

  lemmy:
    image: dessalines/lemmy:0.19.5
    hostname: lemmy
    networks:
      - externalproxy
      - lemmyinternal
    restart: always
    environment:
      - RUST_LOG=warn
      # - RUST_LOG="warn,lemmy_server=warn,lemmy_api=info,lemmy_api_common=info,lemmy_api_crud=info,lemmy_apub=info,lemmy_db_schema=info,lemmy_db_views=info,lemmy_db_views_actor=info,lemmy_db_views_moderator=info,lemmy_routes=info,lemmy_utils=info,lemmy_websocket=warn"
    volumes:
      - ./lemmy.hjson:/config/config.hjson
    depends_on:
      - postgres
      - pictrs
    deploy:
      resources:
        limits:
          memory: 500M
    logging:
      driver: "json-file"
      options:
        max-size: "50m"

  lemmy-ui:
    image: dessalines/lemmy-ui:0.19.5
    networks:
      - lemmyinternal
    environment:
      - LEMMY_UI_LEMMY_INTERNAL_HOST=lemmy:8536
      - LEMMY_UI_LEMMY_EXTERNAL_HOST=<<redacted>>
      - LEMMY_HTTPS=true
    depends_on:
      - lemmy
    restart: always
    volumes:
      - ./volumes/lemmy-ui/extra_themes:/app/extra_themes

  pictrs:
    image: asonix/pictrs:0.5.16
    hostname: pictrs
    networks:
      - lemmyinternal
    environment:
      - RUST_BACKTRACE=1
      - PICTRS__STORE__TYPE=object_storage
      - PICTRS__STORE__ENDPOINT=<<redacted>>
      - PICTRS__STORE__BUCKET_NAME=<<redacted>>
      - PICTRS__STORE__REGION=<<redacted>>
      - PICTRS__STORE__ACCESS_KEY=<<redacted>>
      - PICTRS__STORE__SECRET_KEY=<<redacted>>
      - PICTRS__STORE__USE_PATH_STYLE=false
      - PICTRS__API_KEY=API_KEY
      - PICTRS__SERVER__API_KEY=<<redacted>>
    user: 991:991
    volumes:
      - ./volumes/pictrs:/mnt
    restart: always

  postgres:
    image: postgres:16.2-alpine
    # this needs to match the database host in lemmy.hson
    hostname: postgres
    networks:
      - lemmyinternal
    environment:
      - POSTGRES_USER=<<redacted>>
      - POSTGRES_PASSWORD=<<redacted>>
      - POSTGRES_DB=<<redacted>>
    volumes:
      - <<volume>>:/var/lib/postgresql/data
      - <<volume>>:/backups
    restart: always
    deploy:
      resources:
        limits:
          memory: 1500M

Technical Details

postgres-1  | 
postgres-1  | PostgreSQL Database directory appears to contain a database; Skipping initialization
postgres-1  | 
lemmy-1     | Lemmy v0.19.5
postgres-1  | 2024-06-19 16:32:08.986 UTC [1] LOG:  starting PostgreSQL 16.2 on x86_64-pc-linux-musl, compiled by gcc (Alpine 13.2.1_git20231014) 13.2.1 20231014, 64-bit
postgres-1  | 2024-06-19 16:32:08.986 UTC [1] LOG:  listening on IPv4 address "0.0.0.0", port 5432
postgres-1  | 2024-06-19 16:32:08.986 UTC [1] LOG:  listening on IPv6 address "::", port 5432
postgres-1  | 2024-06-19 16:32:08.993 UTC [1] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
postgres-1  | 2024-06-19 16:32:09.010 UTC [25] LOG:  database system was shut down at 2024-06-19 16:31:45 UTC
postgres-1  | 2024-06-19 16:32:09.049 UTC [1] LOG:  database system is ready to accept connections
lemmy-1     | Federation enabled, host is <<host>>
lemmy-1     | Starting HTTP server at 0.0.0.0:8536
lemmy-1     | 2024-06-19T16:32:37.794541Z  WARN lemmy_server::root_span_builder: IncorrectLogin: IncorrectLogin
lemmy-1     |    0: lemmy_server::root_span_builder::HTTP request
lemmy-1     |            with http.method=GET http.scheme="https" http.host=<<host>> http.target=/api/v3/private_message/list otel.kind="server" request_id=44289a0e-cf91-4219-8244-6e8a574e8969
lemmy-1     |              at src/root_span_builder.rs:16
lemmy-1     | 2024-06-19T16:32:44.106053Z  WARN lemmy_server::root_span_builder: Unknown: Domain "discuss.tchncs.de" is blocked
lemmy-1     |    0: lemmy_apub::objects::comment::verify
lemmy-1     |              at crates/apub/src/objects/comment.rs:139
lemmy-1     |    1: lemmy_apub::fetcher::post_or_comment::verify
lemmy-1     |              at crates/apub/src/fetcher/post_or_comment.rs:68
lemmy-1     |    2: lemmy_apub::activity_lists::community
lemmy-1     |              at crates/apub/src/activity_lists.rs:103
lemmy-1     |    3: lemmy_apub::activities::community::announce::receive
lemmy-1     |              at crates/apub/src/activities/community/announce.rs:161
lemmy-1     |    4: lemmy_server::root_span_builder::HTTP request
lemmy-1     |            with http.method=POST http.scheme="https" http.host=<<host>> http.target=/inbox otel.kind="server" request_id=fab1a369-7ffb-4af8-8768-ff43175ef648
lemmy-1     |              at src/root_span_builder.rs:16
lemmy-1     | 2024-06-19T16:32:53.886029Z  WARN lemmy_server::root_span_builder: InboxTimeout: InboxTimeout
lemmy-1     |    0: lemmy_server::root_span_builder::HTTP request
lemmy-1     |            with http.method=POST http.scheme="https" http.host=<<host>> http.target=/inbox otel.kind="server" request_id=40d9c3ee-cb31-4b86-b4c3-8dfaa058a9c3
lemmy-1     |              at src/root_span_builder.rs:16
lemmy-1     | 2024-06-19T16:32:57.904680Z  WARN lemmy_server::root_span_builder: InboxTimeout: InboxTimeout
lemmy-1     |    0: lemmy_server::root_span_builder::HTTP request
lemmy-1     |            with http.method=POST http.scheme="https" http.host=<<host>> http.target=/inbox otel.kind="server" request_id=c7449a12-f730-4e57-9bc3-c64f2d8eb643
lemmy-1     |              at src/root_span_builder.rs:16
lemmy-1     | 2024-06-19T16:32:59.652438Z  WARN lemmy_server::root_span_builder: InboxTimeout: InboxTimeout
lemmy-1     |    0: lemmy_server::root_span_builder::HTTP request
lemmy-1     |            with http.method=POST http.scheme="https" http.host=<<host>> http.target=/inbox otel.kind="server" request_id=0c7754ac-fa41-4673-9f40-00dd349b626b
lemmy-1     |              at src/root_span_builder.rs:16

Version

0.19.4

Lemmy Instance URL

poptalk.scrubbles.tech

Nutomic commented 2 weeks ago

I dont think that error is related to ui problems. What does /api/v3/site return? And are there any error messages logged for lemmy-ui on the server or in browser console?

AgentScrubbles commented 2 weeks ago

Okay the damndest thing, I had rolled back to 0.19.3, and to test what you were asking I just bumped back to 0.19.5 and everything worked perfectly, no issues at all.

So, only thing I can think of is that it was some first run thing. To test it I brought the entire stack down and back up, and everything again is working. I did see notes about inbox, but it seemed to be asking for the inbox from lemmy.ml? But those all passed with a 200 so I'm not sure what happened.

I'm not sure what caused it, maybe it was pictrs updating? Either way, I'm going to mark this as resolved. If anyone stumbles on this issue, I guess downgrade back to 0.19.3, let it fully start, and then upgrade back to 0.19.5? (Do a backup first of course)

AgentScrubbles commented 2 weeks ago

I'm reopening because it did start happening again, and I think I have more context. With the logs that @Nutomic requested, I can see that it's repeating /InboxTimeout when there's an underlying cause (maybe, I don't know how the code works, maybe just a guess)

Full logs including lemmy-ui for a timeout, it did eventually resolve:

2024-06-21T16:07:07.255646Z  WARN lemmy_server::root_span_builder: InboxTimeout: InboxTimeout
lemmy-1     |    0: lemmy_server::root_span_builder::HTTP request
lemmy-1     |            with http.method=POST http.scheme="https" http.host=poptalk.scrubbles.tech http.target=/inbox otel.kind="server" request_id=640a0fda-5996-458f-a8a8-f5eab6f40d30
lemmy-1     |              at src/root_span_builder.rs:16
lemmy-1     | 2024-06-21T16:07:07.975568Z  WARN lemmy_server::root_span_builder: InboxTimeout: InboxTimeout
lemmy-1     |    0: lemmy_server::root_span_builder::HTTP request
lemmy-1     |            with http.method=POST http.scheme="https" http.host=poptalk.scrubbles.tech http.target=/inbox otel.kind="server" request_id=ffde6c38-f4ac-4603-b2f3-daacfa50f631
lemmy-1     |              at src/root_span_builder.rs:16
lemmy-1     | 2024-06-21T16:07:10.535633Z  WARN lemmy_server::root_span_builder: InboxTimeout: InboxTimeout
lemmy-1     |    0: lemmy_server::root_span_builder::HTTP request
lemmy-1     |            with http.method=POST http.scheme="https" http.host=poptalk.scrubbles.tech http.target=/inbox otel.kind="server" request_id=84afb98d-3f8c-4d35-b83a-0c84b9b85ab6
lemmy-1     |              at src/root_span_builder.rs:16
lemmy-1     | 2024-06-21T16:07:11.030047Z  WARN lemmy_server::root_span_builder: IncorrectLogin: IncorrectLogin
lemmy-1     |    0: lemmy_server::root_span_builder::HTTP request
lemmy-1     |            with http.method=GET http.scheme="https" http.host=poptalk.scrubbles.tech http.target=/api/v3/private_message/list otel.kind="server" request_id=d58c52d2-56a8-4ec1-9771-1ca25d076dc6
lemmy-1     |              at src/root_span_builder.rs:16
lemmy-1     | 2024-06-21T16:07:12.615836Z  WARN lemmy_server::root_span_builder: InboxTimeout: InboxTimeout
lemmy-1     |    0: lemmy_server::root_span_builder::HTTP request
lemmy-1     |            with http.method=POST http.scheme="https" http.host=poptalk.scrubbles.tech http.target=/inbox otel.kind="server" request_id=0bd05f02-ce1e-4a9e-96e5-4261730a60a5
lemmy-1     |              at src/root_span_builder.rs:16
lemmy-1     | 2024-06-21T16:07:13.175589Z  WARN lemmy_server::root_span_builder: InboxTimeout: InboxTimeout
lemmy-1     |    0: lemmy_server::root_span_builder::HTTP request
lemmy-1     |            with http.method=POST http.scheme="https" http.host=poptalk.scrubbles.tech http.target=/inbox otel.kind="server" request_id=ff62e555-6ff6-4fbf-880e-70f832f0952b
lemmy-1     |              at src/root_span_builder.rs:16
lemmy-1     | 2024-06-21T16:07:14.713999Z  WARN lemmy_server::root_span_builder: CouldntFindPerson: CouldntFindPerson
lemmy-1     |    0: lemmy_apub::api::read_person::read_person
lemmy-1     |            with data=Query(GetPersonDetails { person_id: None, username: Some("ShortBoweledClown@lemmy.one"), sort: Some(New), page: Some(1), limit: Some(20), community_id: None, saved_only: Some(false) }) local_user_view=None
lemmy-1     |              at crates/apub/src/api/read_person.rs:18
lemmy-1     |    1: lemmy_server::root_span_builder::HTTP request
lemmy-1     |            with http.method=GET http.scheme="http" http.host=lemmy:8536 http.target=/api/v3/user otel.kind="server" request_id=919d32ff-a531-439b-a366-8ea29e1a206c
lemmy-1     |              at src/root_span_builder.rs:16
lemmy-1     | 2024-06-21T16:07:14.855723Z  WARN lemmy_server::root_span_builder: InboxTimeout: InboxTimeout
lemmy-1     |    0: lemmy_server::root_span_builder::HTTP request
lemmy-1     |            with http.method=POST http.scheme="https" http.host=poptalk.scrubbles.tech http.target=/inbox otel.kind="server" request_id=2542edc0-fcb1-461a-9a92-3424aa194f44
lemmy-1     |              at src/root_span_builder.rs:16
lemmy-ui-1  | Error: couldnt_find_person
lemmy-ui-1  |     at LemmyHttp.<anonymous> (/app/node_modules/.pnpm/lemmy-js-client@0.19.4/node_modules/lemmy-js-client/dist/http.js:930:19)
lemmy-ui-1  |     at Generator.next (<anonymous>)
lemmy-ui-1  |     at fulfilled (/app/node_modules/.pnpm/lemmy-js-client@0.19.4/node_modules/lemmy-js-client/dist/http.js:5:58)
lemmy-ui-1  |     at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
lemmy-1     | 2024-06-21T16:07:16.535695Z  WARN lemmy_server::root_span_builder: InboxTimeout: InboxTimeout
lemmy-1     |    0: lemmy_server::root_span_builder::HTTP request
lemmy-1     |            with http.method=POST http.scheme="https" http.host=poptalk.scrubbles.tech http.target=/inbox otel.kind="server" request_id=f1e0de6c-7c22-41a0-915a-05cd2a0d9dcf
lemmy-1     |              at src/root_span_builder.rs:16

No errors in the browser log.

So the page does eventually load, after a very long wait (20-30 seconds sometimes), with those logs repeating

dessalines commented 2 weeks ago

Those inboxtimeouts are almost certainly not related to the UI or API speed, so this is a different issue.

What UI pages are you getting slow responses from, all of them? Does api/v3/site suffer from the same problem?

AgentScrubbles commented 2 weeks ago

From what I can tell it's mostly the initial load of the page, API calls are slow, but it looks like once things get hydrated they back off and get faster. However the initial page loads usually still take a while, and while I can't say for sure, it did eventually crash after about 30 hours of usage before (where .3 did not). This is why I believe before it started working again, I probably waited longer for the site to be fully hydrated. However, intermittently it does happen again and the page loads jump back up to 1+ minutes for a response.

I think I agree, that the InboxTimeout may be a Red Herring, after reviewing the lemmy-ui logs I see many Request Timeouts, and I think that's what I'm seeing in my browser when I load, is that it's waiting on those to resolve and they're not. Something about sockets in there?

I'm attaching everything I can here. image github_logs.txt

@dessalines I'm sorry, I don't see any calls to api/v3/site, do you want me to just do a standard postman call to it by itself?

Nutomic commented 1 week ago

Ive noticed various times that the comments for a post are not loading, maybe thats related. Cant think of any reason though.

AgentScrubbles commented 1 week ago

Cant think of any reason though.

Trust me, I know that this is a developer's worst nightmare, the intermittent slowdown issues. Not a lot to go on. I want to say it's roughly every 24 hours - but there's no way to know for sure. Then I try later and it's perfectly fine. I'm keeping the devtools open today just to see if there's anything that stands out. If there's any other logging or metrics or anything you folks can think of I'll happily send it along.

AgentScrubbles commented 1 week ago

More graphs, how sometimes the three update calls are almost instantaneous, but a comment can take 11+ seconds image

AgentScrubbles commented 1 week ago

And now running htop, I'm seeing massive CPU spikes from postgres (which I assumed anyway) and Lemmy (calling it), but also massive, massive spikes from node dist/js/server.js. Is this lemmy-ui's host? Since lemmy is written in Go I'm surprised to see node in there, so I can only assume it's the host for lemmy-ui. Is there processing that that does? Maybe pictrs? Confirming with docker stats too, lemmy-ui is the resource hog right now.

And since node is single threaded it makes sense that if it gets a huge amount of stuff to do that it wouldn't be able to return the page in time, and cause timeouts. (I'm purely guessing on this one, just from my own experience there)

docker stats: image