element-hq / dendrite

Dendrite is a second-generation Matrix homeserver written in Go!
https://element-hq.github.io/dendrite/
GNU Affero General Public License v3.0

Monolithic server client API becomes unresponsive until restarted #1676

Closed: matrixbot closed this issue 3 weeks ago

matrixbot commented 3 weeks ago

This issue was originally created by @jaywink at https://github.com/matrix-org/dendrite/issues/1676.

Background information

Description

My Dendrite server has become unresponsive twice in a 24-hour period. I notice it when Element Web (app.element.io) reports that the server is offline. The console logs show sync endpoints timing out.

Restarting the Docker container resolved the issue both times.

I couldn't spot anything interesting in the logs; the errors are mainly related to fetching remote auth events. I will privately submit a log file that covers the second occurrence.

Dendrite is a post-v0.3.4 development build at commit bca2790c678887232266c5726be79b80ddd9930b (which is part of https://github.com/matrix-org/dendrite/pull/1672). MSC2836 is enabled in the configuration.

While the server is unresponsive, the logs still show activity in the federation APIs.

matrixbot commented 3 weeks ago

This comment was originally posted by @RobinJ1995 at https://github.com/matrix-org/dendrite/issues/1676#issuecomment-753695781.

Can confirm this also happens in a polylith setup.

matrixbot commented 3 weeks ago

This comment was originally posted by @neilalexander at https://github.com/matrix-org/dendrite/issues/1676#issuecomment-753939222.

> The console logs show sync endpoints timing out.

Do you have more information on this, like how long it took for the request to time out?

matrixbot commented 3 weeks ago

This comment was originally posted by @jaywink at https://github.com/matrix-org/dendrite/issues/1676#issuecomment-754602545.

> Do you have more information on this, like how long it took for the request to time out?

Unfortunately not :/ It felt like minutes. It hasn't happened again yet; I'll ping if it does (at least if it doesn't recover by itself, since then I'll notice).

matrixbot commented 3 weeks ago

This comment was originally posted by @PureTryOut at https://github.com/matrix-org/dendrite/issues/1676#issuecomment-754646326.

It seems I had the same issue, as reported in #1685. For me the request timed out after 120003 ms, which is about 2 minutes. That is Nginx timing out, since I get a 504 Gateway Timeout response.

matrixbot commented 3 weeks ago

This comment was originally posted by @neilalexander at https://github.com/matrix-org/dendrite/issues/1676#issuecomment-758205833.

Does this still happen on Dendrite 0.3.5?

matrixbot commented 3 weeks ago

This comment was originally posted by @jaywink at https://github.com/matrix-org/dendrite/issues/1676#issuecomment-759295478.

> Does this still happen on Dendrite 0.3.5?

Yes, it looks like this happened sometime in the last 12 hours. I just opened my browser and am seeing sync time out; I've been on 0.3.5 since Monday or so.

[Screenshot: Selection_981]

Process is pretty much idling:

[Screenshot: Selection_982]

I've copied the logs and will send you the bundle if you want to have a look. I haven't restarted the process yet and am happy to give you access if you want, though since it takes a while to happen, that's probably not very useful?

matrixbot commented 3 weeks ago

This comment was originally posted by @jaywink at https://github.com/matrix-org/dendrite/issues/1676#issuecomment-759298603.

Also, to update: I'm now running vanilla, untweaked 0.3.5, since my appservices patch was included in that release.

matrixbot commented 3 weeks ago

This comment was originally posted by @jaywink at https://github.com/matrix-org/dendrite/issues/1676#issuecomment-759301988.

Actually, it's not just sync: a remote invite I tried to send also timed out. From a remote Synapse:

{PUT-O-1566172} [jasonrobinson.me] Request failed: PUT matrix://jasonrobinson.me/_matrix/federation/v2/invite/%21abcdefgh%3Adomain.tld/redactedid: ResponseNeverReceived:[CancelledError()]

matrixbot commented 3 weeks ago

This comment was originally posted by @PureTryOut at https://github.com/matrix-org/dendrite/issues/1676#issuecomment-759308065.

So far I can't reproduce the issue anymore, but I'll give it a few more days before I confirm for sure.

matrixbot commented 3 weeks ago

This comment was originally posted by @PureTryOut at https://github.com/matrix-org/dendrite/issues/1676#issuecomment-762220174.

Actually, I can reproduce it again.

I don't think it has anything to do with the sync API per se, as federation also seems to be behind when I restart the service. The federation tester doesn't get a proper result back either.

matrixbot commented 3 weeks ago

This comment was originally posted by @kegsay at https://github.com/matrix-org/dendrite/issues/1676#issuecomment-766770231.

Hmm, I wonder if we're leaking connections that all end up in CLOSE_WAIT, starving our ability to service requests. Can you run netstat to see how many open sockets the process has? cc @neilalexander

matrixbot commented 3 weeks ago

This comment was originally posted by @jaywink at https://github.com/matrix-org/dendrite/issues/1676#issuecomment-767390591.

/etc/dendrite $ netstat -ae
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       
tcp        0      0 127.0.0.11:44527        0.0.0.0:*               LISTEN      
tcp        0      0 af6e9eaaeeeb:53554      dendrite-postgres.dendrite:postgresql ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:58620      dendrite-postgres.dendrite:postgresql ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:58500      dendrite-postgres.dendrite:postgresql ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:49276      dendrite-postgres.dendrite:postgresql ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:53540      dendrite-postgres.dendrite:postgresql ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:53550      dendrite-postgres.dendrite:postgresql ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:53552      dendrite-postgres.dendrite:postgresql ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:53576      dendrite-postgres.dendrite:postgresql ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:56624      dendrite-postgres.dendrite:postgresql ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:54968      dendrite-postgres.dendrite:postgresql ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:53544      dendrite-postgres.dendrite:postgresql ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:58068      dendrite-postgres.dendrite:postgresql ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:58622      dendrite-postgres.dendrite:postgresql ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:36256      dendrite-postgres.dendrite:postgresql ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:53542      dendrite-postgres.dendrite:postgresql ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:54442      dendrite-postgres.dendrite:postgresql ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:53346      dendrite-postgres.dendrite:postgresql ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:38586      dendrite-postgres.dendrite:postgresql ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:54190      dendrite-postgres.dendrite:postgresql ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:58522      dendrite-postgres.dendrite:postgresql ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:56412      dendrite-postgres.dendrite:postgresql ESTABLISHED 
tcp        0      0 :::8008                 :::*                    LISTEN      
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:38716  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:38380  ESTABLISHED 
udp        0      0 127.0.0.11:55670        0.0.0.0:* 

That doesn't look too likely, unless I'm using the wrong command or arguments. The last time this happened was 4 days ago (based on when I last had to restart it). The next time it happens, I can run this again before restarting.

matrixbot commented 3 weeks ago

This comment was originally posted by @jaywink at https://github.com/matrix-org/dendrite/issues/1676#issuecomment-768131342.

@kegsay It happened again; this is what netstat shows now:

/etc/dendrite $ netstat
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       
tcp        0      0 af6e9eaaeeeb:53554      dendrite-postgres.dendrite:postgresql ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:58620      dendrite-postgres.dendrite:postgresql ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:38816      dendrite-postgres.dendrite:postgresql ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:38820      dendrite-postgres.dendrite:postgresql ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:38824      dendrite-postgres.dendrite:postgresql ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:58500      dendrite-postgres.dendrite:postgresql ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:60636      dendrite-postgres.dendrite:postgresql ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:38028      dendrite-postgres.dendrite:postgresql ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:53540      dendrite-postgres.dendrite:postgresql ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:53550      dendrite-postgres.dendrite:postgresql ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:38822      dendrite-postgres.dendrite:postgresql ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:53552      dendrite-postgres.dendrite:postgresql ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:36162      dendrite-postgres.dendrite:postgresql ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:53576      dendrite-postgres.dendrite:postgresql ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:54968      dendrite-postgres.dendrite:postgresql ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:53544      dendrite-postgres.dendrite:postgresql ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:58068      dendrite-postgres.dendrite:postgresql ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:38120      dendrite-postgres.dendrite:postgresql ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:58622      dendrite-postgres.dendrite:postgresql ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:36256      dendrite-postgres.dendrite:postgresql ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:53542      dendrite-postgres.dendrite:postgresql ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:38814      dendrite-postgres.dendrite:postgresql ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:38818      dendrite-postgres.dendrite:postgresql ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:54442      dendrite-postgres.dendrite:postgresql ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:38812      dendrite-postgres.dendrite:postgresql ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:38810      dendrite-postgres.dendrite:postgresql ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:53346      dendrite-postgres.dendrite:postgresql ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:38586      dendrite-postgres.dendrite:postgresql ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:54190      dendrite-postgres.dendrite:postgresql ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:56794  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:53092  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:60258  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:59114  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:60216  CLOSE_WAIT  
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:54104  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:58034  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:59364  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:57358  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:54586  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:53002  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:53156  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:55928  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:54868  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:57906  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:59336  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:57852  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:53018  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:55644  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:53412  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:55158  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:57458  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:56462  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:57068  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:54696  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:54774  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:53500  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:55400  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:56760  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:57512  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:56000  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:56514  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:55496  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:57554  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:55944  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:60256  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:58004  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:59450  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:55348  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:59706  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:56398  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:54628  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:57342  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:53590  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:54718  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:53036  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:59724  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:54526  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:56014  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:54140  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:55428  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:55094  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:58236  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:53198  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:56924  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:52922  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:57306  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:54910  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:57814  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:56322  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:58798  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:54680  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:59278  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:56830  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:55246  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:58768  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:55764  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:55192  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:60108  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:55908  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:58618  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:59308  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:53644  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:57232  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:53562  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:58956  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:53362  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:58242  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:56816  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:53818  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:57704  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:55382  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:57034  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:58444  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:55850  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:57100  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:56436  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:55102  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:59426  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:60236  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:53232  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:58890  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:55978  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:57926  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:56536  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:60070  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:55174  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:59084  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:58120  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:54004  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:54072  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:54172  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:53432  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:59182  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:56978  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:57488  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:58112  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:54120  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:56484  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:54972  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:55032  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:56460  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:59608  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:54322  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:56178  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:56242  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:56402  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:54850  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:57164  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:53990  ESTABLISHED 
tcp        0      0 af6e9eaaeeeb:8008       traefik.dendrite:58688  ESTABLISHED 
Active UNIX domain sockets (w/o servers)
Proto RefCnt Flags       Type       State         I-Node Path

/etc/dendrite $ netstat | wc -l
154
matrixbot commented 3 weeks ago

This comment was originally posted by @jaywink at https://github.com/matrix-org/dendrite/issues/1676#issuecomment-768180856.

I'm running netstat | wc -l under watch, and it's now up to 171... now 172 :grin:

Edit, 12 hours later: it got up to 310 before I killed it to update to 0.3.7.

matrixbot commented 3 weeks ago

This comment was originally posted by @osmarks at https://github.com/matrix-org/dendrite/issues/1676#issuecomment-770294640.

I ran into the same issue after only a few hours of uptime, so it doesn't seem to be purely time-based. What logs do you need to help debug this?

matrixbot commented 3 weeks ago

This comment was originally posted by @neilalexander at https://github.com/matrix-org/dendrite/issues/1676#issuecomment-775867944.

Can you all please confirm which architecture you are running on? amd64, aarch64, etc.

matrixbot commented 3 weeks ago

This comment was originally posted by @PureTryOut at https://github.com/matrix-org/dendrite/issues/1676#issuecomment-775923218.

As mentioned on Matrix (posting it here too, to keep an overview), I'm running it on aarch64, on Alpine Linux, where Go is built against musl rather than glibc (not sure whether that matters).

matrixbot commented 3 weeks ago

This comment was originally posted by @osmarks at https://github.com/matrix-org/dendrite/issues/1676#issuecomment-776909731.

> Can you all please confirm which architecture you are running on? amd64, aarch64, etc.

My instance is on amd64. I haven't had the issue since then though.

matrixbot commented 3 weeks ago

This comment was originally posted by @jaywink at https://github.com/matrix-org/dendrite/issues/1676#issuecomment-776993149.

> Can you all please confirm which architecture you are running on? amd64, aarch64, etc.

amd64

matrixbot commented 3 weeks ago

This comment was originally posted by @viasux at https://github.com/matrix-org/dendrite/issues/1676#issuecomment-814455331.

I believe I am having the same issue. The federation tester shows no errors, and the logs are very vague.

matrixbot commented 3 weeks ago

This comment was originally posted by @yaliqmadiq at https://github.com/matrix-org/dendrite/issues/1676#issuecomment-815868138.

Let's try to find out which goroutines are actually running when this happens. Make sure your terminal has a massive scrollback buffer (10,000+ lines if possible): I just tried redirecting stdout to a file and it didn't capture the dump, because the Go runtime writes it to stderr (redirecting stderr with 2> should work). (I'm opening a new issue to add a -o option to mono/poly.)

Try this... next time the server freezes up like that:

ps -A | grep dend

and note the PID of the Dendrite process,

then do this:

kill -3 {pid}

Sending SIGQUIT (signal 3) makes a Go program like Dendrite exit and print a full dump of all the goroutines that were running at that moment, including the call stack of each goroutine. SIGHUP (kill -1) and SIGINT (kill -2) do not do this: by default they simply terminate a Go program without a dump, unless the program handles them itself. This behavior smells like a goroutine leak in Dendrite.

At the end of the output there will be a massive dump of all the running goroutines and their call stacks, which should reveal which goroutine spawns are accumulating.

Capture the end of the output, from the point where you sent SIGQUIT, and paste that goroutine dump here. It might yield a clue as to what is leaking goroutines.

matrixbot commented 3 weeks ago

This comment was originally posted by @jaywink at https://github.com/matrix-org/dendrite/issues/1676#issuecomment-1093466487.

This hasn't happened to me for a while now, whereas it previously happened fairly consistently after some time. As far as I can tell, I haven't seen it since upgrading to v0.6.4.

Closing; if anyone sees this again, please ping with fresh details.