wojski opened 2 years ago
problem should be solved by PR #5734
I believe I am still experiencing a similar issue on 2.29.0-stable-011a54a2.
After a little while, I am no longer able to get new blocks or new pending blocks via websocket subscription, and I see the message "error while notifying subscription" in my rpcdaemon log.
I've noticed it seems to specifically happen when I am requesting txpool_contents.
We're still seeing this in 2.30.0
I am still seeing this error as well
Weird. I've been subscribing to new heads without issue since https://github.com/ledgerwatch/erigon/pull/5734 (reported in https://github.com/ledgerwatch/erigon/issues/5577) was merged into v2.28.1.
(I'm currently running v2.32.0)
it would be useful to get the pprof goroutine trace if possible when this happens to your node.
--pprof --pprof.port=6060 --pprof.addr=0.0.0.0
and then go to http://node-ip:6060/debug/pprof and download the "full goroutine stack dump"
also if possible - run the rpcdaemon separately so there is less noise in the file.
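For reference, the "full goroutine stack dump" link on that page is served at /debug/pprof/goroutine?debug=2, so it can also be fetched non-interactively. A minimal Go sketch of doing that (the node-ip host and the output filename are placeholders, not part of the instructions above):

```go
// Fetch the full goroutine stack dump from the node's pprof endpoint and save it to a file.
package main

import (
	"io"
	"log"
	"net/http"
	"os"
)

func main() {
	// debug=2 returns the same "full goroutine stack dump" linked from /debug/pprof.
	resp, err := http.Get("http://node-ip:6060/debug/pprof/goroutine?debug=2")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	out, err := os.Create("goroutines.txt")
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()

	// Write the dump to disk so it can be attached to the issue.
	if _, err := io.Copy(out, resp.Body); err != nil {
		log.Fatal(err)
	}
}
```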
Koray's node seems to have locked here: https://github.com/ledgerwatch/erigon/blob/devel/turbo/rpchelper/filters.go#L526
goroutine 82 [chan send, 29 minutes]:
github.com/ledgerwatch/erigon/turbo/rpchelper.(*Filters).OnNewEvent(0xc08c45c680, 0xc00007e120?)
github.com/ledgerwatch/erigon/turbo/rpchelper/filters.go:526 +0x32b
github.com/ledgerwatch/erigon/cmd/rpcdaemon/rpcservices.(*RemoteBackend).Subscribe(0xc08c443310, {0x1e22b68, 0xc0013c5a00}, 0xc025a56c00)
github.com/ledgerwatch/erigon/cmd/rpcdaemon/rpcservices/eth_backend.go:141 +0x1d5
github.com/ledgerwatch/erigon/turbo/rpchelper.New.func1()
github.com/ledgerwatch/erigon/turbo/rpchelper/filters.go:84 +0xe5
created by github.com/ledgerwatch/erigon/turbo/rpchelper.New
github.com/ledgerwatch/erigon/turbo/rpchelper/filters.go:73 +0x33f
since OnNewEvent seems to hold the read lock on the filters mutex while blocked, stopping new subscribers from being added.
There could be multiple issues that cause the same symptom: the filters mutex going into deadlock. Perhaps it is partially fixed, but there are still other cases where it is not?
https://github.com/ledgerwatch/erigon/blob/devel/turbo/rpchelper/filters.go#L396
It also seems to be stuck here, so this is with the patch applied:
goroutine 29199 [semacquire, 18 minutes]:
sync.runtime_SemacquireMutex(0x42a8a5?, 0x60?, 0x0?)
runtime/sema.go:71 +0x25
sync.(*RWMutex).Lock(0x1557540?)
sync/rwmutex.go:144 +0x71
github.com/ledgerwatch/erigon/turbo/rpchelper.(*Filters).UnsubscribePendingTxs(0xc08c45c680, {0xc084710ec0, 0x40})
github.com/ledgerwatch/erigon/turbo/rpchelper/filters.go:396 +0xe6
github.com/ledgerwatch/erigon/cmd/rpcdaemon/commands.(*APIImpl).NewPendingTransactions.func1()
github.com/ledgerwatch/erigon/cmd/rpcdaemon/commands/eth_filters.go:188 +0x378
created by github.com/ledgerwatch/erigon/cmd/rpcdaemon/commands.(*APIImpl).NewPendingTransactions
github.com/ledgerwatch/erigon/cmd/rpcdaemon/commands/eth_filters.go:174 +0xdc
there must be another deadlock condition not being caught.
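To make the suspected pattern concrete, here is a self-contained Go sketch (illustrative names only, not the actual Erigon types) of how a blocking channel send under the read lock can wedge against an unsubscribe that needs the write lock, which is exactly how the "chan send" and "semacquire" goroutines above are parked:

```go
// Minimal reproduction of the suspected deadlock pattern, not Erigon code.
package main

import (
	"fmt"
	"sync"
	"time"
)

type filters struct {
	mu   sync.RWMutex
	subs map[int]chan int
}

// onNewEvent mirrors the notifier: read-lock, then send to every subscriber channel.
func (f *filters) onNewEvent(v int) {
	f.mu.RLock()
	defer f.mu.RUnlock()
	for _, ch := range f.subs {
		ch <- v // blocks forever if the subscriber is no longer reading
	}
}

// unsubscribe mirrors UnsubscribePendingTxs: it needs the write lock to remove the sub.
func (f *filters) unsubscribe(id int) {
	f.mu.Lock() // blocks while onNewEvent still holds the read lock
	defer f.mu.Unlock()
	delete(f.subs, id)
}

func main() {
	f := &filters{subs: map[int]chan int{1: make(chan int)}} // unbuffered sub channel

	go f.onNewEvent(42) // notifier takes the RLock, then blocks on the channel send

	time.Sleep(100 * time.Millisecond)
	go f.unsubscribe(1) // the subscriber stops reading and tries to unsubscribe instead

	time.Sleep(time.Second)
	fmt.Println("both goroutines are now stuck: chan send under RLock vs Lock in unsubscribe")
}
```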
update:
goroutine 44 [chan send, 97 minutes]:
github.com/ledgerwatch/erigon/turbo/rpchelper.(*Filters).OnNewTx(0xc063ebba00, 0xc07d3cb000)
github.com/ledgerwatch/erigon/turbo/rpchelper/filters.go:580 +0x171
github.com/ledgerwatch/erigon/turbo/rpchelper.(*Filters).subscribeToPendingTransactions(0xc000df7a00?, {0x1e22b68, 0xc000df7a00}, {0x1e2ba80, 0xc05e435df0})
github.com/ledgerwatch/erigon/turbo/rpchelper/filters.go:216 +0x125
github.com/ledgerwatch/erigon/turbo/rpchelper.New.func3()
github.com/ledgerwatch/erigon/turbo/rpchelper/filters.go:132 +0x95
created by github.com/ledgerwatch/erigon/turbo/rpchelper.New
github.com/ledgerwatch/erigon/turbo/rpchelper/filters.go:125 +0x48a
Another one, same issue, this time on the other filter. It seems something is wrong with the logic for sending to all the subs.
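For illustration, one common way to keep a broadcast loop like this from blocking forever on a single dead or slow subscriber is a non-blocking send to a buffered per-subscriber channel. This is only a sketch of the general technique, not the fix that went into Erigon:

```go
// Sketch of a broadcast that drops events for subscribers that stop draining their channel,
// instead of blocking the whole notifier while holding the read lock.
package main

import (
	"fmt"
	"sync"
)

type subs struct {
	mu    sync.RWMutex
	chans map[int]chan int
}

func (s *subs) broadcast(v int) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	for id, ch := range s.chans {
		select {
		case ch <- v:
			// delivered
		default:
			// subscriber is not keeping up; drop rather than hold the lock forever
			fmt.Printf("dropping event %d for slow subscriber %d\n", v, id)
		}
	}
}

func main() {
	s := &subs{chans: map[int]chan int{1: make(chan int, 8)}} // buffered sub channel
	for i := 0; i < 20; i++ {
		s.broadcast(i) // fills the buffer, then starts dropping instead of deadlocking
	}
}
```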
Thanks for the advice, @elee1766. I will keep tracking this issue.
Isn't this already fixed? I am not experiencing it any more.
One of the causes is fixed, but there appears to be at least one additional cause. I experienced the deadlock on some Polygon nodes. They are running the maticnetwork fork, but that fork has this PR in it. Is there some command I should run if I see the deadlock happen again?
Why is this issue still open? That's worrying
So I fixed one cause by redoing the subscription engine, but it seems there is still something else?
@WyseNynja is it still happening? If so, yes: if you run your node with pprof, the debug endpoint will give you the goroutine dump (see above).
System information
Erigon version: erigon/2022.09.3/linux-amd64/go1.19.1
OS & Version: Linux
Commit hash: https://hub.docker.com/layers/thorax/erigon/v2022.09.03/images/sha256-a8d14345cf3b8f089e38b660508a11d9d7c7dd659d6906372aab187fbc8ee083?context=explore
Expected behaviour
Websocket with message:
{"jsonrpc":"2.0","id": 1, "method": "eth_subscribe", "params": ["newHeads"]}
Should return new blocks.
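For completeness, a minimal Go client sketch that sends this subscription request (using github.com/gorilla/websocket; the ws://localhost:8545 endpoint is an assumption and should match wherever your rpcdaemon serves websocket connections):

```go
// Minimal newHeads subscriber sketch; not the exact client used in this report.
package main

import (
	"fmt"
	"log"

	"github.com/gorilla/websocket"
)

func main() {
	conn, _, err := websocket.DefaultDialer.Dial("ws://localhost:8545", nil)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// Same subscription request as shown above.
	req := `{"jsonrpc":"2.0","id":1,"method":"eth_subscribe","params":["newHeads"]}`
	if err := conn.WriteMessage(websocket.TextMessage, []byte(req)); err != nil {
		log.Fatal(err)
	}

	// Each new block should arrive as an eth_subscription notification;
	// in this issue, the notifications stop after a while.
	for {
		_, msg, err := conn.ReadMessage()
		if err != nil {
			log.Fatal(err)
		}
		fmt.Println(string(msg))
	}
}
```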
Actual behaviour
After some time, no new messages come from the WS.
In the rpcdaemon log I can find: [WARN] [10-03|21:31:53.506] error while notifying subscription err="write tcp 172.22.0.10:8545->xxxxxxxx:55508: write: connection timed out"
The RPC endpoint keeps working correctly. The WS is no longer responsive (I can connect to it, but no messages are received). It works again after restarting the container.
RPCNode:
RPCDaemon:
Steps to reproduce the behaviour
Use WS for some time with:
{"jsonrpc":"2.0","id": 1, "method": "eth_subscribe", "params": ["newHeads"]}
Backtrace
(Same issue on 09.01)