Closed evd0kim closed 3 years ago
I am really sory about the style of this issue. But it was hard to me to even debug it and I hope you understand.
So I tried to debug such setup
VPN ---- Eclair
\---- Bitcoind
So I picked Core ZMQ Python script and installed python-zmq with its dependencies. Then I ran subscription python script on Eclair instance and it showed to me steady flow of events from ZMQ of bitcoin node. After that I noticed that Eclair finally started picking up tip blocks although I still see it about to be late 2-3 blocks behind the node. I have strong suggestion that I needed to instal ZMQ libs on Eclair instance but I also understand that Eclair code must be self-contaned and do not need any dependencies. Maybe it may be connected somehow to the fact that it needs to recieve and process blocks, so it will always 2-3 blocks behind the tip. But in about two days I didn't recieve blocks for a very long time and I got 4 force-closings probably as a result of trying to debug and fix this issue.
I see that it may be addressed some different way because Eclair anyway requests block explorers, so why it doesn't requests node for new block or network status when interblock time is about to expire.
While I was writing previous message Eclair was on the same height. Then it jumped from 682717 to 682722 in getinfo.
I attempted to measure python-zmq simple code vs Eclair.
Python recieved 682725 and 682726 headers with 20 sec delay (to block timestamp as it showed by Core) for 682726. While I'm writing this my Eclair is still on 682724.
An output of Python zmq script with timestamps when it receives headers of 25 and 26 blocks.
- RAW BLOCK HEADER (39) - 2021-05-09 09:41:37.844825
b'0000c020214f40f0d9e1b51b9eafa74b942f676d8688e221bc820d000000000000000000124edcb793b847eeee8a1b70aeb264b675ff95ab3dc721a75b4f092f1b7b5cd739ae976063a80d17040675d7'
- RAW BLOCK HEADER (40) - 2021-05-09 09:41:43.497660
b'00e0ff27db405edafde0264e08075407d02ab1c77a9b44adc1e20900000000000000000033f35a99befa81b5c435961f8896fb21834a02f992adfcf1639e86216951ad5241ae976063a80d17014a0ed9'
An output of getinfo
after that
{
"version": "0.5.2-SNAPSHOT-90fbcd3",
"nodeId": "02cd1b7bc418fac2dc99f0ba350d60fa6c45fde5ab6017ee14df6425df485fb1dd",
"alias": "SATM",
"color": "#b00b69",
"features": {
"activated": {
"gossip_queries": "optional",
"payment_secret": "optional",
"option_data_loss_protect": "optional",
"option_static_remotekey": "optional",
"basic_mpp": "optional",
"gossip_queries_ex": "optional",
"option_support_large_channel": "optional",
"var_onion_optin": "optional"
},
"unknown": []
},
"chainHash": "6fe28c0ab6f1b372c1a6a246ae63f74f931e8365e15a089c68d6190000000000",
"network": "mainnet",
"blockHeight": 682724,
"publicAddresses": [
"134.209.228.207:80"
],
"instanceId": "b5e26f83-98bc-4bb3-8b41-88c565322841"
}
The point is, why couldn't Eclair actively request bitcoind node for block? Because I see such behaviour that ZMQ could probably silently dropped, Eclair finds out it is late on the tip and peers disconnect, then Eclair doesn't do anything, just waiting for blocks or restart to reconnect to ZMQ again.
ZMQ should just work. There is no reason to add additional code and maintain it, the focus should instead simply be on making sure ZMQ works properly.
I have strong suggestion that I needed to instal ZMQ libs on Eclair instance but I also understand that Eclair code must be self-contaned and do not need any dependencies.
That would be very surprising. As you say, eclair is self-contained and shouldn't need this (and this is the first time we encounter this kind of issue).
Have you looked at the eclair logs to understand the issue? There should be a hint there, we never had any reports concerning issues with ZMQ subscriptions.
@t-bast
Thanks for response. When my Eclair node had significant lag I looked into logs and havent found anything related to zmq or bitcoind. I could only restart it so it was reconnecting then and syncing again.
I have many logs to research because I'm very interested why I had 4 force-closed channels but I need time for that.
For ZMQ I just found an alternative way to test if it works. And I see that Python script receives block headers while Eclair is always lagging. Always, at least 4-5 blocks. Nowadays it at least tries to keep up. Sadly, I can't certainly indentify way I actually fixed it since it all was stressfull and in a hurry. Besides of installing dependencies for python I also found a way to make tcp sending keep alive packets, like this
sudo sysctl -w net.ipv4.tcp_keepalive_time=600
And I enabled this also on Eclair and Bitcoin Core as well.
I suspect that this is just a matter of machine configuration and/or ZMQ configuration at the bitcoind
level.
Eclair doesn't do anything special, it's really vanilla ZMQ. If eclair doesn't receive events there, it's likely that there is an issue in the socket that transmits them from your bitcoind
machine to the eclair one.
Your comparison with the python script isn't comparing the right thing: eclair subscribes to blocks, not just block headers. Maybe try subscribing to blocks on your python script and see if that reproes the problem?
Sure I can do that.
@t-bast
It was rawblock
. I was referring to block headers because script printed out block headers.
The ZMQ logic is very simple in eclair: https://github.com/ACINQ/eclair/blob/master/eclair-core/src/main/scala/fr/acinq/eclair/blockchain/bitcoind/zmq/ZMQActor.scala
Can you try to activate debug logs (only for ZMQ to not spam your log file, see instructions here to activate logs for a single component) and see if there are log lines saying that blocks are received (which would indicate that the issue is not in ZMQ but then in eclair when the block is analyzed)?
Ok. I will do that.
Thanks, this will let us know what the next debugging step will be.
If you see the log.debug("received blockid={}", block.blockId)
log line then the issue is inside eclair when we're analyzing the block, otherwise the issue is in the transmission between bitcoind
and eclair
(and then we can check the bitcoind
logs and focus on tweaking the connection parameters to figure out how to make it work).
What I have. Eclair restart moment
2021-05-10 15:09:25,974 INFO fr.acinq.eclair.Setup - current blockchain height=682947
meanwhile watchdog reports
2021-05-10 15:11:32,086 INFO akka.actor.ActorCell - blockstream.info: we are 0 blocks late
2021-05-10 15:11:32,460 INFO akka.actor.ActorCell - mempool.space: we are 0 blocks late
2021-05-10 15:11:32,639 INFO akka.actor.ActorCell - blockcypher.com: we are 0 blocks late
2021-05-10 15:11:33,063 INFO akka.actor.ActorCell - bitcoinheaders.net: we are 0 blocks late
40 mins later it is still 682947 and node height is 682955, I didn't see any ZMQ message other than with txid.
60 mins later Eclair tip is 682957 and it received one block ID.
2021-05-10 16:10:09,366 DEBUG f.a.e.b.b.zmq.ZMQActor - received blockid=0000000000000000000bbdb910e21056bdba946d1684fcc16e584914cc3cde1a
This is id of 682956 and not 682957 block. Why it happens? I see that it recieves constant flow of rawtx messages and likely it can't keep up processing them all so it maybe a bottleneck for actor. Maybe it makes sense to assign receiving an processing blocks to another actor?
After this ZMQ message watchdog signalled that everything is fine.
2021-05-10 16:11:20,540 INFO akka.actor.ActorCell - blockstream.info: we are 0 blocks late
2021-05-10 16:11:20,649 INFO akka.actor.ActorCell - blockcypher.com: we are 0 blocks late
2021-05-10 16:11:21,121 INFO akka.actor.ActorCell - mempool.space: we are 0 blocks late
2021-05-10 16:11:23,110 INFO akka.actor.ActorCell - bitcoinheaders.net: we are 0 blocks late
I understand that you do not want to bloat code but simultaneous request to Node's Bitcoin Core looks to me very logical and maybe it needs to be done a bit more frequently.
I've just had a quick look at zmq stuff and noticed there are two instances of ZMQActor
which do the same work here: https://github.com/ACINQ/eclair/blob/master/eclair-core/src/main/scala/fr/acinq/eclair/Setup.scala#L238-L239
Is this intentional?
40 mins later it is still 682947 and node height is 682955, I didn't see any ZMQ message other than with txid.
Ok so the issue is not that eclair is overloaded, it's that it just doesn't receive the information. That's what needs fixing, requesting the information by another mechanism would just be a work-around (and if there's a bandwidth issue between the eclair machine and the bitcoin machine, adding more requests would probably make it just worse).
To be honest I don't know how to investigate that, I'll have to do some searching. There should be a way to monitor on your bitcoind machine at what rate data is sent to the ZMQ connection. Your network/machine setup may also be responsible for some throttling. Can you check CPU/memory/bandwidth metrics per-process and try to isolate where the machine seems overloaded?
Is there latency between your eclair and bitcoin machines? What's the exact network setup? Do note that you really should make sure these two machines communicate very easily as they must exchange a lot of data to operate correctly. It's usually even recommended to run eclair and bitcoin on the same machine if you care about performance.
Is this intentional?
Yes, this is the recommended way of using ZMQ, one for the tx subscription and one for the block subscription.
Yes, this is the recommended way of using ZMQ, one for the tx subscription and one for the block subscription.
But both of those instances subscribe to both rawblock
and rawtx
: https://github.com/ACINQ/eclair/blob/master/eclair-core/src/main/scala/fr/acinq/eclair/blockchain/bitcoind/zmq/ZMQActor.scala#L43-L44
I've checked that this is true, each event such as new tx gets printed out to console twice instead of once (as it should be?)
But both of those instances subscribe to both rawblock and rawtx: https://github.com/ACINQ/eclair/blob/master/eclair-core/src/main/scala/fr/acinq/eclair/blockchain/bitcoind/zmq/ZMQActor.scala#L43-L44
Wow good catch, that is surely a bug indeed! And it means we consume twice the bandwidth necessary, which could be responsible for this issue in a separate machines environment :fearful:
I'll get that cleaned up ASAP.
Ok so the issue is not that eclair is overloaded, it's that it just doesn't receive the information.
With all respect, python code on the same host recieves blocks.
With all respect, python code on the same host recieves blocks.
But that's very likely because it doesn't configure the ZMQ socket the same way under the hood...
Do you know how exactly it connects to ZMQ? Does it use inproc
like we do?
That would be valuable information to troubleshoot.
And can you provide more details about the exact network setup? Without a repro this is quite hard to troubleshoot.
I will try to look into that as we go further but I can't promise that I will have enough time. But just top-level view is that it relies on system libraries and it maybe pretty efficient legacy code so yes, it can be faster.
Hey, @t-bast. My report after update:
Reference logs against Python:
TS is block timestamp, local time.
TS 2021-05-13 12:08:26 GMT +2
- RAW BLOCK (702) - 2021-05-13 10:09:09.610939 000000000000000000084577ef98ffee43884340617de2df48fb7e77c5bc7b29
2021-05-13 10:09:14,398 DEBUG f.a.e.b.b.zmq.ZMQActor - received blockid=000000000000000000084577ef98ffee43884340617de2df48fb7e77c5bc7b29
TS 2021-05-13 12:21:54 GMT +2
- RAW BLOCK (703) - 2021-05-13 10:22:28.553246 000000000000000000014e430aab800b09fa85c0f4fb237d16c686314da9f3c0
2021-05-13 10:22:40,045 DEBUG f.a.e.b.b.zmq.ZMQActor - received blockid=000000000000000000014e430aab800b09fa85c0f4fb237d16c686314da9f3c0
TS 2021-05-13 12:22:19 GMT +2
- RAW BLOCK (704) - 2021-05-13 10:22:48.211713 00000000000000000003f34e9732ffe5339046511002987a9ba224f5f5162ea5
2021-05-13 10:33:18,762 DEBUG f.a.e.b.b.zmq.ZMQActor - received blockid=00000000000000000003f34e9732ffe5339046511002987a9ba224f5f5162ea5
TS 2021-05-13 12:24:13 GMT +2
- RAW BLOCK (705) - 2021-05-13 10:24:38.525322 00000000000000000006b43ab544b1b55c72fea8ae35ec3df89cc1f508d57b73
2021-05-13 10:33:56,195 DEBUG f.a.e.b.b.zmq.ZMQActor - received blockid=00000000000000000006b43ab544b1b55c72fea8ae35ec3df89cc1f508d57b73
Not received by Eclair at the moment of report
- RAW BLOCK (706) - 2021-05-13 10:36:46.187801
0000000000000000000a19967f588a7e1b3bcd8e89acf7bc4a02f98af3f7826b
As you can see Eclair timestamps vary significantly. However it shouldn't be so critical anymore. Also before my node was very slow in sending payments. I suspect it was due to synchronization lag. I haven't tested payments yet.
By the way @t-bast. Could you please recommend minimal requirements for JVM and an example how to provide them into run-eclair.sh
?
Have you looked at performance metrics for your node (CPU, memory, etc)? Eclair exports a lot of these metrics, and you should have a look at system metrics as well. That would be a useful information to see where there are issues.
What kind of machine are you using for eclair? Are there other processes running on that machine?
You can pass arguments to the JVM using this syntax: https://github.com/ACINQ/eclair/blob/55a629f11de5fc10066016978eae8a417e427858/eclair-node/src/main/resources/eclair-node.sh#L326
We strongly recommend the OpenJDK 11: https://github.com/acinq/eclair#installing-eclair, can you check what you have on your machine?
I'm aware of this line in run-eclair, I just asked for examples.
Node runs on 5-dollar Debian instance on Digital Ocean. 1 CPU, 1 GB RAM, OpenJDK 11. Only postgres runs together with it. I'm controlling node performance independently from DO dashboard. On every new channel I have a screenshot for CPU/RAM/IO and network activity.
But it seems computational characteristics of the node VPS are pretty irrelevant. Eclair works with Bitcoin Core unreliably, here is why.
After measuring effect of recent updates the Core has disconnected from Eclair for one block. After that the connection was restored and checked by running Python ZMQ script. Eclair haven't restored connection of any type and it seems ZMQ actor just stopped working.
Eclair wasn't receiving blocks during 12 hours or so until I forcefully reset node. Watchdog was silent and I suppose the sole purpose of this actor is to protect against Eclipse attack... which is pretty bad because why not to do it independent and to add functionality like resetting ZMQ actor.
I'm aware of this line in run-eclair, I just asked for examples.
I don't know what kind of example you want, this is very hand-wavy so what do you expect...? It simply depends on what you want to tweak based on what type of node you run and what traffic you expect :confused:
For example on our nodes we usually increase the memory allocated to the JVM if we have a big enough machine (e.g. -J-Xmx8G
). It depends on your machine ¯ (ツ)/¯
Node runs on 5-dollar Debian instance on Digital Ocean. 1 CPU, 1 GB RAM, OpenJDK 11. Only postgres runs together with it.
This is likely too small. Eclair is designed for aggressive parallelism (and for big nodes), so having a single CPU defeats the whole design (and it's even likely a vCPU). Only 1GB of RAM is probably a limiting factor as well, especially if you're running postgres on the same machine. You will likely have IO issues as well if postgres is on the same machine and its drive is slow (you didn't mention what type of drive is used).
To be honest, with such a small machine, you must investigate performance in more details. Have you setup monitoring? This would let you see exactly what actors are under load, how much time is spent in each actor's mailbox, etc. We can't really "blind debug" performance issue, that's why we have a lot of metrics to easily zoom in on problems.
Eclair wasn't receiving blocks during 12 hours or so until I forcefully reset node. Watchdog was silent and I suppose the sole purpose of this actor is to protect against Eclipse attack... which is pretty bad because why not to do it independent and to add functionality like resetting ZMQ actor.
You're focusing too much on the effect, not the cause. What we need to fix is the root cause, not add bandages on top of something that doesn't work. Again, we need more details about the exact bottlenecks to figure out if something needs fixing, or if your mailboxes are simply full because the machine isn't powerful enough. The fact that your python script receives the events seems to indicate that there may be no issues at the socket/network layer (unless it's configured differently, and maybe we can tweak how the socket is created in eclair) so the issue may be in the actor mailbox, which we'd know from the performance metrics.
@t-bast with all respect, you are wrong.
I understand that maybe I'm not welcomed here because of weak VPS but I have records.
So this is stats for 7 days of Eclair working with 2 channels. As you can see the node is working pretty fine.
Current RAM stat (also 2 channels)
total used free shared buff/cache available
Mem: 1013660 572644 96372 68152 344644 231036
Swap: 2097148 23040 2074108
But this is not an argument actually, because Eclair has multiple problems: BlockchainWatchdog doesn't do anything because it sleeps, and ZMQ actor can't realize that the connection dropped in the meantime.
Some people suggested to look at ZMQ documentation and I see that some common patterns aren't working in Eclair.
@t-bast with all respect, you are wrong.
No worries, I'm happy to be proven wrong if that lets us fix issues and improve eclair. I do need data to back that up though as I cannot do any meaningful investigation without it (especially since you're the only node that reported this kind of issue in years).
Your perf metrics show that there is no system-wide obvious bottleneck, which is great. That allows us to make some progress on the investigation. What would be great now would be to have eclair metrics as suggested here, it would be a breeze to troubleshoot if we had that (in particular I'd like to see mailbox states to know if the ZMQ actor is overloaded or if it's just not receiving the events). We've never implemented the prometheus export (because no-one asked for it yet) but if you want to work on that it would be a good contribution, otherwise you can use Kamon to get something setup quickly.
Some people suggested to look at ZMQ documentation and I see that some common patterns aren't working in Eclair.
Can you be specific? Hand-wavy comments like that aren't helpful at all if we want to avoid wasting everyone's time.
Okay, I will try to enable Kamon. In my setup Core is true remote node. Even geographically. However, Eclair and Core work under Wireguard VPN which basically makes setup similar as if they were on the same powerful machine with one notable exception: network failures.
Can you be specific? Hand-wavy comments like that aren't helpful at all if we want to avoid wasting everyone's time.
Sorry, currently I can't. I was only using ZMQ in my projects for some time and they were Python prototypes.
@t-bast What particular metrics would you suggest to track in the context of this issue?
Do you know if bitcoin core would log such ZMQ failures on its side (since it's the bitcoin core side that sends messages)? I'll check to see what logs we can have from ZMQ on the eclair side, but we'd probably need to create a custom branch to enable more logs.
The metrics I'm interested in are actors time in mailbox. If these mailboxes are full, it would point to a perf issue in eclair's treatment of ZMQ messages. If they're not, your idea of network drops that aren't correctly caught by our config of ZMQ would be most likely.
Do you know if bitcoin core would log such ZMQ failures on its side (since it's the bitcoin core side that sends messages)?
In my particular situation I think it is more like this.
Pub-sub: if the client dies (having gotten some data), the server doesn’t know about it.
So Core just cant receive new blocks because of disrupted network and it cant provide new blocks via ZMQ while it doesn't track if client receives anything.
Thanks for the metrics. Indeed your mailboxes don't seem congested at all, so we can rule out the perf issue on the eclair side. Given your system metrics and eclair metrics, your theory of a "stuck" connection because of a network drop sounds much more likely, but I'm very surprised nobody encountered this before (or maybe nobody noticed it, which is even more worrying).
I'll try to simulate a flaky ZMQ connection to see if I can reproduce, I haven't used it much so I thought it would handle this more nicely out of the box, but we probably need to apply some configuration changes...meanwhile if you're able to figure out how your python prototype setups the connection and if it does something specific, it would be helpful.
You're not seeing this log line, are you?
Maybe what you're seeing is similar to https://github.com/zeromq/libzmq/issues/1199 It's a bit unclear from this issue what the best practice is to fix this though, still needs some digging.
It could also be linked to the watermark, as we leave default values: https://github.com/zeromq/jeromq/wiki It looks like this is going to be a nice rabbit hole :smile:
Can you try adding this line:
subscriber.setTCPKeepAlive(1)
Below the following line: https://github.com/ACINQ/eclair/blob/898c17bc76c1af9077314958612ad96c019bd381/eclair-core/src/main/scala/fr/acinq/eclair/blockchain/bitcoind/zmq/ZMQActor.scala#L43
It could be a very simple fix if keep-alive is what's missing.
I found some strings in old logs
2021-05-09 05:21:47,939 WARN f.a.e.b.b.zmq.ZMQActor - disconnected from tcp://192.168.2.7:29000
2021-05-09 05:21:48,119 WARN f.a.e.b.b.zmq.ZMQActor - disconnected from tcp://192.168.2.7:29000
2021-05-09 09:57:23,827 WARN f.a.e.b.b.zmq.ZMQActor - disconnected from tcp://192.168.2.7:29000
2021-05-09 09:57:23,955 WARN f.a.e.b.b.zmq.ZMQActor - disconnected from tcp://192.168.2.7:29000
So yes, it detects disruption.
I will add this string and rebuild. Then report back.
Regarding ZMQ experience I can say for sure that on good hosting there is no problem with it usually. I had LND about 2 or 3 years ago on Google and It worked just fine.
So, now Eclair works fine. I've simulated disruption manually.
2021-05-15 11:19:21,926 INFO f.a.e.b.b.zmq.ZMQActor - connected to tcp://192.168.2.7:29000
2021-05-15 11:19:21,936 INFO f.a.e.b.b.zmq.ZMQActor - connected to tcp://192.168.2.7:29000
2021-05-15 15:15:49,546 WARN f.a.e.b.b.zmq.ZMQActor - disconnected from tcp://192.168.2.7:29000
2021-05-15 15:16:05,865 WARN f.a.e.b.b.zmq.ZMQActor - disconnected from tcp://192.168.2.7:29000
2021-05-15 15:27:33,925 INFO f.a.e.b.b.zmq.ZMQActor - connected to tcp://192.168.2.7:29000
2021-05-15 15:27:41,978 INFO f.a.e.b.b.zmq.ZMQActor - connected to tcp://192.168.2.7:29000
ZMQ restored connection and downloaded missed blocks.
There were 3 blocks.
After receiving 3 blocks, Eclair then continued to work as it does (1 block).
The disruption weren't caught by Watchdog. I see it is normal because ZMQ was restored 1 min before Watchdog started requesting explorers.
2021-05-15 14:54:39,571 INFO akka.actor.ActorCell - blockcypher.com: we are 0 blocks late
2021-05-15 14:54:40,100 INFO akka.actor.ActorCell - bitcoinheaders.net: we are 0 blocks late
2021-05-15 15:29:20,933 INFO akka.actor.ActorCell - blockcypher.com: we are 0 blocks late
2021-05-15 15:29:21,452 INFO akka.actor.ActorCell - mempool.space: we are 0 blocks late
2021-05-15 15:29:21,992 INFO akka.actor.ActorCell - bitcoinheaders.net: we are 0 blocks late
2021-05-15 15:29:22,438 INFO akka.actor.ActorCell - blockstream.info: we are 0 blocks late
2021-05-15 15:38:18,149 INFO akka.actor.ActorCell - blockcypher.com: we are 0 blocks late
2021-05-15 15:38:18,151 INFO akka.actor.ActorCell - blockstream.info: we are 0 blocks late
2021-05-15 15:38:19,132 INFO akka.actor.ActorCell - mempool.space: we are 0 blocks late
2021-05-15 15:38:19,318 INFO akka.actor.ActorCell - bitcoinheaders.net: we are 0 blocks late
Great, so that means adding that keep-alive was the missing bit? I don't understand why it's not on by default to be honest, I really thought that would be included in the default behavior.
Let's keep it running for a few days to make sure it's the culprit, and if everything works fine I'll open a PR to add this, and I'll try to add some flaky connection tests as well.
It looks like the latency/throughput isn't great between your eclair node and bitcoind node (8 seconds for the checkconfirmed calls seems a lot), you should have metrics that check the exact latency for RPC calls.
It's not critical though, if you don't have many channels it doesn't matter that much if there's latency there.
Well, there are two levels of improvement to me
1) eclair works better with subscription fix 2) eclair is reconnecting. Maybe I will test it additionally or occasionally next days
Latency varies indeed. But I see a lot of potential in using remote node wallet for onchain operations. In many situations it may be beneficial from security point of view. Maybe even combine it with HSM however, I won't be considering it for my needs in near future.
Eclair was constantly behind the tip in one of my setups.
I have 2 bitcoind full nodes. I can switch between them and Eclair runs VPS. Full node and Eclair node are in VPN network. One of these nodes is slow but it seems it can provide blocks via ZMQ reliable. Other is not and I currently have no idea why. I tried to use keep alive option for tcp sockets but it seems it didn't help or I did something wrong.
The point is, why couldn't Eclair actively request bitcoind node for block? Because I see such behaviour that ZMQ could probably silently dropped, Eclair finds out it is late on the tip and peers disconnect, then Eclair doesn't do anything, just waiting for blocks or restart to reconnect to ZMQ again.