ACINQ / eclair

A scala implementation of the Lightning Network.
Apache License 2.0
1.24k stars 266 forks source link

Eclair can't keep up with Bitcoind #1789

Closed evd0kim closed 3 years ago

evd0kim commented 3 years ago

Eclair was constantly behind the tip in one of my setups.

I have 2 bitcoind full nodes. I can switch between them and Eclair runs VPS. Full node and Eclair node are in VPN network. One of these nodes is slow but it seems it can provide blocks via ZMQ reliable. Other is not and I currently have no idea why. I tried to use keep alive option for tcp sockets but it seems it didn't help or I did something wrong.

The point is, why couldn't Eclair actively request bitcoind node for block? Because I see such behaviour that ZMQ could probably silently dropped, Eclair finds out it is late on the tip and peers disconnect, then Eclair doesn't do anything, just waiting for blocks or restart to reconnect to ZMQ again.

evd0kim commented 3 years ago

I am really sory about the style of this issue. But it was hard to me to even debug it and I hope you understand.

So I tried to debug such setup

VPN ---- Eclair
    \---- Bitcoind

So I picked Core ZMQ Python script and installed python-zmq with its dependencies. Then I ran subscription python script on Eclair instance and it showed to me steady flow of events from ZMQ of bitcoin node. After that I noticed that Eclair finally started picking up tip blocks although I still see it about to be late 2-3 blocks behind the node. I have strong suggestion that I needed to instal ZMQ libs on Eclair instance but I also understand that Eclair code must be self-contaned and do not need any dependencies. Maybe it may be connected somehow to the fact that it needs to recieve and process blocks, so it will always 2-3 blocks behind the tip. But in about two days I didn't recieve blocks for a very long time and I got 4 force-closings probably as a result of trying to debug and fix this issue.

I see that it may be addressed some different way because Eclair anyway requests block explorers, so why it doesn't requests node for new block or network status when interblock time is about to expire.

evd0kim commented 3 years ago

While I was writing previous message Eclair was on the same height. Then it jumped from 682717 to 682722 in getinfo.

I attempted to measure python-zmq simple code vs Eclair.

Python recieved 682725 and 682726 headers with 20 sec delay (to block timestamp as it showed by Core) for 682726. While I'm writing this my Eclair is still on 682724.

evd0kim commented 3 years ago

An output of Python zmq script with timestamps when it receives headers of 25 and 26 blocks.

- RAW BLOCK HEADER (39) - 2021-05-09 09:41:37.844825
b'0000c020214f40f0d9e1b51b9eafa74b942f676d8688e221bc820d000000000000000000124edcb793b847eeee8a1b70aeb264b675ff95ab3dc721a75b4f092f1b7b5cd739ae976063a80d17040675d7'
- RAW BLOCK HEADER (40) - 2021-05-09 09:41:43.497660
b'00e0ff27db405edafde0264e08075407d02ab1c77a9b44adc1e20900000000000000000033f35a99befa81b5c435961f8896fb21834a02f992adfcf1639e86216951ad5241ae976063a80d17014a0ed9'

An output of getinfo after that

{
  "version": "0.5.2-SNAPSHOT-90fbcd3",
  "nodeId": "02cd1b7bc418fac2dc99f0ba350d60fa6c45fde5ab6017ee14df6425df485fb1dd",
  "alias": "SATM",
  "color": "#b00b69",
  "features": {
    "activated": {
      "gossip_queries": "optional",
      "payment_secret": "optional",
      "option_data_loss_protect": "optional",
      "option_static_remotekey": "optional",
      "basic_mpp": "optional",
      "gossip_queries_ex": "optional",
      "option_support_large_channel": "optional",
      "var_onion_optin": "optional"
    },
    "unknown": []
  },
  "chainHash": "6fe28c0ab6f1b372c1a6a246ae63f74f931e8365e15a089c68d6190000000000",
  "network": "mainnet",
  "blockHeight": 682724,
  "publicAddresses": [
    "134.209.228.207:80"
  ],
  "instanceId": "b5e26f83-98bc-4bb3-8b41-88c565322841"
}
t-bast commented 3 years ago

The point is, why couldn't Eclair actively request bitcoind node for block? Because I see such behaviour that ZMQ could probably silently dropped, Eclair finds out it is late on the tip and peers disconnect, then Eclair doesn't do anything, just waiting for blocks or restart to reconnect to ZMQ again.

ZMQ should just work. There is no reason to add additional code and maintain it, the focus should instead simply be on making sure ZMQ works properly.

I have strong suggestion that I needed to instal ZMQ libs on Eclair instance but I also understand that Eclair code must be self-contaned and do not need any dependencies.

That would be very surprising. As you say, eclair is self-contained and shouldn't need this (and this is the first time we encounter this kind of issue).

Have you looked at the eclair logs to understand the issue? There should be a hint there, we never had any reports concerning issues with ZMQ subscriptions.

evd0kim commented 3 years ago

@t-bast

Thanks for response. When my Eclair node had significant lag I looked into logs and havent found anything related to zmq or bitcoind. I could only restart it so it was reconnecting then and syncing again.

I have many logs to research because I'm very interested why I had 4 force-closed channels but I need time for that.

For ZMQ I just found an alternative way to test if it works. And I see that Python script receives block headers while Eclair is always lagging. Always, at least 4-5 blocks. Nowadays it at least tries to keep up. Sadly, I can't certainly indentify way I actually fixed it since it all was stressfull and in a hurry. Besides of installing dependencies for python I also found a way to make tcp sending keep alive packets, like this

sudo sysctl -w net.ipv4.tcp_keepalive_time=600

And I enabled this also on Eclair and Bitcoin Core as well.

t-bast commented 3 years ago

I suspect that this is just a matter of machine configuration and/or ZMQ configuration at the bitcoind level. Eclair doesn't do anything special, it's really vanilla ZMQ. If eclair doesn't receive events there, it's likely that there is an issue in the socket that transmits them from your bitcoind machine to the eclair one.

Your comparison with the python script isn't comparing the right thing: eclair subscribes to blocks, not just block headers. Maybe try subscribing to blocks on your python script and see if that reproes the problem?

evd0kim commented 3 years ago

Sure I can do that.

evd0kim commented 3 years ago

@t-bast It was rawblock. I was referring to block headers because script printed out block headers.

t-bast commented 3 years ago

The ZMQ logic is very simple in eclair: https://github.com/ACINQ/eclair/blob/master/eclair-core/src/main/scala/fr/acinq/eclair/blockchain/bitcoind/zmq/ZMQActor.scala

Can you try to activate debug logs (only for ZMQ to not spam your log file, see instructions here to activate logs for a single component) and see if there are log lines saying that blocks are received (which would indicate that the issue is not in ZMQ but then in eclair when the block is analyzed)?

evd0kim commented 3 years ago

Ok. I will do that.

t-bast commented 3 years ago

Thanks, this will let us know what the next debugging step will be.

If you see the log.debug("received blockid={}", block.blockId) log line then the issue is inside eclair when we're analyzing the block, otherwise the issue is in the transmission between bitcoind and eclair (and then we can check the bitcoind logs and focus on tweaking the connection parameters to figure out how to make it work).

evd0kim commented 3 years ago

What I have. Eclair restart moment

2021-05-10 15:09:25,974 INFO  fr.acinq.eclair.Setup - current blockchain height=682947

meanwhile watchdog reports

2021-05-10 15:11:32,086 INFO  akka.actor.ActorCell - blockstream.info: we are 0 blocks late
2021-05-10 15:11:32,460 INFO  akka.actor.ActorCell - mempool.space: we are 0 blocks late
2021-05-10 15:11:32,639 INFO  akka.actor.ActorCell - blockcypher.com: we are 0 blocks late
2021-05-10 15:11:33,063 INFO  akka.actor.ActorCell - bitcoinheaders.net: we are 0 blocks late

40 mins later it is still 682947 and node height is 682955, I didn't see any ZMQ message other than with txid.

60 mins later Eclair tip is 682957 and it received one block ID.

2021-05-10 16:10:09,366 DEBUG f.a.e.b.b.zmq.ZMQActor - received blockid=0000000000000000000bbdb910e21056bdba946d1684fcc16e584914cc3cde1a

This is id of 682956 and not 682957 block. Why it happens? I see that it recieves constant flow of rawtx messages and likely it can't keep up processing them all so it maybe a bottleneck for actor. Maybe it makes sense to assign receiving an processing blocks to another actor?

After this ZMQ message watchdog signalled that everything is fine.

2021-05-10 16:11:20,540 INFO  akka.actor.ActorCell - blockstream.info: we are 0 blocks late
2021-05-10 16:11:20,649 INFO  akka.actor.ActorCell - blockcypher.com: we are 0 blocks late
2021-05-10 16:11:21,121 INFO  akka.actor.ActorCell - mempool.space: we are 0 blocks late
2021-05-10 16:11:23,110 INFO  akka.actor.ActorCell - bitcoinheaders.net: we are 0 blocks late

I understand that you do not want to bloat code but simultaneous request to Node's Bitcoin Core looks to me very logical and maybe it needs to be done a bit more frequently.

akumaigorodski commented 3 years ago

I've just had a quick look at zmq stuff and noticed there are two instances of ZMQActor which do the same work here: https://github.com/ACINQ/eclair/blob/master/eclair-core/src/main/scala/fr/acinq/eclair/Setup.scala#L238-L239

Is this intentional?

t-bast commented 3 years ago

40 mins later it is still 682947 and node height is 682955, I didn't see any ZMQ message other than with txid.

Ok so the issue is not that eclair is overloaded, it's that it just doesn't receive the information. That's what needs fixing, requesting the information by another mechanism would just be a work-around (and if there's a bandwidth issue between the eclair machine and the bitcoin machine, adding more requests would probably make it just worse).

To be honest I don't know how to investigate that, I'll have to do some searching. There should be a way to monitor on your bitcoind machine at what rate data is sent to the ZMQ connection. Your network/machine setup may also be responsible for some throttling. Can you check CPU/memory/bandwidth metrics per-process and try to isolate where the machine seems overloaded?

Is there latency between your eclair and bitcoin machines? What's the exact network setup? Do note that you really should make sure these two machines communicate very easily as they must exchange a lot of data to operate correctly. It's usually even recommended to run eclair and bitcoin on the same machine if you care about performance.

Is this intentional?

Yes, this is the recommended way of using ZMQ, one for the tx subscription and one for the block subscription.

akumaigorodski commented 3 years ago

Yes, this is the recommended way of using ZMQ, one for the tx subscription and one for the block subscription.

But both of those instances subscribe to both rawblock and rawtx: https://github.com/ACINQ/eclair/blob/master/eclair-core/src/main/scala/fr/acinq/eclair/blockchain/bitcoind/zmq/ZMQActor.scala#L43-L44

I've checked that this is true, each event such as new tx gets printed out to console twice instead of once (as it should be?)

t-bast commented 3 years ago

But both of those instances subscribe to both rawblock and rawtx: https://github.com/ACINQ/eclair/blob/master/eclair-core/src/main/scala/fr/acinq/eclair/blockchain/bitcoind/zmq/ZMQActor.scala#L43-L44

Wow good catch, that is surely a bug indeed! And it means we consume twice the bandwidth necessary, which could be responsible for this issue in a separate machines environment :fearful:

I'll get that cleaned up ASAP.

evd0kim commented 3 years ago

Ok so the issue is not that eclair is overloaded, it's that it just doesn't receive the information.

With all respect, python code on the same host recieves blocks.

t-bast commented 3 years ago

With all respect, python code on the same host recieves blocks.

But that's very likely because it doesn't configure the ZMQ socket the same way under the hood... Do you know how exactly it connects to ZMQ? Does it use inproc like we do? That would be valuable information to troubleshoot.

And can you provide more details about the exact network setup? Without a repro this is quite hard to troubleshoot.

evd0kim commented 3 years ago

I will try to look into that as we go further but I can't promise that I will have enough time. But just top-level view is that it relies on system libraries and it maybe pretty efficient legacy code so yes, it can be faster.

evd0kim commented 3 years ago

Hey, @t-bast. My report after update:

Reference logs against Python:

TS is block timestamp, local time.

TS 2021-05-13 12:08:26 GMT +2
- RAW BLOCK (702) - 2021-05-13 10:09:09.610939 000000000000000000084577ef98ffee43884340617de2df48fb7e77c5bc7b29

2021-05-13 10:09:14,398 DEBUG f.a.e.b.b.zmq.ZMQActor - received blockid=000000000000000000084577ef98ffee43884340617de2df48fb7e77c5bc7b29

TS 2021-05-13 12:21:54 GMT +2

- RAW BLOCK (703) - 2021-05-13 10:22:28.553246 000000000000000000014e430aab800b09fa85c0f4fb237d16c686314da9f3c0

2021-05-13 10:22:40,045 DEBUG f.a.e.b.b.zmq.ZMQActor - received blockid=000000000000000000014e430aab800b09fa85c0f4fb237d16c686314da9f3c0

TS 2021-05-13 12:22:19 GMT +2
- RAW BLOCK (704) - 2021-05-13 10:22:48.211713 00000000000000000003f34e9732ffe5339046511002987a9ba224f5f5162ea5

2021-05-13 10:33:18,762 DEBUG f.a.e.b.b.zmq.ZMQActor - received blockid=00000000000000000003f34e9732ffe5339046511002987a9ba224f5f5162ea5

TS 2021-05-13 12:24:13 GMT +2
- RAW BLOCK (705) - 2021-05-13 10:24:38.525322 00000000000000000006b43ab544b1b55c72fea8ae35ec3df89cc1f508d57b73

2021-05-13 10:33:56,195 DEBUG f.a.e.b.b.zmq.ZMQActor - received blockid=00000000000000000006b43ab544b1b55c72fea8ae35ec3df89cc1f508d57b73

Not received by Eclair at the moment of report

- RAW BLOCK (706) - 2021-05-13 10:36:46.187801
0000000000000000000a19967f588a7e1b3bcd8e89acf7bc4a02f98af3f7826b

As you can see Eclair timestamps vary significantly. However it shouldn't be so critical anymore. Also before my node was very slow in sending payments. I suspect it was due to synchronization lag. I haven't tested payments yet.

evd0kim commented 3 years ago

By the way @t-bast. Could you please recommend minimal requirements for JVM and an example how to provide them into run-eclair.sh?

t-bast commented 3 years ago

Have you looked at performance metrics for your node (CPU, memory, etc)? Eclair exports a lot of these metrics, and you should have a look at system metrics as well. That would be a useful information to see where there are issues.

What kind of machine are you using for eclair? Are there other processes running on that machine?

You can pass arguments to the JVM using this syntax: https://github.com/ACINQ/eclair/blob/55a629f11de5fc10066016978eae8a417e427858/eclair-node/src/main/resources/eclair-node.sh#L326

We strongly recommend the OpenJDK 11: https://github.com/acinq/eclair#installing-eclair, can you check what you have on your machine?

evd0kim commented 3 years ago

I'm aware of this line in run-eclair, I just asked for examples.

Node runs on 5-dollar Debian instance on Digital Ocean. 1 CPU, 1 GB RAM, OpenJDK 11. Only postgres runs together with it. I'm controlling node performance independently from DO dashboard. On every new channel I have a screenshot for CPU/RAM/IO and network activity.

But it seems computational characteristics of the node VPS are pretty irrelevant. Eclair works with Bitcoin Core unreliably, here is why.

After measuring effect of recent updates the Core has disconnected from Eclair for one block. After that the connection was restored and checked by running Python ZMQ script. Eclair haven't restored connection of any type and it seems ZMQ actor just stopped working.

Eclair wasn't receiving blocks during 12 hours or so until I forcefully reset node. Watchdog was silent and I suppose the sole purpose of this actor is to protect against Eclipse attack... which is pretty bad because why not to do it independent and to add functionality like resetting ZMQ actor.

t-bast commented 3 years ago

I'm aware of this line in run-eclair, I just asked for examples.

I don't know what kind of example you want, this is very hand-wavy so what do you expect...? It simply depends on what you want to tweak based on what type of node you run and what traffic you expect :confused:

For example on our nodes we usually increase the memory allocated to the JVM if we have a big enough machine (e.g. -J-Xmx8G). It depends on your machine ¯ (ツ)

Node runs on 5-dollar Debian instance on Digital Ocean. 1 CPU, 1 GB RAM, OpenJDK 11. Only postgres runs together with it.

This is likely too small. Eclair is designed for aggressive parallelism (and for big nodes), so having a single CPU defeats the whole design (and it's even likely a vCPU). Only 1GB of RAM is probably a limiting factor as well, especially if you're running postgres on the same machine. You will likely have IO issues as well if postgres is on the same machine and its drive is slow (you didn't mention what type of drive is used).

To be honest, with such a small machine, you must investigate performance in more details. Have you setup monitoring? This would let you see exactly what actors are under load, how much time is spent in each actor's mailbox, etc. We can't really "blind debug" performance issue, that's why we have a lot of metrics to easily zoom in on problems.

Eclair wasn't receiving blocks during 12 hours or so until I forcefully reset node. Watchdog was silent and I suppose the sole purpose of this actor is to protect against Eclipse attack... which is pretty bad because why not to do it independent and to add functionality like resetting ZMQ actor.

You're focusing too much on the effect, not the cause. What we need to fix is the root cause, not add bandages on top of something that doesn't work. Again, we need more details about the exact bottlenecks to figure out if something needs fixing, or if your mailboxes are simply full because the machine isn't powerful enough. The fact that your python script receives the events seems to indicate that there may be no issues at the socket/network layer (unless it's configured differently, and maybe we can tweak how the socket is created in eclair) so the issue may be in the actor mailbox, which we'd know from the performance metrics.

evd0kim commented 3 years ago

@t-bast with all respect, you are wrong.

I understand that maybe I'm not welcomed here because of weak VPS but I have records.

So this is stats for 7 days of Eclair working with 2 channels. As you can see the node is working pretty fine.

eclair-stat-2-channels-7-days

Current RAM stat (also 2 channels)

             total        used        free      shared  buff/cache   available
Mem:        1013660      572644       96372       68152      344644      231036
Swap:       2097148       23040     2074108

But this is not an argument actually, because Eclair has multiple problems: BlockchainWatchdog doesn't do anything because it sleeps, and ZMQ actor can't realize that the connection dropped in the meantime.

Some people suggested to look at ZMQ documentation and I see that some common patterns aren't working in Eclair.

t-bast commented 3 years ago

@t-bast with all respect, you are wrong.

No worries, I'm happy to be proven wrong if that lets us fix issues and improve eclair. I do need data to back that up though as I cannot do any meaningful investigation without it (especially since you're the only node that reported this kind of issue in years).

Your perf metrics show that there is no system-wide obvious bottleneck, which is great. That allows us to make some progress on the investigation. What would be great now would be to have eclair metrics as suggested here, it would be a breeze to troubleshoot if we had that (in particular I'd like to see mailbox states to know if the ZMQ actor is overloaded or if it's just not receiving the events). We've never implemented the prometheus export (because no-one asked for it yet) but if you want to work on that it would be a good contribution, otherwise you can use Kamon to get something setup quickly.

Some people suggested to look at ZMQ documentation and I see that some common patterns aren't working in Eclair.

Can you be specific? Hand-wavy comments like that aren't helpful at all if we want to avoid wasting everyone's time.

evd0kim commented 3 years ago

Okay, I will try to enable Kamon. In my setup Core is true remote node. Even geographically. However, Eclair and Core work under Wireguard VPN which basically makes setup similar as if they were on the same powerful machine with one notable exception: network failures.

Can you be specific? Hand-wavy comments like that aren't helpful at all if we want to avoid wasting everyone's time.

Sorry, currently I can't. I was only using ZMQ in my projects for some time and they were Python prototypes.

evd0kim commented 3 years ago

@t-bast What particular metrics would you suggest to track in the context of this issue?

t-bast commented 3 years ago

Do you know if bitcoin core would log such ZMQ failures on its side (since it's the bitcoin core side that sends messages)? I'll check to see what logs we can have from ZMQ on the eclair side, but we'd probably need to create a custom branch to enable more logs.

The metrics I'm interested in are actors time in mailbox. If these mailboxes are full, it would point to a perf issue in eclair's treatment of ZMQ messages. If they're not, your idea of network drops that aren't correctly caught by our config of ZMQ would be most likely.

evd0kim commented 3 years ago

image

evd0kim commented 3 years ago

Do you know if bitcoin core would log such ZMQ failures on its side (since it's the bitcoin core side that sends messages)?

In my particular situation I think it is more like this.

Pub-sub: if the client dies (having gotten some data), the server doesn’t know about it.

So Core just cant receive new blocks because of disrupted network and it cant provide new blocks via ZMQ while it doesn't track if client receives anything.

t-bast commented 3 years ago

Thanks for the metrics. Indeed your mailboxes don't seem congested at all, so we can rule out the perf issue on the eclair side. Given your system metrics and eclair metrics, your theory of a "stuck" connection because of a network drop sounds much more likely, but I'm very surprised nobody encountered this before (or maybe nobody noticed it, which is even more worrying).

I'll try to simulate a flaky ZMQ connection to see if I can reproduce, I haven't used it much so I thought it would handle this more nicely out of the box, but we probably need to apply some configuration changes...meanwhile if you're able to figure out how your python prototype setups the connection and if it does something specific, it would be helpful.

t-bast commented 3 years ago

You're not seeing this log line, are you?

https://github.com/ACINQ/eclair/blob/898c17bc76c1af9077314958612ad96c019bd381/eclair-core/src/main/scala/fr/acinq/eclair/blockchain/bitcoind/zmq/ZMQActor.scala#L85

t-bast commented 3 years ago

Maybe what you're seeing is similar to https://github.com/zeromq/libzmq/issues/1199 It's a bit unclear from this issue what the best practice is to fix this though, still needs some digging.

It could also be linked to the watermark, as we leave default values: https://github.com/zeromq/jeromq/wiki It looks like this is going to be a nice rabbit hole :smile:

t-bast commented 3 years ago

Can you try adding this line:

subscriber.setTCPKeepAlive(1)

Below the following line: https://github.com/ACINQ/eclair/blob/898c17bc76c1af9077314958612ad96c019bd381/eclair-core/src/main/scala/fr/acinq/eclair/blockchain/bitcoind/zmq/ZMQActor.scala#L43

It could be a very simple fix if keep-alive is what's missing.

evd0kim commented 3 years ago

I found some strings in old logs

2021-05-09 05:21:47,939 WARN  f.a.e.b.b.zmq.ZMQActor - disconnected from tcp://192.168.2.7:29000
2021-05-09 05:21:48,119 WARN  f.a.e.b.b.zmq.ZMQActor - disconnected from tcp://192.168.2.7:29000
2021-05-09 09:57:23,827 WARN  f.a.e.b.b.zmq.ZMQActor - disconnected from tcp://192.168.2.7:29000
2021-05-09 09:57:23,955 WARN  f.a.e.b.b.zmq.ZMQActor - disconnected from tcp://192.168.2.7:29000

So yes, it detects disruption.

evd0kim commented 3 years ago

I will add this string and rebuild. Then report back.

Regarding ZMQ experience I can say for sure that on good hosting there is no problem with it usually. I had LND about 2 or 3 years ago on Google and It worked just fine.

evd0kim commented 3 years ago

So, now Eclair works fine. I've simulated disruption manually.

2021-05-15 11:19:21,926 INFO  f.a.e.b.b.zmq.ZMQActor - connected to tcp://192.168.2.7:29000
2021-05-15 11:19:21,936 INFO  f.a.e.b.b.zmq.ZMQActor - connected to tcp://192.168.2.7:29000
2021-05-15 15:15:49,546 WARN  f.a.e.b.b.zmq.ZMQActor - disconnected from tcp://192.168.2.7:29000
2021-05-15 15:16:05,865 WARN  f.a.e.b.b.zmq.ZMQActor - disconnected from tcp://192.168.2.7:29000
2021-05-15 15:27:33,925 INFO  f.a.e.b.b.zmq.ZMQActor - connected to tcp://192.168.2.7:29000
2021-05-15 15:27:41,978 INFO  f.a.e.b.b.zmq.ZMQActor - connected to tcp://192.168.2.7:29000

ZMQ restored connection and downloaded missed blocks.

There were 3 blocks. eclair-zmq-test

After receiving 3 blocks, Eclair then continued to work as it does (1 block). eclair-zmq-test-normalized

The disruption weren't caught by Watchdog. I see it is normal because ZMQ was restored 1 min before Watchdog started requesting explorers.

2021-05-15 14:54:39,571 INFO  akka.actor.ActorCell - blockcypher.com: we are 0 blocks late
2021-05-15 14:54:40,100 INFO  akka.actor.ActorCell - bitcoinheaders.net: we are 0 blocks late
2021-05-15 15:29:20,933 INFO  akka.actor.ActorCell - blockcypher.com: we are 0 blocks late
2021-05-15 15:29:21,452 INFO  akka.actor.ActorCell - mempool.space: we are 0 blocks late
2021-05-15 15:29:21,992 INFO  akka.actor.ActorCell - bitcoinheaders.net: we are 0 blocks late
2021-05-15 15:29:22,438 INFO  akka.actor.ActorCell - blockstream.info: we are 0 blocks late
2021-05-15 15:38:18,149 INFO  akka.actor.ActorCell - blockcypher.com: we are 0 blocks late
2021-05-15 15:38:18,151 INFO  akka.actor.ActorCell - blockstream.info: we are 0 blocks late
2021-05-15 15:38:19,132 INFO  akka.actor.ActorCell - mempool.space: we are 0 blocks late
2021-05-15 15:38:19,318 INFO  akka.actor.ActorCell - bitcoinheaders.net: we are 0 blocks late
t-bast commented 3 years ago

Great, so that means adding that keep-alive was the missing bit? I don't understand why it's not on by default to be honest, I really thought that would be included in the default behavior.

Let's keep it running for a few days to make sure it's the culprit, and if everything works fine I'll open a PR to add this, and I'll try to add some flaky connection tests as well.

t-bast commented 3 years ago

It looks like the latency/throughput isn't great between your eclair node and bitcoind node (8 seconds for the checkconfirmed calls seems a lot), you should have metrics that check the exact latency for RPC calls.

It's not critical though, if you don't have many channels it doesn't matter that much if there's latency there.

evd0kim commented 3 years ago

Well, there are two levels of improvement to me

1) eclair works better with subscription fix 2) eclair is reconnecting. Maybe I will test it additionally or occasionally next days

Latency varies indeed. But I see a lot of potential in using remote node wallet for onchain operations. In many situations it may be beneficial from security point of view. Maybe even combine it with HSM however, I won't be considering it for my needs in near future.