ElementsProject / lightning

Core Lightning — Lightning Network implementation focusing on spec compliance and performance
Other
2.84k stars 901 forks source link

Inconsistent/incorrect channel IDs - can't receive or route any payments #4386

Closed 3nprob closed 3 years ago

3nprob commented 3 years ago

Issue and Steps to Reproduce

Context/background

This is following on after #4371 stopped crashing clightning.

After this, I attempted to create som invoices to receive some payments from another node. A couple smaller ones went through, but after this I am now unable to route (32 failed ones), send or receive any payments.

I see the following in logs (appearing over and over):

INFO    plugin-watchtower.py: There is a subscription issue with REDACTED. Adding appointment to pending
INFO    plugin-watchtower.py: Adding tower to Tower's db (id=REDACTED)
INFO    plugin-watchtower.py: There is a subscription issue with REDACTED. Adding appointment to pending
INFO    plugin-watchtower.py: Adding tower to Tower's db (id=REDACTED)
INFO    plugin-watchtower.py: There is a subscription issue with REDACTED. Adding appointment to pending
INFO    plugin-watchtower.py: Adding tower to Tower's db (id=REDACTED)
UNUSUAL REACTED-channeld-chan#105: Adding HTLC too slow: killing connection
INFO    REACTED-chan#105: Peer transient failure in CHANNELD_NORMAL: channeld: Owning subdaemon channeld died (9)
INFO    plugin-watchtower.py: There is a subscription issue with REDACTED. Adding appointment to pending
INFO    plugin-watchtower.py: Adding tower to Tower's db (id=REDACTED)
INFO    plugin-watchtower.py: There is a subscription issue with REDACTED. Adding appointment to pending
INFO    plugin-watchtower.py: Adding tower to Tower's db (id=REDACTED)
INFO    plugin-watchtower.py: There is a subscription issue with REDACTED. Adding appointment to pending
INFO    plugin-watchtower.py: Adding tower to Tower's db (id=REDACTED)
INFO    plugin-watchtower.py: There is a subscription issue with REDACTED. Adding appointment to pending
INFO    plugin-watchtower.py: Adding tower to Tower's db (id=REDACTED)
UNUSUAL REACTED-channeld-chan#105: Adding HTLC too slow: killing connection

Watchtower (python-teos plugin) looks inactive but healthy:

15/02/2021 01:47:12 [Responder] New block received  (block_hash=0000000000000000000792b590d77c81a30a414aaa9183c2ac1ac4859e3d3c32, prev_block_hash=000000000000000000066a5e112dc95694c4a7a783d2e09d9f12dd77abb491f6)
15/02/2021 01:54:12 [ChainMonitor] New block received via polling  (block_hash=0000000000000000000a09af0e7841b2168ad1b2c1286753d0f8dd02fd868e81)
15/02/2021 01:54:12 [Responder] New block received  (block_hash=0000000000000000000a09af0e7841b2168ad1b2c1286753d0f8dd02fd868e81, prev_block_hash=0000000000000000000792b590d77c81a30a414aaa9183c2ac1ac4859e3d3c32)
15/02/2021 01:54:12 [Watcher] New block received  (block_hash=0000000000000000000a09af0e7841b2168ad1b2c1286753d0f8dd02fd868e81, prev_block_hash=0000000000000000000792b590d77c81a30a414aaa9183c2ac1ac4859e3d3c32)
15/02/2021 01:54:12 [Watcher] No breaches found
[...]

Now, I note that despite that I have several connected and remote-funded public channels when I create an invoice on the node, I only get a single route hint:

$ BOLT=$(lightning-cli invoice -k msatoshi=any -k label='test'  -k description='test')
$ lightning-cli decodepay $BOLT
{
  [...]
   "routes": [
      [
         {
            "pubkey": "REDACTED",
            "short_channel_id": "REDACTED",
            "fee_base_msat": 1000,
            "fee_proportional_millionths": 1,
            "cltv_expiry_delta": 40
         }
      ]
   ]
}

Even when explicitly specifying the channels I want to use, it only includes this single one:

$ BOLT=$(lightning-cli invoice -k msatoshi=any -k label='test'  -k description='test' -k exposeprivatechannels='["...","...","...", "..."]')
$ lightning-cli decodepay $BOLT
{
  [...]
   "routes": [
      [
         {
            "pubkey": "REDACTED",
            "short_channel_id": "REDACTED",
            "fee_base_msat": 1000,
            "fee_proportional_millionths": 1,
            "cltv_expiry_delta": 40
         }
      ]
   ]
}

Now, the really odd thing here: When I decodepayreq this on an lnd node, I get a different channel ID than what the clightning node has for it. Looking at 1ml.com, it corresponds to the same funding transaction id/utxo and the same peers, but the channel ID is different.

I did have a different channel with this peer in the past, which was closed after which this one was opened. Unfortunately I did not save the ID of that one so I can't say for sure if the ID here corresponds to that one. In either case, clightning seems to ignore all my other channels and has a different idea vs the rest of the world on this channel's ID, despite the UTXO and node ids matching up.

cdecker commented 3 years ago

The old channel should not interfere. I would not worry too much about short-channel-ids not matching across implementations is probably down to us using the 123x456x789 format, whereas lnd uses the integer encoding.

This is the only line that is concerning to me, and would explain the issue:

UNUSUAL REACTED-channeld-chan#105: Adding HTLC too slow: killing connection

It could be that the watchtower plugin is taking a long time to report successful storage of the penalty transaction. Since we start the commitment timer before calling out to the watchtower, it might have timed out before the watchtower is able to respond, and we'd blame the peer for being slow when really it is the watchtower.

You don't have any timestamps on your logs, could you confirm that between calling pay and the error above there is a delay? And could you test without the watchtower for a bit?

3nprob commented 3 years ago

Hmm, actually, no, there is no delay at all - it's instantaneous.

This node has 8 publiched channels to different nodes (4 of which with sufficient remote balance) - shouldn't these be included as hints in the invoice? It remains a single one, the same even if I try to manually include multiple as exposeprivatechannels

3nprob commented 3 years ago

My last attempt here (manually specifying channels for hints) seems like it should be covered by this test case, but I just get back a single element, no wanrings or errors. I'll see if checking the result in the test works or not.

https://github.com/ElementsProject/lightning/pull/3351/files#diff-76c861a3703af36e455001e9724b26eee24263260747f9abb016af4363180873R293

cdecker commented 3 years ago

This node has 8 publiched channels to different nodes (4 of which with sufficient remote balance) - shouldn't these be included as hints in the invoice? It remains a single one, the same even if I try to manually include multiple as exposeprivatechannels

We will include a single routehint in what is sometimes called a routeboost hint: the channel hinted at is public so we wouldn't need to include them at all, but we do so to tell the payer that at the time of the invoice creation (hopefully close to the payment) that particular channel had sufficient balance. The sender can then skew towards selecting that channel with a higher chance. It's really just an optmization, nothing that has to be there.

My last attempt here (manually specifying channels for hints) seems like it should be covered by this test case, but I just get back a single element, no wanrings or errors. I'll see if checking the result in the test works or not.

The parameter acts as a list of potential channels that lightningd may chose to expose, and we then adhere to the above logic of including the preferred one. It does not change the number of routehints included. The rationale is that routehints are rather expensive when encoded into a QR code, and the more data we add to a QR code the harder it is to scan (error correction bits go down), hence we try to strike a balance between data in the QR code and hints to the sender.

Hmm, actually, no, there is no delay at all - it's instantaneous.

In this case it might still be the remote endpoint that is simply too slow (tor, flaky connection, etc) to process our HTLC request in time. Did you try without the watchtower by any chance? In addition you can also lightning-cli disconnect [node_id] true to force a disconnect and see if other channels work better.

3nprob commented 3 years ago

Hm, the watchtower is I think the only thing I haven';t checked into properly yet - I will dig a bit into that and see what gives!

(Watchtowers on both nodes are local to either node with plenty of resource-headroom for what that's wort, which is why i didn't dig deeper yet)

3nprob commented 3 years ago

The Watchtower warnings went away after setting env var GRPC_DNS_RESOLVER=native, so it seems like it was a case of https://github.com/talaia-labs/python-teos/issues/246

Thanks for pointing me in the right direction!

Could be resolved now, but keeping the issue open until I see it working

cdecker commented 3 years ago

Thanks for the update, please keep us posted :+1:

cdecker commented 3 years ago

Closing to reduce clutter in the issue list. Please feel free to reopen if this reappears :+1:

sr-gi commented 3 years ago

Sorry I missed this linked issue completely.

@3nprob was the issue solved after that?

@cdecker sorry this affected you guys.

cdecker commented 3 years ago

No problem @sr-gi, just glad we figured it out in the end 😉