ACINQ / eclair

A scala implementation of the Lightning Network.
Apache License 2.0
1.24k stars 266 forks source link

Unable to decode onion error message from eclair to lnd #363

Closed Roasbeef closed 6 years ago

Roasbeef commented 6 years ago

Lately, when I've been trying to send to/through starblocks on some of my nodes, my HTLC's keep getting rejected. I haven't been able to determine why exactly, as I can't decode the error message sent by eclair:

2018-01-08 03:31:50.032 [DBG] PEER: Received UpdateFailHTLC(chan_id=04c37f4471d3e994c1bb455b9cb976959149e0b9bda1a0d377a2df14574c2e36, id=146, reason=06345bbb0e4e0d441abbd994c3713ee43c409b1856b9727b3b518f123d4b5eaf3f0272ddf0b44355951d71bc1fc61b0fbd6a1ba9cc5ab1591f1319f1230d244f61537ea925115271f2aac7b92298b35b7cc88cca6581ed44596cc58196e7111e53647863986acee00e0bd007335c373d82e6544476cdd44b7b0a0b1152d03aad40abf562e235ba7770c55244198e35640be0bed4ca91d8f472c39bdc820616b01a487e71ab7d858a63b9d1ed4f2a12715e1e488769b4cb055f6da10bf19ad3e46644cb6ad0572e140657516ec32ac7cc40b1b969773e16fcae505e946f249b3f3095abf810790bdb5720c576ce433917b76ed70d1b046a09741f3b032d7da4ce877ecdb2939891d56896a1381cafacf276a4bbe0209880d901ed98fc0520bada9225261c) from 
2018-01-08 03:31:50.043 [DBG] PEER: Received CommitSig(chan_id=04c37f4471d3e994c1bb455b9cb976959149e0b9bda1a0d377a2df14574c2e36, num_htlcs=0) from 
2018-01-08 03:31:50.050 [DBG] PEER: Sending RevokeAndAck(chan_id=04c37f4471d3e994c1bb455b9cb976959149e0b9bda1a0d377a2df14574c2e36, rev=4564ef7c7de66b5348820f993ab18cc21d4e11c1a8327518ccd4a596fe484e87, next_point=037346b4c008d039b46a2f577d4833a80133d91a29af1e03ae62e0fb191b571126) to 2018-01-08 03:31:50.059 [DBG] PEER: Sending CommitSig(chan_id=04c37f4471d3e994c1bb455b9cb976959149e0b9bda1a0d377a2df14574c2e36, num_htlcs=0) to 
2018-01-08 03:31:50.162 [DBG] PEER: Received RevokeAndAck(chan_id=04c37f4471d3e994c1bb455b9cb976959149e0b9bda1a0d377a2df14574c2e36, rev=379f27413b8a5154a4731a44a2d261368ce69885c363c4032e0a1af5bb6ae2f3, next_point=037b81d3eceb4fdfd3fca0fb8dad244ce1e670ee339266836a548ca5868221c5ab) from 
2018-01-08 03:31:50.165 [DBG] HSWC: ChannelPoint(372e4c5714dfa277d3a0a1bdb9e049919576b99c5b45bbc194e9d371447fc304:1) forwarding 1 HTLC's
2018-01-08 03:31:50.165 [DBG] HSWC: Closed completed onion circuit for 7740f2c4715c705ab2d0211d06ae67a37fa0e95338e8f73d7f424fb13ab86604: (0:0:0, 220) <-> (1257786:147:1, 146)
2018-01-08 03:31:50.168 [ERR] HSWC: unable to de-obfuscate onion failure, htlc with hash(7740f2c4715c705ab2d0211d06ae67a37fa0e95338e8f73d7f424fb13ab86604): unable to decode error update (type=*lnwire.FailTemporaryChannelFailure, len_bytes=130, bytes=1007b42a4030b07a3f456093a3020b980c1f54a7b15d663d122360823091945ac55d6941f4e3c0bccc2339020993008ac4024c6af59d9a4a4074b3f12fb1fae93a1543497fd7f826957108f4a30fd9cec3aeba79972084e90ead01ea3309000000001329ca00006200005a526d580001009000000000000027100000271000000064): unexpected EOF

lnd determines it's a FailTemporaryChannelFailure, then tries to decode the message:

1007b42a4030b07a3f456093a3020b980c1f54a7b15d663d122360823091945ac55d6941f4e3c0bccc2339020993008ac4024c6af59d9a4a4074b3f12fb1fae93a1543497fd7f826957108f4a30fd9cec3aeba79972084e90ead01ea3309000000001329ca00006200005a526d580001009000000000000027100000271000000064

Are y'all able to decode this?

Roasbeef commented 6 years ago

Hmm, actually if I just try to decode the raw error (instead of along with its framing), I'm able to properly decode it:

len:  4103
(*lnwire.FailTemporaryChannelFailure)(0xc42000e0b8)(TemporaryChannelFailure(update=(*lnwire.ChannelUpdate)(0xc420082370)({
 Signature: (*btcec.Signature)(0xc4200107b0)({
  R: (*big.Int)(0xc42000d060)(81490963346137529563603176940925655100903038764901586889101343776863355061597),
  S: (*big.Int)(0xc42000d080)(47609384326767702852395815807248248468603241552974167823125523079358952585749)
 }),
 ChainHash: (chainhash.Hash) (len=32 cap=32) 000000000933ea01ad0ee984209779baaec3ced90fa3f408719526f8d77f4943,
 ShortChannelID: (lnwire.ShortChannelID) 1255882:98:0,
 Timestamp: (uint32) 1515351384,
 Flags: (lnwire.ChanUpdateFlag) 1,
 TimeLockDelta: (uint16) 144,
 HtlcMinimumMsat: (lnwire.MilliSatoshi) 10000 mSAT,
 BaseFee: (uint32) 10000,
 FeeRate: (uint32) 100
})
))

The odd thing here still though, is that the prepended length of the update was 4103 bytes, when the update itself is just 130 bytes.

sstone commented 6 years ago

We're missing the len field in our error message: we send 1007 | channelUpdate instead of 1007 | len | channelUpdate so the first 2 bytes you interpret as length are actually the first 2 bytes of the channel update's signature field.

The correct encoding should be:

10070080b42a4030b07a3f456093a3020b980c1f54a7b15d663d122360823091945ac55d6941f4e3c0bccc2339020993008ac4024c6af59d9a4a4074b3f12fb1fae93a1543497fd7f826957108f4a30fd9cec3aeba79972084e90ead01ea3309000000001329ca00006200005a526d580001009000000000000027100000271000000064
sstone commented 6 years ago

Ok, with a correct encoding lnd crashes when it receives the error (see log below). The parsed temporary failure message looks good (i.e. matches what was sent). I'm using lnd at https://github.com/lightningnetwork/lnd/commit/beeb75cb5fdd0f21d747fad64abdfd7ffbd08cb4


2018-01-10 16:40:56.082 [ERR] CRTR: Attempt to send payment abadffbc5695a4dc400a5a227b44da8231b7ee2b6f6bd9340b49fd8eb0eb3577 failed: TemporaryChannelFailure(update=(*lnwire.ChannelUpdate)(0xc420272dc0)({
 Signature: (*btcec.Signature)(0xc420678470)({
  R: (*big.Int)(0xc4205374a0)(49491959598261878261056428276442350416278241260864900079909248898946334365505),
  S: (*big.Int)(0xc4205374c0)(13460190816797932915642861930206119487212107405213384283682323122815355044768)
 }),
 ChainHash: (chainhash.Hash) (len=32 cap=32) 0f9188f13cb7b2c71f2a335e3a4fc328bf5beb436012afca590b1a11466e2206,
 ShortChannelID: (lnwire.ShortChannelID) 592:1:0,
 Timestamp: (uint32) 1515598737,
 Flags: (lnwire.ChanUpdateFlag) 0,
 TimeLockDelta: (uint16) 144,
 HtlcMinimumMsat: (lnwire.MilliSatoshi) 1000 mSAT,
 BaseFee: (uint32) 10000,
 FeeRate: (uint32) 100
})
)
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x28 pc=0xace5cb]

goroutine 244 [running]:
github.com/lightningnetwork/lnd/routing.(*Route).nextHopChannel(0xc4205c9310, 0xc420471f20, 0x0, 0x0)
    /home/fabrice/go/src/github.com/lightningnetwork/lnd/routing/pathfind.go:154 +0x8b
github.com/lightningnetwork/lnd/routing.(*ChannelRouter).SendPayment(0xc420214210, 0xc420233c00, 0x0, 0x0, 0x0, 0x0, 0xc420555796, 0xc420555790, 0x0)
    /home/fabrice/go/src/github.com/lightningnetwork/lnd/routing/router.go:1651 +0xa52
main.(*rpcServer).SendPayment.func3(0xc42069a960, 0xc4205cb960, 0xc420232d40, 0xc4205cba20, 0xc42013cba0, 0x143f860, 0xc42014ca30, 0xc42069a900)
    /home/fabrice/go/src/github.com/lightningnetwork/lnd/rpcserver.go:1802 +0x14b
created by main.(*rpcServer).SendPayment
    /home/fabrice/go/src/github.com/lightningnetwork/lnd/rpcserver.go:1831 +0x5ad
Roasbeef commented 6 years ago

If you run that test again with lnd on tracelevel logging (--debuglevel=CRTR=trace) for the router, can you paste the log dump? Thanks!

Roasbeef commented 6 years ago

Working on a fix on our end. One question: this was triggered as we detected that the destination sent a temp chan failure, but there's an assumption in the code atm, that only any of the intermediate nodes will ever send that error (so we then try to prune that outgoing channel).

In this case the dest in starblocks, so why's it sending a temp chan failure if the HTLC reached it?

Roasbeef commented 6 years ago

Or is it failing because it has too many incoming/accepted HTLC's already? https://github.com/ACINQ/eclair/blob/a3bdf52a2f8e5697b03b238ed1ca0581fe88381d/eclair-core/src/main/scala/fr/acinq/eclair/payment/Relayer.scala#L143

Roasbeef commented 6 years ago

Closed by mistake...

pm47 commented 6 years ago

The failure is sent by endurance, not starblocks.

Reason is that the channel is depleted (there are actually other non-depleted channels between endurance and starblocks)

pm47 commented 6 years ago

I'm curious why you think that the destination sent the error?

Roasbeef commented 6 years ago

I'm curious why you think that the destination sent the error?

That was just a hunch without the full logs.

Reason is that the channel is depleted (there are actually other non-depleted channels between endurance and starblocks)

If multiple channels are active, and an incoming HTLC request comes along, eclair will not just go ahead an utilize the available channel even if it wasn't the one specified in the onion payload?

pm47 commented 6 years ago

If multiple channels are active, and an incoming HTLC request comes along, eclair will not just go ahead an utilize the available channel even if it wasn't the one specified in the onion payload?

There is a TODO for that in the relayer ;-)

Roasbeef commented 6 years ago

I'm curious why you think that the destination sent the error?

Looked into it a bit more, and the crash above happend as we detected that the destination sent the error. I say this as the only way that the map lookup that led to the crash can fail, is if the destination sent the error. Still investigating on my side, but adding a patch to address this edge case for now.

sstone commented 6 years ago

Yes, the crash above happens when it is the destination sends the failure message, not when it's a relaying a node. I've run the following tests locally: eclair 1 -- eclair 2 -- lnd, eclair 1 fails the HTLC with a TemporaryChannelFailure => lnd crashes eclair 1 -- eclair 2 -- lnd, eclair 2 fails the HTLC with a TemporaryChannelFailure => lnd logs the error and remains functional In both cases the failure message is logged properly