cculianu / Fulcrum

A fast & nimble SPV Server for BCH, BTC, and LTC
Other
343 stars 77 forks source link

python client cannot gracefully close transport (waits forever), it needs to abort #76

Open SomberNight opened 3 years ago

SomberNight commented 3 years ago

An asyncio python client using asyncio.BaseTransport.close() never manages to close a transport (with a Fulcrum server); it must abort it using asyncio.WriteTransport.abort.

We have already talked about this on IRC some time ago, I just wanted to log the issue so it can easily be referenced.

See IRC log:

[2020-12-16 17:38:56 +0100] <ghost43>   NilacTheGrim: oh btw, you might be interested in https://github.com/spesmilo/electrum/commit/e83f0dd3fc4e19c8845fc21f344400f1b0a7eebf  - see commit msg. we have narrowed the issue down to fulcrum servers
[2020-12-16 17:40:59 +0100] <ghost43>   although I appreciate that the interesting and actionable details from your PoV are missing :P  I can only tell you that closing the session with a fulcrum server is slow but not with other servers, on a good network connection
[2020-12-16 21:07:33 +0100] <NilacTheGrim>  ghost43, regarding that Fulcrum issue -- hmm. Not sure why that would be. What exactly is going on there on the socket level?
[2020-12-16 21:07:56 +0100] <NilacTheGrim>  what is interface.close()? just a regular socket close? why would that take so long?
[2020-12-16 21:08:19 +0100] <NilacTheGrim>  i did actually notice this gUI hang thing
[2020-12-16 21:17:43 +0100] <NilacTheGrim>  fulcrum just doesn't receive the close packet until like 10 second delay. i literally have it here on my lan. i tried socat and other tools and it works fine. something is funny in the python lib .. weird that it doesn't happen on E-X though.
[2020-12-16 21:25:27 +0100] <NilacTheGrim>  ghost43, i'm investigating but so far this looks like some sort of bug/out of spec behavior in aiorpcx
[2020-12-16 21:29:09 +0100] <ghost43>   NilacTheGrim: when I looked, it seemed aiorpcx is not doing anything weird -- it is using stdlib asyncio stuff
[2020-12-16 21:29:39 +0100] <NilacTheGrim>  i'm looking now
[2020-12-16 21:30:53 +0100] <NilacTheGrim>  note that Fulcrum works with a plethora of different client implementations (including the old socket stuff Electron Cash still uses), plus javascript clients, explorers. i have never heard of issues closing the connection. I also cm doing it now here .. using C library to close the socket. nothing is weird at all. yet when I use Electrum latest with aiorpcx it does indeed hang.  node that close() can take a timeout parameter so you could also close
[2020-12-16 21:30:53 +0100] <NilacTheGrim>  without having to spawn a new thread group or whatever.
[2020-12-16 21:33:13 +0100] <ghost43>   NilacTheGrim: https://github.com/kyuupichan/aiorpcX/blob/a5f9875f65b63384c33e8a13aa6ab66e8c5bc852/aiorpcx/rawsocket.py#L123
[2020-12-16 21:33:19 +0100] <ghost43>   https://docs.python.org/3/library/asyncio-protocol.html#asyncio.BaseTransport.close
[2020-12-16 21:33:22 +0100] <ghost43>   https://docs.python.org/3/library/asyncio-protocol.html#asyncio.WriteTransport.abort
[2020-12-16 21:34:35 +0100] <ghost43>   self._closed_event gets set when asyncio calls connection_lost
[2020-12-16 21:34:56 +0100] <ghost43>   ISTM the whatever the cause is, it is in asyncio itself
[2020-12-16 21:35:04 +0100] <ghost43>   -the
[2020-12-16 21:37:52 +0100] <NilacTheGrim>  i am not familiar with aiorpcx but it seems to do a bunch of work internally when closing gracefully like cancelling futures and whatnot. stil this should not be fulcrum-specific. so weird man.
[2020-12-16 21:38:12 +0100] <NilacTheGrim>  the thing is so whatever weird stuff is happening -- if it indeed is some out-of-spec issue -- electrumx uses the same lib so it has the same weirdness, lol.
[2020-12-16 21:38:24 +0100] <NilacTheGrim>  let me try against another implementation.. electrs works ok .. right?
[2020-12-16 21:39:31 +0100] <NilacTheGrim>  yeah it does. ok well troubleshooting...
[2020-12-16 21:45:01 +0100] <NilacTheGrim>  ghost43: maybe related: https://bugs.python.org/issue39758
[2020-12-16 21:47:35 +0100] <NilacTheGrim>  this appears tobe a bug in python itself when running against microsoft tcp stacks sometimes not always. lol. guess what? my server is on a microsoft tcp stack. still troubleshooting.  doesn't happen in python 3.6
[2020-12-16 22:08:19 +0100] <NilacTheGrim>  yeah this is definitely a known bug in python asyncio -- and only happens against microsoft tcp stacks!
[2020-12-16 22:08:35 +0100] <NilacTheGrim>  (my server happens to be on ms) -- see the linked issue to python.org
[2020-12-16 22:11:53 +0100] <NilacTheGrim>  ghost43, yep can confirm -- trying against a non-MS TCP stack Fulcrum server, runs ok. Trying against MS-TCP stack Fulcrum hangs. (this is on Electrum with the commit prior to the fix) -- I bet if you were to run ElectrumX on Windows it would have the same issue.
[2020-12-16 22:13:38 +0100] <NilacTheGrim>  (ok also happens in python 3.6 -- definitely a bug in python and/or MS TCP stack interaction -- still unfixed)
[2020-12-16 22:32:52 +0100] <ghost43>   NilacTheGrim: good find, that issue looks related.  what do you mean by ms tcp stack? do you mean running the server on Windows? because I have tested against multiple fulcrum servers and all exhibited it, e.g. btc.electroncash.dk:60002:s, fulcrum.keff.org:50002:s, fulcrum.grey.pw:51002:s
[2020-12-16 22:33:01 +0100] <ghost43>   I doubt those all run on windows
[2020-12-16 22:33:39 +0100] <NilacTheGrim>  those run linux..
[2020-12-16 22:33:44 +0100] <NilacTheGrim>  let me try it now..
[2020-12-16 22:34:42 +0100] <NilacTheGrim>  i didn't realize there's more than 2 Fuclrum servers..
[2020-12-16 22:34:52 +0100] <NilacTheGrim>  electroncash.dk is down for me.. hmm.
[2020-12-16 22:37:34 +0100] <NilacTheGrim>  the only other one that is up for me is fulcrum.grey.pw:51002.. it's happening on that one. the other 2 are down.
[2020-12-16 22:38:12 +0100] <ghost43>   weird, I have just tested connecting to all three and they are up
[2020-12-16 22:39:22 +0100] <NilacTheGrim>  the electroncash.dk one is definitely Linux -- i know the admin he's on Linux -- it's happening now for you there too? I'm on commit hash 19f806ddf47634fe0c821244a62b41c71c4138d3 but cannot connect there at all
[2020-12-16 22:40:07 +0100] <ghost43>   I have just tested whether I can connect to them now; but a few days ago when I committed that I have tested and reproduced the issue on all three
[2020-12-16 22:42:26 +0100] <NilacTheGrim>  so maybe it's not just windows stacks but some other interaction. i was able to get the hang with that test script in the python.org issue...
[2020-12-16 22:51:03 +0100] <NilacTheGrim>  man this socket issue is driving me crazy -- glad you have a workaround. it's worrying.
[2020-12-16 22:51:09 +0100] <NilacTheGrim>  i will try and isolate it further...
[2020-12-16 22:51:32 +0100] <NilacTheGrim>  it definitely is a thing with python SSL tho -- weird that other implementations like electrs are unaffected. will dig further...
[2020-12-16 22:52:35 +0100] <NilacTheGrim>  there's a big huge explanation here too of what they think may be happening with some stacks: https://github.com/python-trio/trio/blob/31e2ae866ad549f1927d45ce073d4f0ea9f12419/trio/_ssl.py#L779-L829
cculianu commented 3 years ago

Ok, I'll look into it. Thanks for opening the issue.

Pretty sure there is some strangeness in one of the python frameworks that you use. The reason why I say this is: I have seen other unrelated projects get bitten by this with my google searches when investigating this.

Here is 1 workaround (linked in your chat log):

https://github.com/python-trio/trio/blob/31e2ae866ad549f1927d45ce073d4f0ea9f12419/trio/_ssl.py#L779-L829

And here is a known issue in Python that may be related that has gone unfixed: https://bugs.python.org/issue39758

But anyway thanks for the report.

cculianu commented 3 years ago

Ok, so I am using wireshark to do a packet capture. Here is a screenshot.

Screen Shot 2021-03-05 at 9 40 12 AM

My localhost Electrum is 192.168.0.101. The remote server is 192.168.0.15 (running Fulcrum on mainnet). I initiated the server switch at Wireshark time ~323.85. I believe Python initiated the close then by sending some TLS-level data and waiting.

However, Python did not send a TCP FIN until ~20 seconds later at time ~353, at which point the Fulcrum side promptly replied with a TCP FIN immediately. (The 20 seconds I presume is the internal timeout being used in Python or in Electrum).

I do believe this issue exists possibly on the SSL layer -- it appears the Python side is "waiting for something" which never arrives. You can't see it here but immediately above when I initiated the close there was some TLS packets exchanged initially and sent from the Python side (it's immediately above the top-most thing visible).

I will investigate further if there is anything I can do on my end but this looks a lot like the Python-asyncio-specific "quirks" linked-to in the issues above... perhaps.

cculianu commented 3 years ago

I believe this is what's happening

I believe Python is following the "old spec" which they changed in TLS 1.1. Now it's not 100% required to wait for the close-notify reply on the TLS layer. But Python side is waiting. And I guess many TLS stacks do behave in the way that Python expects (including the rust-based stack used in Electrs).

The Qt-based TLS stack used in Fulcrum (which ultimately is using openssl) is behaving differently here and confusing Python. I believe it is not responding to the close-notify in all cases, hence the delay before timeout? You'll notice it's considered widespread practice not to respond to that message immediately... and they changed the spec actually to reflect what many implementations do.

From the above mentioned stack exchange thread:

OpenSSL, and thus implementations based on OpenSSL (clients and server), tend to no longer send 
close_notify messages; they just drop the connection. The main reason is that in existing Web servers 
and clients, connections are managed in a pool that closes them after some inactivity delay; that pool 
manages low-level sockets and has no notion of what SSL could be; thus, the SSL layer has no way 
to send (or wait for) an explicit close_notify.

I really do think this is just an unfortunate mix of software following specs differently, specs changing, and some person on the Python asyncio team not fixing bugs as they should.

I do believe the workaround you have developed in Electrum is sufficient and good. You can't hang the app waiting for TLS packets. Isn't Electrum trustless :) ? It's weird if a server can hang you like that anyway -- regardless of Fulcrum's TLS stack.

I'd like to fix this but unfortunately I am not going to rip apart the C code this is all based on to change how TLS is implemented on my side -- as much as i'd like to do that -- it's outside the scope of my application and not worth it. Especially since the workaround you have developed is actually something you should have in your app anyway.

I am going to conclude that the workaround you have developed in Electrum is actually a good idea regardless. And recommend you continue to do it.

Leaving this issue open though, just in case anybody has some bright ideas here.