JoinMarket-Org / joinmarket-clientserver

Bitcoin CoinJoin implementation with incentive structure to convince people to take part
GNU General Public License v3.0
729 stars 178 forks source link

Directory nodes constantly disconnecting #1435

Open PulpCattel opened 1 year ago

PulpCattel commented 1 year ago

EDIT: as per https://github.com/JoinMarket-Org/joinmarket-clientserver/pull/1436#issuecomment-1399480855, the spam issue should be mostly solved. Renamed this issue to focus on the "disconnect" part. Hopefully the changes in #1436 will improve this already.

EDIT: g3hv4uynnmynqqq2mchf3fcm3yd46kfzmcdogejuckgwknwyq5ya6iad is constantly disconnecting too now.

As reported by many different users and contributors, the following two directory nodes are disconnecting in a repetitive/constant pattern:

The log is simply a constant repetition of:

[DEBUG]  We got a disconnect event: 3kxw6lf5vf6y26emzwgibzhrzhmhqiw6ekrek3nqfjjmhwznb2moonad.onion:5222
[INFO ]  Updating status to connected for peer 3kxw6lf5vf6y26emzwgibzhrzhmhqiw6ekrek3nqfjjmhwznb2moonad.onion:5222.

It started last week from what I can gather from users report, but I can't confirm it wasn't happening earlier too. At this point, it seems plausible to me this is a bug in our code logic.

The simplest explanation that comes to my mind is that we are not handling correctly some connectivity issue edge case.

Curiously, the newly added directory node g3hv4uynnmynqqq2mchf3fcm3yd46kfzmcdogejuckgwknwyq5ya6iad.onion:5222 is unaffected. This node is not yet in a release, so it might simply be because of less network traffic (EDIT: this directory is now experiencing exactly the same problem).

FWIW, signet directories also appear to be working.

This issue might resolve itself, but IMHO it's good to look into it to prevent it from happening again.

AdamISZ commented 1 year ago

Just dropping a note on what I've been seeing for the last few days since #1436 merge: though we back off (with that code ofc, most users are on 0.9.8), we still seem to be basically permanently shut off from two of the three directory-nodes. More work required, even though that patch did have the desired effect, and bots can actually run normally without log spam.

Re: "more work", I think a good place to focus on is how the directory node onion services are configured. Perhaps somewhere buried there's a setting that can allow more connections? Or at least, figure out what the connection limits are (both in timing of requests, and total number, say)?

PulpCattel commented 1 year ago

Just focusing about Tor, this could be a start https://community.torproject.org/onion-services/advanced/dos/

For more crude details we might have to dig Tor source code and spec at some point.

AdamISZ commented 1 year ago

Was reading docs for a related project, and noticed this: https://tunnelsats.github.io/tunnelsats/FAQ.html#tuning-tor ... perhaps worth investigating? I believe someone mentioned LongLivedPorts on this repo a while ago. Edit: yes, see https://github.com/JoinMarket-Org/joinmarket-clientserver/commit/fbcb9fd377f805fdbc59cbad8e1faaa7de9eb495 - hence 5222.

AdamISZ commented 1 year ago

Interesting point from man tor: ConnLimit defaults to 1000. Second, I observed earlier today that about 1000 !orderbook messages were propagated by my dir node. Third, the way we use tor is that a separate instance of the tor process is run for each directory node.

Though we see only perhaps a couple of hundred makers (at best), we could very easily be supporting 1000 takers+makers+obwatchers. I'll look into changing this. (though, i still don't actually understand why it would be 'connect then disconnect' and not just 'refuse', if indeed this is in any way related).

Edit: hmm, on a second read, unfortunately that appears to be a minimum not a maximum. So probably irrelevant but may as well leave the note here.

AdamISZ commented 1 year ago

In case anyone else is investigating this like me, here's a point that might be very important: how to see detailed logs of the tor instance in which your directory node is running:

Note that we spawn a new tor process for it, and that process can be set to log to a particular file by editing path-to-jmcs/jmvenv/lib/python3.xx/site-packages/txtorcon/controller.py line 314 by extending this list:

config_args = ['-f', '/dev/null/non-existant-on-purpose', '--ignore-missing-torrc']

to something like:

config_args = ['-f', '/dev/null/non-existant-on-purpose', '--ignore-missing-torrc', 'Log', 'info file yourfilename.txt']

(read man tor, but notice that while on command line you need double quotes around the multi word value for the Log option, here, you don't need th edouble quotes). Another thing is that the location of the log file created is in /tmp/tortmpxxxxx where the random xxxxx can be seen from ps aux | grep tor or similar.

Log levels here are debug, info, warn, notice or similar as per docs/man page. I think info is already pretty verbose, debug looks overwhelming. I'm going to be tail -f-ing this to help me see what transpires with different setups.

(You can also control the output of journalctl with options in the .service file, like StandardOutput:journal+console but it doesn't help you with the tor log, hence the above).

Please let me know if you have other info about logging tor, here.

AdamISZ commented 1 year ago

An addendum: when tail-ing that output, it was still overwhelming (which itself may be interesting!) for my mainnet dnode, so you may find notice more realistic than info (as, in fact, the man page suggests). Let's see.

CohibAA commented 1 year ago

Seeing this pretty often still when launching bots. Curious why sometimes they will reconnect for very short periods of time.

2023-03-xxxxxxxxxxxxxxx [WARNING]  Failure to send message to directory: g3hv4uynnmynqqq2mchf3fcm3yd46kfzmcdogejuckgwknwyq5ya6iad.onion:5222.
2023-03-xxxxxxxxxxxxxxx [WARNING]  Failure to send message to directory: 3kxw6lf5vf6y26emzwgibzhrzhmhqiw6ekrek3nqfjjmhwznb2moonad.onion:5222.
2023-03-xxxxxxxxxxxxxxx [WARNING]  Failure to send message to directory: bqlpq6ak24mwvuixixitift4yu42nxchlilrcqwk2ugn45tdclg42qid.onion:5222.

So, I decided to should start my own directory node to dig into why. Tor is quite a bit more sophisticated than I am, so it was a lot of trial and error to even get going. Now that it's working, I guess it's time to try to break it.

If anyone wants to connect, here is the list of directory_nodes I am currently using for reference. This includes the newest from the stopgap list of directory nodes, #1445, to date of this comment, as well as the new node from https://github.com/JoinMarket-Org/joinmarket-clientserver/issues/1456.

directory_nodes = plq5jw5hqke6werrc5duvgwbuczrg4mphlqsqbzmdfwxwkm2ncdzxeqd.onion:5222,g3hv4uynnmynqqq2mchf3fcm3yd46kfzmcdogejuckgwknwyq5ya6iad.onion:5222,3kxw6lf5vf6y26emzwgibzhrzhmhqiw6ekrek3nqfjjmhwznb2moonad.onion:5222,bqlpq6ak24mwvuixixitift4yu42nxchlilrcqwk2ugn45tdclg42qid.onion:5222,ylegp63psfqh3zk2huckf2xth6dxvh2z364ykjfmvsoze6tkfjceq7qd.onion:5222
PulpCattel commented 1 year ago

From what I can see, g3hv4uynnmynqqq2mchf3fcm3yd46kfzmcdogejuckgwknwyq5ya6iad is "down" too now, same pattern as the other directories. RIP.

It really seems like there's a magic threshold that we reach and things stop working.

CohibAA commented 1 year ago
2023-03-27 xxxxxxxxxxxx [WARNING]  Failure to send message to directory: g3hv4uynnmynqqq2mchf3fcm3yd46kfzmcdogejuckgwknwyq5ya6iad.onion:5222.
2023-03-27 xxxxxxxxxxxx [WARNING]  Failure to send message to directory: 3kxw6lf5vf6y26emzwgibzhrzhmhqiw6ekrek3nqfjjmhwznb2moonad.onion:5222.
2023-03-27 xxxxxxxxxxxx [WARNING]  Failure to send message to directory: bqlpq6ak24mwvuixixitift4yu42nxchlilrcqwk2ugn45tdclg42qid.onion:5222.

Seeing the same thing here still. I've been monitoring the tor logs for the directory node I setup, but nothing really unusual has presented itself yet, and it still hasn't been hit by the "magic threshold" as far as I can tell. Please report if you have issues with "wkd3kd73ze62sqhlj2ebwe6fxqvshw5sya6nkvrgcweegt7ljhuspaid.onion"* and I can reference back to logs more directly possibly.

edit: The "wkd3" onion referenced above has been replaced by plq5jw5hqke6werrc5duvgwbuczrg4mphlqsqbzmdfwxwkm2ncdzxeqd.onion due to unrelated server downtime.