TryQuiet / quiet

A private, p2p alternative to Slack and Discord built on Tor & IPFS
https://www.tryquiet.org
GNU General Public License v3.0
1.94k stars 85 forks source link

Libp2p doesn't discover and dial peers but the ones provided in the invitation code #1982

Closed kingalg closed 10 months ago

kingalg commented 11 months ago

Mobile: mobile@2.0.1-alpha.7 (iOS 325) Desktop: quiet@2.0.1-alpha.7 ( to the lesser extend, it's not that significant on desktop, may be not an issue at all, but I wanted to make a note about version that I was using during those tests)

Quiet started to be very slow and it's disconnecting a lot. Also, sometimes it's not syncing after coming back from background. In previous versions it was working rather consistently so this is new problem in this version.

holmesworcester commented 11 months ago

@kingalg is this in a new community or joining our own quiet community? (I think it's in a new community but just confirming)

@siepra can you work with kinga to make sure you can reproduce this? I think the most important thing is that you're able to reproduce whatever problem(s) kinga is seeing. And can you switch to working on this at the next convenient stopping moment? It's not urgent until the rest of the Sprint is close to completion but it may be good to start investigating it now.

holmesworcester commented 11 months ago

To me this belongs in sprint because it's a blocker to the release. Let me know if there's some reason why it is blocked otherwise let's leave it in sprint.

kingalg commented 11 months ago

@holmesworcester yes, it belongs to sprint, it was missclick on my part. It was tested on new communities. One of them was one day old, had few hundred messages and about 20 users (most of them disconnected) and the rest were completely new, created only to test this issue.

vinkabuki commented 11 months ago

one theory to test is to check if tor process is not shutting down.

holmesworcester commented 11 months ago

We decided that we will attempt to reproduce this in the office with kinga on a developer's device where we can see logs.

EmiM commented 11 months ago

This has always been an issue but it was just not visible when we had registrar because joining users were receiving list of all peers registered in community.

What we can do?

  1. Investigate if we can make libp2p connect to other peers in the network. So if B connects to A and C connects to A then can B connect to C?
  2. implement a workaround by passing to libp2p a list of addresses gathered from csrs
  3. Bump libp2p. However this is not easy and I don't know if even possible right now. Bumping libp2p requires bumping ipfs and bumping ipfs is blocked by orbitdb.

I'd go for first option and if this doesn't work out I'd fallback to workarounds.

holmesworcester commented 11 months ago

I think what we want is to use the addresses from CSRs and orbitdb.

Libp2p probably makes different assumptions about the context so if we rely only on libp2p's default peer discovery mechanism this will be suboptimal or insufficient for our case.

We want to make sure that:

  1. All synced clients know about all peers, so that even if only one of those peers is online, the user will eventually connect to them.
  2. Clients bias to last-seen and most-online peers when connecting, so that having a large community with many offline peers does not make connecting to the guest peer slower and slower (because connecting over Tor takes some time and we can't make hundreds of simultaneous attempts.)
  3. Clients still try peers according to libp2p's own logic, so that the network doesn't just center around a few often-online peers, which it would if we only connected to the most online peers. (This piece is tricky to reason about.)

We should also make sure there are tests for these cases so that we don't inadvertently break these properties with future changes. The last case will be hard to make a test for, but perhaps it's something like, if A, B, and C are online in a large community with hundreds of peers offline, and A and B are connected and B and C are connected, A and C should soon be connected.

holmesworcester commented 11 months ago

Also, solving this problem was part of the initial epic as described in #1340 , but it got missed when dividing the epic up into tasks:

Note on peer discovery: this might already be taken care of but we should make sure that users still discover all peers in the community when they connect and start syncing, by syncing this data from OrbitDB. It's okay if their addresses are not added to this list until they are registered with owner.

We should run through the initial epic as described and make sure we're not missing anything else, perhaps?

holmesworcester commented 10 months ago

Questions about how this will work:

  1. At what point in joining does a new user know the entire list of all peer onion address and peer ID in the network?
  2. How do they learn this?
  3. How do all existing members of a community learn of a new member's onion address and peer ID, and when?
  4. Are there other mechanisms through which peers discover onion addresses or peer IDs other than the above?
  5. What is a peer's strategy now for connecting to other peers?
  6. Who do peers try first?
  7. How many peers do they try?
  8. Do we think this makes sense and will result in all peers being connected in a graph as desired?
  9. Have we tried other approaches before and what did we think about those?
  10. Will a peer who never restarts update their list of peers? Or is it only updated on restart?
  11. What's the history of problems solved here?
  12. What regression tests do we have to cover solved problems?
  13. Any other thoughts?

(For example: we needed to connect to last seen / often online because otherwise having many peers who joined once and then left would mess up the community because trying all of them would take too much time. Were there any other fixes related to discovery? Libp2p didn't consider new peers that joined until we gave libp2p a new list. Peer discovery never worked.

Let's do some quick documentation of how this works. Throw some notes in a markdown file in the wiki or a docs directory.

vinkabuki commented 10 months ago

@holmesworcester @leblowl

  1. Once replicating all csrs
  2. Through orbitdb db csrs db replication
  3. Once they replicate their csr
  4. Yes, through invitation link
  5. We provide dial list to libp2p and we manually dial peers until connecting to the first peer.
  6. When joining they dial peers as given in the invitation link, for returning users we collect data of last seen peers and most shared connection time with, it's also what we share in invite link
  7. max 4 for the new uses, have to check for returning users.
  8. It makes sense but needs to be validated, and possibly improved.
  9. There is this native libp2p DHT module. Also in past iterations we were providing bootstrap list, at some point we discovered that libp2p autodialer doesn't work well so we implemented our own mechanism for dialing peers.
  10. atm we don't add new peers to libp2p on the runtime.
  11. bootstrap address list not working and replaced with our dialing mechanism. autodialer was dialing only one peer at the time which was super slow.
  12. We should definitely add some.
  13. @vinkabuki and @EmiM opinion is that our own dialer for many reasons (problems with native solutions, using non standard websocketOverTor transport, specific requirements we very often face, and ease of use and control) is the best option.

Things to improve:

  1. Ensure libp2p is autodialing provided peers.
  2. Add every valid replicated peer info to libp2p on runtime.
holmesworcester commented 10 months ago

Some follow up questions:

Re: question 4, what happens if peer information is in an invitation link but not in the CSRs?

Re: question 6, when someone new joins, does dialing new peers begin as soon as we receive the CSRs? Or do they only dial the peers in the invitation link, even once they replicate CSRs?

Re: question 10, the fact that we do not dial new peers until restart at seems like a bug. If we have a node that is always on, it will never restart, so it will never dial new peers. Another problem is if a new user joins by connecting to 1 online peer and then loses that connection, it sounds like they will not dial any new peers.

For regression tests, let's add some! What cases would we like to cover?

EmiM commented 10 months ago

Re: question 4, what happens if peer information is in an invitation link but not in the CSRs?

We dial peer from invitation link. We have to connect to someone to even replicate csrs. Does temporary lack of CSR mean that given peer is suspicious? Or can it mean that that we just don't have csr yet?

Re: question 6, when someone new joins, does dialing new peers begin as soon as we receive the CSRs? Or do they only dial the peers in the invitation link, even once they replicate CSRs?

In current implementation we only directly dial peers from invitation link. We will be dialing as soon as we replicate CSR but only if we are not already connected to given peer.

Re: question 10, the fact that we do not dial new peers until restart at seems like a bug. If we have a node that is always on, it will never restart, so it will never dial new peers. Another problem is if a new user joins by connecting to 1 online peer and then loses that connection, it sounds like they will not dial any new peers.

Yes, however if peer is always online someone else may dial the peer eventually. I'm thinking, maybe we could start dialing all known peers again if we reach 0 connections and stop as soon as we connect to someone.

holmesworcester commented 10 months ago

We dial peer from invitation link. We have to connect to someone to even replicate csrs. Does temporary lack of CSR mean that given peer is suspicious? Or can it mean that that we just don't have csr yet?

Will the address from the link stick around after, even once someone has replicated many CSRs? Ah, I suppose we don't know when we have replicated all CSRs.

In current implementation we only directly dial peers from invitation link. We will be dialing as soon as we replicate CSR but only if we are not already connected to given peer.

So this is in progress as part of this work, correct? Or will we create a new issue?

I'm thinking, maybe we could start dialing all known peers again if we reach 0 connections and stop as soon as we connect to someone.

We should have a target number of connections, like 6 or 8, and dial peers if we fall under that number.

Should all not connected peers be dialed indefinitely?

In at least one case it would be helpful: say there are 100 online peers, but they are all new peers, so a returning peer doesn't know their addresses. If they all dial not-connected peers randomly at some cadence, they would eventually dial the returning peer. If they don't, the returning peer would never connect.

I suspect there are other cases when it's helpful too. That said, I think we did learn that there is a performance cost to attempting lots of connections over Tor, so maybe we don't want to overdo it.

Another question is: do we know why libp2p's logic for this stuff not working? It might be that there are subtleties here where we don't want to have to roll our own approach to cover all the cases.

EmiM commented 10 months ago

So this is in progress as part of this work, correct? Or will we create a new issue?

Yes, this is part of this work

Another question is: do we know why libp2p's logic for this stuff not working?

Maybe it's because of our websocketovertor, Bartek mentioned that he noticed that we may be lacking some implementation there.

EmiM commented 10 months ago

We have to decide what's the DOD here.

If we want to do it properly we probably have to write our own dialer. Especially if we're talking about full control on when we are connecting to who and when we are dropping the connection and also maybe how often do we want to dial given peer. Plus we will maybe have more control over connections in the future. However writing own dialer can be a bigger task.

If we want to temporarily just fix the known issues with glue we also can do that, it'll be done quicker but it is not a long term solution and we will probably have to get rid of it soon anyway.

holmesworcester commented 10 months ago

Let's do the easiest fixes now. We are on an old version of libp2p. We should return to this after we upgrade, since the problems might be different.

EmiM commented 10 months ago

By "easiest" you mean "only fix basic known issues and do not write custom dialer"?

holmesworcester commented 10 months ago

I mean make some decision about what the most valuable/easy fixes are and do those.

What do you think they are?

holmesworcester commented 10 months ago

Let's keep this focused on the problem this issue was created for (libp2p only dials peers in the invitation code) and create separate issues for the other problems, since this problem is new and related to current work, while the others are not.

Other issues:

  1. Peers never try to dial peers randomly, so partitions can happen. We have a separate issue for this: https://github.com/TryQuiet/quiet/issues/2056
  2. Peers don't dial more peers when they lose connections, when they should try to stay connected to at least 6 peers. https://github.com/orgs/TryQuiet/projects/3/views/1?pane=issue&itemId=44468676
  3. Any others?

So I think we just need to make sure newly replicated peer info becomes part of our list of peers without restart. Is that right? Do we think this requires writing a custom dialer?

holmesworcester commented 10 months ago

Here are some reasons weighing on the side of "don't make changes internal to libp2p yet":

  1. We are on an old version of IPFS
  2. We are probably on an old version of libp2p
  3. We want it to be easy to upgrade libp2p
  4. We don't yet know if js-ipfs is sufficient for our work on iOS, so there's still some risk of needing to change.
  5. In our experiments with iOS we will have an always-on node that is reachable without Tor, and we will connect to this first in many cases. That will make some of these problems go away, and we can return to them when we need to.

@vinkabuki @EmiM thoughts?

EmiM commented 10 months ago

Writing custom dialer is probably in our future anyway but we can wait for upgrading libp2p and in the meantime try other approach.

I looked at libp2p dialer code and we may be able to get more out of libp2p connection manager configuration by setting proper minConnections/maxConnections (we already do use that but we may adjust it a bit) and tagging peers. Libp2p uses peer tags for deciding which connection needs to be pruned first if number of connections reaches maxConnections.

holmesworcester commented 10 months ago

@EmiM so what are your next steps?

EmiM commented 10 months ago
holmesworcester commented 10 months ago

Note:

EmiM commented 10 months ago

https://github.com/TryQuiet/quiet/pull/2051

holmesworcester commented 10 months ago

It would probably be good to test this a lot on desktop and Android for any performance or battery regressions, in a community with many peers.

Do we remember why we disabled the autodialer in the past? Was it just dialing way too much?

EmiM commented 10 months ago

We've never diabled autodialer.

We added initial dialing on our part because libp2p bootstrap was slow and was dialing only one peer at a time. Libp2p autodialer still works and have been working this whole time, it just uses peers added to libp2p's internal peer book. Peer book is updated with peer data when peer is dialed. So to sum up - we dial initial list of peers (and after this fix - all new peers) and libp2p's autodialer takes care of keeping/establishing at least minConnections number of connections.

holmesworcester commented 10 months ago

Great!

kingalg commented 10 months ago

Version: 2.0.3-alpha.10 mobile@2.0.3-alpha.11 (iOS 338)

As far as I could check this issue is fixed.