Research the hole-punching solution for DSN.

shamil-gadelshin commented 1 year ago

Relates to https://github.com/subspace/subspace/issues/1022

dtynn commented 7 months ago

Hi, after digging into the circuit relay & dcutr protocols，I am wondering what kind of relaying node below are you planning to implement and provide?

subspace-bootstrap-node , as infrastructures
individual bin, like subspace-relay-node, self-hosted by network participants
any subspace-node or subspace-farmer instances with relay optionally enabled by their owners

or is there any other solutions?

shamil-gadelshin commented 7 months ago

You have enumerated possible options. However, centralized infrastructure (like bootstrap nodes) is a single point of failure. It's not critical though because relays only improve network capabilities rather than provide core services. Embedded relay functions for nodes or farmers with strict limits seem like a more robust solution. Hybrid solutions are also possible, for example - we start testing with dedicated relay nodes, measure the effect, and upgrade with mass relays.

dtynn commented 7 months ago

Thank you for responding.

I'm also thinking of embedded relay for nodes & farmers. for now, i have some basic assumptions:

a node should be either serverside or clientside, or even neither (relay disabled);
a serverside node may choose which reservation requests to accept based on:
- clients with dcutr-like protocol should be high priored, or event required
- clients with too many relayed listen-ons declared should be denied

I'm wondering if you guys have made some progress on this topic?

I've already looked through some related networking issues and prs but got little help, plz kindly remind me if i missed something.

shamil-gadelshin commented 6 months ago

a node should be either serverside or clientside, or even neither (relay disabled)

This status should be derived automatically using autonat protocol

clients with dcutr-like protocol should be high priored, or event required

I don't think libp2p supports QoS for protocols.

clients with too many relayed listen-ons declared should be denied

It's a possible security check. However, each relay will have a "connection limit per peer" as each peer has now. It will prevent reconnection from the same malicious peer.

I'm wondering if you guys have made some progress on this topic?

We didn't have any progress here besides adding autonat.

dtynn commented 6 months ago

This status should be derived automatically using autonat protocol

I mean, the behaviour shouuld be something like Either<libp2p::realy::Behaviour, libp2p::relay::client::Behaviour> so that one node would not relay traffic while asking some peers to connect to it through other relayers.

I don't think libp2p supports QoS for protocols.

Yes, and this may require customized relay behaviour, for example, check the supported protocols in libp2p::identify::Event::Received before handling libp2p::relay::handler::Event::ReservationReqReceived.

It's a possible security check. However, each relay will have a "connection limit per peer" as each peer has now. It will prevent reconnection from the same malicious peer.

I was worried that a node make reservation request to every node it reaches, but after thinking about the situation once again, i realized that it seems to be a normal scene in p2p networking, so just forget this.

We didn't have any progress here besides adding autonat.

Ok, I will check the source codes, the examples and other projects depend on libp2p.

Thank you very much for the information.

dtynn commented 6 months ago

Hi

After reading rust-libp2p/examples/relay-server rust-libp2p/examples/dcutr go-libp2p/p2p/host/autorelay kubo/core/node/libp2p/relay.go,and testing, I have some thoughts about the hole-punching.

First, I would describe the two basic roles as:

relayer : a peer which is public reachable and is willing to provide relay function for other peers.
relay-client: a peer that either: 1) not public reachable but would like to be relayed, or 2) would like to connect relayed peers.

Some conclusions:

port_reuse for tcp transport must be enabled to to allow dcutr protocol for a relay-client.
inside event loop:
- autonat::Event::StatusChanged { new: NatStatus::Public(addr), .. } => this node can be a relayer for other peers;
- autonat::Event::StatusChanged { new: NatStatus::Private, .. } => this node should be a relay-client.
- identify::Event::Received:
  - for a relayer, a public observed_addr should be added into swarm's external addresses, in order to allow relay reservation from others
  - for a relay-client, if the remote peer has relay::protocol::HOP_PROTOCOL_NAME, it would be recorded as a candidate relayer
- swarm::ConnectionEstablished { peer_id, endpoint: Dialer { address, role_override }, .. }:
  - with override Enpoint::Dialer => address could be used to construct a circuit listen addr used by a relay-client
  - with override Enpoint::Listener => this connection is use the to negotiate roles in dcutr protocol, and should be handled differently from above;

shamil-gadelshin commented 6 months ago

I would start following this document and its stages: https://docs.libp2p.io/concepts/nat/hole-punching/

We have already added the first ingredient (autonat) for nat detection. The next logical step is enabling the relayed connections without DCUTR. Several things to consider:

the solution should be an incremental non-breaking upgrade
potential relayers should have strict limits on connection number and preferably bandwidth to avoid abusing
identify protocol isn't a source of external addresses (only candidates) - autonat is.
tcp port reuse can't be enabled because it will lead to the "connection failed: IP-port is not unique" error because we can have several connections between peers.

The end goal is to improve the network connectivity by including private farmer's caches in the global pool.

nazar-pc commented 6 months ago

I'll clarify that the goal is to use relays ONLY for hole punching purposes and not to send traffic through those relays in case direct connection is possible. This is important.

dtynn commented 6 months ago

potential relayers should have strict limits on connection number and preferably bandwidth to avoid abusing

There are some default configurable limits for relay protocol Config, but no directly bandwidth controls.

identify protocol isn't a source of external addresses (only candidates) - autonat is.

That's correct. My previous concern here is: what will the relayer act if 2 transports are enabled (tcp & quic) while only 1 public addr is reported by autonat::Event or autonat::NatStatus.

However, after reading the source codes, I think there will not be problem. I'll check again later.

tcp port reuse can't be enabled because it will lead to the "connection failed: IP-port is not unique" error because we can have several connections between peers.

~AFAIK, tcp port reuse is the requirement for dcutr.~ According to the source code, it seems that the port usage (the address used in negotiation) is not limited. But my own demo did not work if port_reuse was set to false. I'll do more research on this.

dtynn commented 6 months ago

I'll clarify that the goal is to use relays ONLY for hole punching purposes and not to send traffic through those relays in case direct connection is possible. This is important.

I have the same view here: in our case, relay peers only help others to negotiate and get connected directly. But default dcutr seem to be an optional choise, so we may have to implement our own dcutr-like protocol. Some assumptions were in the previous comment of this thread.

shamil-gadelshin commented 6 months ago

I'll clarify that the goal is to use relays ONLY for hole punching purposes and not to send traffic through those relays in case direct connection is possible. This is important.

It will require a workaround to stop relaying piece traffic before we add DCUTR but it seems possible.

shamil-gadelshin commented 6 months ago

My previous concern here is: what will the relayer act if 2 transports are enabled (tcp & quic)

FYI: we're disabling QUIC here: https://github.com/subspace/subspace/pull/2647

dtynn commented 6 months ago

Again, about the port usage in dcutr:

external addr candidates are used in handshaking with the remote peer, see inbound handler construction / outbound handler construction
candidates are updated on FromSwarm::NewExternalAddrCandidate, see candidates update
by default, candidates come from identify protocol, and are translated to use the specified listen port, see report new candidate / handle ToSwarm::NewExternalAddrCandidate / default addr translation used in tcp transport

so, the listener port are always used in dcutr instead of the dialer port, and I think that's why port_reuse are required.

There maybe a workaround if we handle the dialer port mannually, but that seems a little bit complicated.

dtynn commented 6 months ago

I'll clarify that the goal is to use relays ONLY for hole punching purposes and not to send traffic through those relays in case direct connection is possible. This is important.

It will require a workaround to stop relaying piece traffic before we add DCUTR but it seems possible.

Maybe we can just mannually close the circuit relayed connection on any dcutr error to enforce direct connections?

However, this method will not function if both ends disable dcutr. In this case, we can set max_circuit_duration & max_circuit_bytes more strictly to avoid abuse.

nazar-pc commented 6 months ago

Setting number of bytes well under 1M will make it useless for piece retrieval, but should be sufficient for hole punching purposes.

shamil-gadelshin commented 6 months ago

I'm not sure I follow your 'port reuse' statements. We use the events you described as well as port translation, however, if you set port_reuse flag for TCP transport it will affect the Dialer and autonat protocol will stop working. In the current DSN clients always connect to 'listen port' because of the port translation you mentioned. I think DCUTR example uses port_reuse to not implement port translation manually.

Using bandwidth limits to prevent piece transfer will lead to a massive amount of errors and significant network degradation (we should set the limits anyway). I suggest implementing a "connection-check" in the PieceProvider from subspace_networking - in the key loop - 1) Kademlia-lookup -> 2) Piece-request - we can add extra steps: 1) Kademlia-lookup -> 2) Peer connection type check -> 3) request connection upgrade if relayed connection (with timeout) -> 4) Piece-request.

Step 3) could be optimized: a) once a destination peer receives an inbound connection through a relay - it establishes a new direct connection and closes all relayed connections to the source peer b) source peer performs the connection check and waits (with timeout) for a direct connection without additional request.

We can either try to reuse DCUTR or as you suggested previously - implement our own similar protocol.

dtynn commented 6 months ago

I'm not sure I follow your 'port reuse' statements.

A default DCUTR upgrading case:

peer A dials to relay peer R, and gets an observed_addr_of_A with a random port from the underlying outbound connection, let's say: /ip4/some_public/tcp/61234/p2p/some_peer_id. The observed_addr_of_A is added as an external candidate.
A announces a relayed listen addr, and then peer B connect to it through R.
A proposes an upgrade upon the relayed conn, send the observed_addr_of_A to B, gets a candidate from B. Let's assume that the candidate of B is also not public accessible to make it simple. Then A start a tcp simulation: dial candidate addr of B (open a new tcp conn with another random port which is obviously not 61234) while waiting for an incoming dial from B to 61234.

A and B will always fail dialing to each other in this situation..

So,IMO, the main point is: A should tell B an addr for dialing in, which also means: A provides a port mapped by the route for B to send SYN(the hole). If we disable the port_reuse flag, I think this will not happen.

Hope I make my thoughts clear. Correct me if I made any mistake.

if you set port_reuse flag for TCP transport it will affect the Dialer and autonat protocol will stop working.

Yes, port_reuse will make autonat confused when direct connections created by dcutr inbound. There may be some work arounds:

increase the confidence_max,
ignoring addrs used in dcutr protocol.

For the hole-punching/port_reuse/autonat topic, I found another discussion from libp2p/specs: Consider only reusing TCP port when hole punching #389

Kademlia-lookup -> 2) Peer connection type check -> 3) request connection upgrade if relayed connection (with timeout) -> 4) Piece-request.

Maybe we can enforce the relay connections to upgrade to dcutr, regardless of what kind of requests they will send, so that we don't do any checks?

shamil-gadelshin commented 6 months ago

I think your workflow misses the address_translation call in the ToSwarm::NewExternalAddrCandidate event handler in the swarm code. The announced port won't be random it will be the listening port, at least it works like this now, I didn't test relayed configurations but I don't see a reason why it should work differently. Please, let me know if you spotted an error here.

Yes, port_reuse will make autonat confused when direct connections created by dcutr inbound.

Autonat will likely stop functioning correctly. Consider the following situation outside of the relay-dcutr case: a public peer A establishes a connection to another public peer B, having received a new connection - peer B will try to confirm its observed address. The current autonat settings lead to choosing the server from the connected peer list. B will request A to connect again (autonat probe). If we set port reuse then the probe will issue a second connection with a duplicate tuple (address A, port A, address B, port B) and fail with this reason. We encountered this error when were adding autonat.

I found a comment from DCUTR specification contributor without details in the discussion you noted: https://github.com/libp2p/specs/issues/389#issuecomment-1575532978

Maybe we can enforce the relay connections to upgrade to dcutr, regardless of what kind of requests they will send, so that we don't do any checks?

We likely can force the upgrade. But how will you prevent piece exchanges via relayed connection during the upgrade? When we issue a Kademlia request and establish a connection we immediately start getting a heavy piece from the remote peer. Did you mean another synchronization here?

dtynn commented 6 months ago

I think your workflow misses the address_translation call in the ToSwarm::NewExternalAddrCandidate event handler in the swarm code. The announced port won't be random it will be the listening port, at least it works like this now, I didn't test relayed configurations but I don't see a reason why it should work differently. Please, let me know if you spotted an error here.

I think I know what we missed before:

To make hole-punching functional, 2 peers should connect to each other with the same tcp-4-tuple, even neither of the ports in the tuple is listened locally. Correct me if I'm wrong here.
By setting port_reuse to true, the actual point is to bind the local external candidate addr before connect. Otherwise, 1) will not be satisfied.
By default (based on the tcp transport), only by enabling the port_reuse can make the following same:
- The external candidate addr, which is also the addr reported to the remote in the DCUTR protocl
- The addr binded to the socket before connecting.
That's why DCUTR only functions in this case (), according to here. Even we make another address_translation method, to let the exact observed_addr (without replacing the port part) be added as external candidate, the negotiation will fail, as a result of the missing binding part.

So if we can find an easy way to make the bind & connect thing happen as wished, then we could forget the port_use flag in the config. That means a custom (tcp) transport impl, & some config item like port_for_hole_punching ( or random picked on startup).

Autonat will likely stop functioning correctly. Consider the following situation outside of the relay-dcutr case: a public peer A establishes a connection to another public peer B, having received a new connection - peer B will try to confirm its observed address. The current autonat settings lead to choosing the server from the connected peer list. B will request A to connect again (autonat probe). If we set port reuse then the probe will issue a second connection with a duplicate tuple (address A, port A, address B, port B) and fail with this reason. We encountered this error when were adding autonat.

Agreed, and that's why I suggest drop the already known addrs from hole-punching for the autonat probe.

We likely can force the upgrade. But how will you prevent piece exchanges via relayed connection during the upgrade? When we issue a Kademlia request and establish a connection we immediately start getting a heavy piece from the remote peer. Did you mean another synchronization here?

I have no idea about this part now 😂 I will look into it later.

dtynn commented 6 months ago

When we issue a Kademlia request and establish a connection we immediately start getting a heavy piece from the remote peer.

Can we wrap the kad::Behaviour, ignore the established connection with relayed endpoints?

AFAIK, kad manages its connected peers based on other_established in FromSwarm::ConnectionEstablished & remaining_established in FromSwarm::ConnectionClosed. But it seems no harm if we construct dummy handlers for the relayed connections while handling events from them by default .

There's also one concern, is the kad protocol the only one we don't want on relayed connections?

shamil-gadelshin commented 6 months ago

I started a discussion in the upstream repository with both our questions about DCUTR and autonat: https://github.com/libp2p/rust-libp2p/discussions/5291

shamil-gadelshin commented 6 months ago

When we issue a Kademlia request and establish a connection we immediately start getting a heavy piece from the remote peer.

Can we wrap the kad::Behaviour, ignore the established connection with relayed endpoints?

AFAIK, kad manages its connected peers based on other_established in FromSwarm::ConnectionEstablished & remaining_established in FromSwarm::ConnectionClosed. But it seems no harm if we construct dummy handlers for the relayed connections while handling events from them by default .

There's also one concern, is the kad protocol the only one we don't want on relayed connections?

Kademlia is not an issue here because it uses a rather small number of bytes. I meant this line: https://github.com/subspace/subspace/blob/443b30652f64da1d91fe630758d5ee4168b565b9/crates/subspace-networking/src/utils/piece_provider.rs#L72

Piece requests consume the main traffic.

dtynn commented 6 months ago

When we issue a Kademlia request and establish a connection we immediately start getting a heavy piece from the remote peer.

Can we wrap the kad::Behaviour, ignore the established connection with relayed endpoints? AFAIK, kad manages its connected peers based on other_established in FromSwarm::ConnectionEstablished & remaining_established in FromSwarm::ConnectionClosed. But it seems no harm if we construct dummy handlers for the relayed connections while handling events from them by default . There's also one concern, is the kad protocol the only one we don't want on relayed connections?

Kademlia is not an issue here because it uses a rather small number of bytes. I meant this line:

https://github.com/subspace/subspace/blob/443b30652f64da1d91fe630758d5ee4168b565b9/crates/subspace-networking/src/utils/piece_provider.rs#L72

Piece requests consume the main traffic.

Ok, it finally calls kademlia.get_record. I'll check it.

dtynn commented 6 months ago

I started a discussion in the upstream repository with both our questions about DCUTR and autonat: libp2p/rust-libp2p#5291

To keep the other issue clean, I just comment here.

It seems that iroh-p2p uses port_reuse as well in the latest code, see build_transport.

Did you see any previous version that enables DCUTR without port_reuse? Please share the commit / tag / branch, so maybe I can test the code.

shamil-gadelshin commented 6 months ago

It seems that iroh-p2p uses port_reuse as well in the latest code, see build_transport.

Hmm, It's correct. My local iroh code turned out to be very old. That version had both dcutr and no port_reuse.

shamil-gadelshin commented 6 months ago

Ok, it finally calls kademlia.get_record. I'll check it.

FYI: We don't use get_record API - we use get_providers

dtynn commented 6 months ago

Ok, it finally calls kademlia.get_record. I'll check it.

FYI: We don't use get_record API - we use get_providers

Got that.

I followed this line.

However, they all fell on the handle_command call, and I think there won't be much difference. I'll be more careful next time.

dtynn commented 6 months ago

Hi, I wrote a demo to show one possible way of solving the holepunch vs autonat problem, and avoiding any kad traffic based on relayed connections.

The demo includes:

a self-defined address format wih the usage of the deprecated Protocol::P2pWebRtcDirect (see here), and a HolePunchTransport to handle it;
a wrapper of dcutr::Behaviour to only let addrs in 1) through;
a wrapper of autonat::Behaviour to ignore addrs in 1);
a wrapper of kad::Behaviour to ignore relayed connections.

A minimum showcase group of peers can be started by:

// 1. a peer to provide relay function
RUST_LOG=info ./libp2p-relay-demo --listen-port <port1> --seed 1 --relay-service

// 2. a peer helping autonat detection
RUST_LOG=info ./libp2p-relay-demo --listen-port <port2> --seed 2 --connect=/ip4/<public_ip_of_peer_1>/tcp/<port1>/p2p/12D3KooWPjceQrSwdWXPyLLeABRXmuqt69Rg3sBYbU1Nft9HyQ6X

// 3. a peer behind nat/firewall with kad enabled
RUST_LOG=info ./libp2p-relay-demo --listen-port <port3> --seed 3 --connect=/ip4/<public_ip_of_peer_1>/tcp/<port1>/p2p/12D3KooWPjceQrSwdWXPyLLeABRXmuqt69Rg3sBYbU1Nft9HyQ6X --listen-relayed --dcutr-port=<dcutr_port_of_peer3> --kad

// 4. a peer connected to 3) and send a kad put
RUST_LOG=info ./libp2p-relay-demo --listen-port <port4> --seed 4 --connect="/ip4/<public_ip_of_peer_1>/tcp/<port1>/p2p/12D3KooWPjceQrSwdWXPyLLeABRXmuqt69Rg3sBYbU1Nft9HyQ6X" --peer="/ip4/<public_ip_of_peer_1>/tcp/<port1>/p2p/12D3KooWPjceQrSwdWXPyLLeABRXmuqt69Rg3sBYbU1Nft9HyQ6X/p2p-circuit/p2p/12D3KooWQYhTNQdmr3ArTeUHRYzFg94BKyTkoWBDWez9kSCVe2Xo" --dcutr-port=<dcutr_port_of_peer4> --kad --kad-put="ka:va"

// 5. a peer connected to 3) and send a kad get
RUST_LOG=info ./libp2p-relay-demo --listen-port <port5> --seed 5 --connect="/ip4/<public_ip_of_peer_1>/tcp/<port1>/p2p/12D3KooWPjceQrSwdWXPyLLeABRXmuqt69Rg3sBYbU1Nft9HyQ6X" --peer="/ip4/<public_ip_of_peer_1>/tcp/<port1>/p2p/12D3KooWPjceQrSwdWXPyLLeABRXmuqt69Rg3sBYbU1Nft9HyQ6X/p2p-circuit/p2p/12D3KooWQYhTNQdmr3ArTeUHRYzFg94BKyTkoWBDWez9kSCVe2Xo" --dcutr-port=<dcutr_port_of_peer5> --kad --kad-get="ka"

shamil-gadelshin commented 6 months ago

First of all, thank you for the demo - I appreciate the effort.

Did you try it within a distributed cluster (i.e. AWS + home)?

I understood your solution as follows:

create a DCUTR wrapper to separate an address with port reuse from the relayed address and enable autonat this way,
create a Kademlia wrapper to not update its routing table during relayed connection requests,
use obsolete P2pWebRtcDirect protocol to distinguish address types.

If there are other key pieces, please, feel free to add them.

1) Is this description of the process correct?

Peer A (private), peer B(any), peer R(public relay), other peers.

A listens on DCUTR port and "regular port".

a) A connects to R using any port, A starts listening on the "port returned from Identify event for a connection to R" for relayed connections, A confirms its relayed address as external (via autonat and other peers) and advertises it via Kademlia
b) B gets a relayed external address from other peers (using Kademlia) and establishes a relayed connection to A
c) A initiates DCUTR protocol with B with port reuse on DCUTR port (different from relayed address port and regular port) and closes other connections with B
d) B gets data from A using direct connection

We don't need to protect Kademlia routing table from relayed addresses using wrapper. We use manual Kademlia mode and add addresses with filtering already - we can just add another filter.
Did you try adding a custom string and custom protocol parsing instead of obsolete protocol? There are many reasons to avoid using obsolete protocols (the obvious one - Protocol Labs will remove it upstream). There is an option to create a fork and add the new protocol manually but its better to avoid forks. Another question - do you think its possible to implement this separation using application level protocol like ping?

dtynn commented 6 months ago

Did you try it within a distributed cluster (i.e. AWS + home)?

Yes, I myself started the showcase group of peers above on two machines for testing:

peer 1(R) / 2 / 3(A) on the first machine, connect to each other via the public host ip
peer 4(B1) and 5(B2) on another machine, trying to connect to peer 3 and do same kad get/put action;
port of peer 1(R) is public accessible while all other ports are private

The peers worked as expected.

Is this description of the process correct?

Most are correct except 2 points:

for a): users specify a dcutr port with a flag --dcutr-port=<some port>, and peer A listens /ip4/0.0.0.0/tcp/<dcutr port>/p2p-webrtc-direct on start up, this addr is handled by HolePunchTransport.
for b): the relayed address of peer A is specified with a flag --peer=<some peer addr>, but i think it's possible to find the peer from within the network (since the logic of all the protocols are not changed).

The main reason for 1) is: The codes of the tcp socket creation(connect & bind) is hiden deeply inside the tcp transport's implement, so I decided reuse them (at least in this demo).

That is to say, for now, the HolePunchTransport is just a simple wrapper that receives self-defined direct addresses in both listening & dialing, translates them to normal addrs, and passes them to the underlying tcp::Transport.

The bad part of this decision is: all the peers with dcutr enabled in my demo actually listen to 2 ports (one regular & one for dcutr). However, if we finally decide not to listen any other ports except the regular one, we can achive it quick and simple by impl a totally customized transport.

Did you try adding a custom string and custom protocol parsing instead of obsolete protocol?

Yes, I tried to do so, but all the protocols that can be used in multiaddr are pre-defined as a enum and there are no such things like Protocol::Custom(String) or a trait MultiaddrParser. Also we can do nothing to the Multiaddr::iter part.

Another question - do you think its possible to implement this separation using application level protocol like ping.

Do you mean, if we can format a multiaddr with application level protocol, get something like /ip4/<some host ip>/tcp/<some port>/ping ? That's impossible. The two so called protocols are not the same.

shamil-gadelshin commented 6 months ago

That's impossible. The two so called protocols are not the same.

Application level protocol can manage connections. Track existing and start new ones. Potentially, it can contain all the logic that you have spread between swarm and behavior events. I wonder, are there any drawbacks to using that. Maybe it's too inconvenient.

shamil-gadelshin commented 6 months ago

@nazar-pc What do you think about having an additional separate port for DCUTR?

dtynn commented 6 months ago

Application level protocol can manage connections. Track existing and start new ones. Potentially, it can contain all the logic that you have spread between swarm and behavior events. I wonder, are there any drawbacks to using that. Maybe it's too inconvenient.

Ah, you mean implement a new appilcation level protocol (NetworkBehaviour) to manage all the holepunch stuffs.That's possible.

The only problem is the cost & difficulty to maintain our customized Transports & NetworkBehaviours. For now, I didn't see much trouble.

The demo was aimed at finding the key points & edge cases, and it is obviously not the final solution.

nazar-pc commented 6 months ago

@nazar-pc What do you think about having an additional separate port for DCUTR?

I don't like it to be honest and after skimming above it will not help with hole punching on the ports we want, it would basically be a separate port with 1 connection per peer on that port, is that correct? If so then this is not how it should work IMO.

dtynn commented 6 months ago

I think I should list some conclusions here after all these long comments.

Basic conclusions

Do we have to listen to a separate port to make hole punching functional —— NO Why did I listen to a separate port in the demo? —— To reuse the original tcp::Transport.
Do we have to choose one specific port for hole punching —— NO Why did I choose one specific port in the demo —— To avoid too many NewExternalAddrCandidates as there are other protocols based on them.

In above, the hole punching is not equal to the dcutr protocl provided by libp2p.

How could we achive the goals?

First of all, the basic progress:

choose one local port to be used in dialing to remote
exchange the ports between two peers through a relayed connection
dial to each other directly at the same time thus the hole punching

If we don't want to change the progress too much, then we should:

implment a Transport that can bind a specific local addr to the socket before connecting.
implment a NetworkBehaviour that handles the hole punching progress.

Two queistions:

How do we choose the local port(s)?
1. config from users
2. random picked
3. ports already used in established connections (at runtime)

I suggest i) or ii)

How do we tell the underlying Transport about what local port we want to bind in some specific dialing action?
1. for i) & ii), define a customized multiaddr format, like /ip4/<some ip>/tcp/<some port>/direct, use them in (virtual) listening on start up, and in dialing for hole punching.
2. for iii), since we have to tell the Transport about all the informations for socket creation only when we call dial method, then a format like /ip4/0.0.0.0/tcp/<local port>/direct/ip4/<remote ip>/tcp/<remote port> should be ok

If you guys agree with the above, or tell me what you prefer, I think I can complete a new demo in a couple of days. Then we can start another round of discussions about the details and to see if we can integrate the codes into subspace-networking

shamil-gadelshin commented 6 months ago

TL;DR

After the research conducted with a major contribution from @dtynn we chose to not proceed with DCUTR at this moment.

Rationale

Protocol labs global experiment with punchr reported 60-80% of success rate with DCUTR with both TCP and QUIC protocols. TCP success rate (25%) is much worse than with QUIC (75%).
Our last network measurement using random-walker (24h) found 1/3 of publicly accessible peers.
If we take a total TCP DCUTR success rate as 20% we can potentially add 50% of new publicly accessible peers which will significantly improve our connectivity.
We can't switch to QUIC protocol to improve the DCUTR success rate because it causes router problems in some setups. We disabled it recently: https://github.com/subspace/subspace/pull/2647
DCUTR relies on relayed connections and we don't want relayers to route heavy traffic (pieces) and it updates our main data acquiring loop with the connection type check: a) get-providers -> b) detect connection type and wait for connection upgrade if relayed -> c) get piece.
"Connection detection" step b) will result in waiting in 66% of cases after adding DCUTR (because of the massive addition of possible relayed connections) and it will fail in 80% of cases because of the projected TCP DCUTR success rate. Multiple connection errors will require increasing our retries number. The piece acquiring time will likely increase significantly maybe even 2-3 times.

Summary

Having the perfect implementation we will have a more robust network (+15% of accessible peers) but decrease our expected performance. We don't experience connectivity problems at the moment to justify significant performance degradation. Our use case (multiple short-lived connections) doesn't seem to benefit fully from DCUTR protocol.

Special thanks to @dtynn which showed the possibility of DCUTR solution in our case with the demo.

autonomys / subspace