Closed shamil-gadelshin closed 6 months ago
Hi, after digging into the circuit relay & dcutr protocols, I am wondering which of the relaying nodes below you are planning to implement and provide:
- subspace-bootstrap-node, as infrastructure
- subspace-relay-node, self-hosted by network participants
- subspace-node or subspace-farmer instances with relay optionally enabled by their owners
Or are there any other solutions?
You have enumerated possible options. However, centralized infrastructure (like bootstrap nodes) is a single point of failure. It's not critical though because relays only improve network capabilities rather than provide core services. Embedded relay functions for nodes or farmers with strict limits seem like a more robust solution. Hybrid solutions are also possible, for example - we start testing with dedicated relay nodes, measure the effect, and upgrade with mass relays.
Thank you for responding.
I'm also thinking of an embedded relay for nodes & farmers. For now, I have some basic assumptions:
I'm wondering if you guys have made some progress on this topic?
I've already looked through some related networking issues and PRs but found little help. Please let me know if I missed something.
a node should be either serverside or clientside, or even neither (relay disabled)
This status should be derived automatically using autonat protocol
clients with dcutr-like protocol should be prioritized, or even required
I don't think libp2p supports QoS for protocols.
clients with too many relayed listen-ons declared should be denied
It's a possible security check. However, each relay will have a "connection limit per peer" as each peer has now. It will prevent reconnection from the same malicious peer.
I'm wondering if you guys have made some progress on this topic?
We didn't have any progress here besides adding autonat.
This status should be derived automatically using autonat protocol
I mean, the behaviour should be something like Either<libp2p::relay::Behaviour, libp2p::relay::client::Behaviour> so that a node would not relay traffic while asking some peers to connect to it through other relayers.
I don't think libp2p supports QoS for protocols.
Yes, and this may require customized relay behaviour, for example, check the supported protocols in libp2p::identify::Event::Received before handling libp2p::relay::handler::Event::ReservationReqReceived.
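A minimal sketch of that check, assuming the relay keeps the protocol list each peer reported through identify. `decide_reservation` and `ReservationDecision` are hypothetical names for illustration and are not part of libp2p; only the `/libp2p/dcutr` protocol id comes from the libp2p specs.

```rust
/// Protocol id of DCUtR as defined in the libp2p specs.
pub const DCUTR_PROTOCOL: &str = "/libp2p/dcutr";

/// Outcome of the (hypothetical) reservation policy.
#[derive(Debug, PartialEq, Eq)]
pub enum ReservationDecision {
    Accept,
    Deny,
}

/// Decide whether to accept a relay reservation request.
///
/// `identified_protocols` is the list a peer reported through identify,
/// or `None` if no identify info has been received from it yet.
pub fn decide_reservation(identified_protocols: Option<&[&str]>) -> ReservationDecision {
    match identified_protocols {
        // Only peers that can upgrade to a direct connection get a slot.
        Some(protocols) if protocols.contains(&DCUTR_PROTOCOL) => ReservationDecision::Accept,
        // Unknown peers and peers without dcutr support are denied.
        _ => ReservationDecision::Deny,
    }
}
```

Denying peers whose identify info has not arrived yet is a design choice of this sketch; a real behaviour could instead queue the request until identify completes.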
It's a possible security check. However, each relay will have a "connection limit per peer" as each peer has now. It will prevent reconnection from the same malicious peer.
I was worried that a node would make a reservation request to every node it reaches, but after thinking about the situation once again, I realized that this seems to be a normal scenario in p2p networking, so just forget this.
We didn't have any progress here besides adding autonat.
Ok, I will check the source code, the examples, and other projects that depend on libp2p.
Thank you very much for the information.
Hi
After reading rust-libp2p/examples/relay-server, rust-libp2p/examples/dcutr, go-libp2p/p2p/host/autorelay and kubo/core/node/libp2p/relay.go, and testing, I have some thoughts about hole punching.
First, I would describe the two basic roles as:
- relayer: a peer which is publicly reachable and is willing to provide relay function for other peers.
- relay-client: a peer that either 1) is not publicly reachable but would like to be relayed, or 2) would like to connect to relayed peers.
Some conclusions:
- port_reuse for tcp transport must be enabled to allow the dcutr protocol for a relay-client.
- autonat::Event::StatusChanged { new: NatStatus::Public(addr), .. } => this node can be a relayer for other peers;
- autonat::Event::StatusChanged { new: NatStatus::Private, .. } => this node should be a relay-client.
- identify::Event::Received:
  - observed_addr should be added into swarm's external addresses, in order to allow relay reservation from others
  - if the peer's protocols include relay::protocol::HOP_PROTOCOL_NAME, it would be recorded as a candidate relayer
- swarm::ConnectionEstablished { peer_id, endpoint: Dialer { address, role_override }, .. }:
  - Endpoint::Dialer => address could be used to construct a circuit listen addr used by a relay-client
  - Endpoint::Listener => this connection is used to negotiate roles in the dcutr protocol, and should be handled differently from above;
I would start following this document and its stages: https://docs.libp2p.io/concepts/nat/hole-punching/
We have already added the first ingredient (autonat) for nat detection. The next logical step is enabling the relayed connections without DCUTR. Several things to consider:
- identify protocol isn't a source of external addresses (only candidates) - autonat is.
The end goal is to improve the network connectivity by including private farmer's caches in the global pool.
I'll clarify that the goal is to use relays ONLY for hole punching purposes and not to send traffic through those relays in case direct connection is possible. This is important.
potential relayers should have strict limits on connection number and preferably bandwidth to avoid abusing
There are some default configurable limits in the relay protocol Config, but no direct bandwidth controls.
identify protocol isn't a source of external addresses (only candidates) - autonat is.
That's correct. My previous concern here was: how will the relayer act if 2 transports are enabled (tcp & quic) while only 1 public addr is reported by autonat::Event or autonat::NatStatus?
However, after reading the source code, I think there will be no problem. I'll check again later.
tcp port reuse can't be enabled because it will lead to the "connection failed: IP-port is not unique" error because we can have several connections between peers.
~AFAIK, tcp port reuse is the requirement for dcutr.~
According to the source code, it seems that the port usage (the address used in negotiation) is not limited. But my own demo did not work if port_reuse was set to false. I'll do more research on this.
I'll clarify that the goal is to use relays ONLY for hole punching purposes and not to send traffic through those relays in case direct connection is possible. This is important.
I have the same view here: in our case, relay peers only help others to negotiate and get connected directly. But the default dcutr seems to be an optional choice, so we may have to implement our own dcutr-like protocol. Some assumptions were in the previous comment of this thread.
I'll clarify that the goal is to use relays ONLY for hole punching purposes and not to send traffic through those relays in case direct connection is possible. This is important.
It will require a workaround to stop relaying piece traffic before we add DCUTR but it seems possible.
My previous concern here is: what will the relayer act if 2 transports are enabled (tcp & quic)
FYI: we're disabling QUIC here: https://github.com/subspace/subspace/pull/2647
Again, about the port usage in dcutr:
- the address candidates are collected from FromSwarm::NewExternalAddrCandidate, see candidates update
- the candidates come from the identify protocol, and are translated to use the specified listen port, see report new candidate / handle ToSwarm::NewExternalAddrCandidate / default addr translation used in tcp transport
So the listener port is always used in dcutr instead of the dialer port, and I think that's why port_reuse is required.
There may be a workaround if we handle the dialer port manually, but that seems a little bit complicated.
I'll clarify that the goal is to use relays ONLY for hole punching purposes and not to send traffic through those relays in case direct connection is possible. This is important.
It will require a workaround to stop relaying piece traffic before we add DCUTR but it seems possible.
Maybe we can just manually close the circuit relayed connection on any dcutr error to enforce direct connections?
However, this method will not work if both ends disable dcutr. In this case, we can set max_circuit_duration & max_circuit_bytes more strictly to avoid abuse.
Setting number of bytes well under 1M will make it useless for piece retrieval, but should be sufficient for hole punching purposes.
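A rough sketch of how such a budget behaves, assuming the relay tracks elapsed time and relayed bytes per circuit. `CircuitLimits` is a hypothetical local struct that only borrows the two names mentioned above; it is not the libp2p relay Config type, and the 30 s / 128 KiB values are illustrative assumptions.

```rust
use std::time::Duration;

/// Hypothetical mirror of the two relay limits discussed above.
#[derive(Debug, Clone, Copy)]
pub struct CircuitLimits {
    pub max_circuit_duration: Duration,
    pub max_circuit_bytes: u64,
}

/// Whether a relayed circuit may continue, and if not, why.
#[derive(Debug, PartialEq, Eq)]
pub enum CircuitVerdict {
    Continue,
    DurationExceeded,
    BytesExceeded,
}

impl CircuitLimits {
    /// Limits meant to fit a hole-punching handshake but not a piece transfer
    /// (values are assumptions for illustration only).
    pub fn hole_punching_only() -> Self {
        Self {
            max_circuit_duration: Duration::from_secs(30),
            max_circuit_bytes: 128 * 1024,
        }
    }

    /// Check the current usage of one circuit against the limits.
    pub fn check(&self, elapsed: Duration, bytes_relayed: u64) -> CircuitVerdict {
        if elapsed > self.max_circuit_duration {
            CircuitVerdict::DurationExceeded
        } else if bytes_relayed > self.max_circuit_bytes {
            CircuitVerdict::BytesExceeded
        } else {
            CircuitVerdict::Continue
        }
    }
}
```

With these values a few kilobytes of negotiation traffic pass, while a transfer approaching 1M is cut off.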
I'm not sure I follow your 'port reuse' statements. We use the events you described as well as port translation, however, if you set port_reuse flag for TCP transport it will affect the Dialer and autonat protocol will stop working. In the current DSN clients always connect to 'listen port' because of the port translation you mentioned. I think DCUTR example uses port_reuse to not implement port translation manually.
Using bandwidth limits to prevent piece transfer will lead to a massive amount of errors and significant network degradation (we should set the limits anyway). I suggest implementing a "connection-check" in the PieceProvider from subspace_networking - in the key loop - 1) Kademlia-lookup -> 2) Piece-request - we can add extra steps:
1) Kademlia-lookup -> 2) Peer connection type check -> 3) request connection upgrade if relayed connection (with timeout) -> 4) Piece-request.
Step 3) could be optimized: a) once a destination peer receives an inbound connection through a relay - it establishes a new direct connection and closes all relayed connections to the source peer b) source peer performs the connection check and waits (with timeout) for a direct connection without additional request.
We can either try to reuse DCUTR or as you suggested previously - implement our own similar protocol.
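The connection check in steps 2) and 3) could look roughly like the sketch below. It assumes connections are identified by their multiaddr strings and treats any address containing a `p2p-circuit` component as relayed; `next_step` and `NextStep` are hypothetical names and not part of subspace_networking.

```rust
/// Returns true when a multiaddr string goes through a circuit relay.
pub fn is_relayed(addr: &str) -> bool {
    addr.split('/').any(|part| part == "p2p-circuit")
}

/// What the piece acquisition loop should do next for one provider.
#[derive(Debug, PartialEq, Eq)]
pub enum NextStep {
    /// A direct connection exists: go to the piece request.
    RequestPiece,
    /// Only relayed connections exist: wait for the upgrade (with timeout).
    WaitForUpgrade,
    /// No usable connection, or the upgrade timed out: try another provider.
    SkipPeer,
}

/// Decide the next step from the addresses of the connections currently
/// established to the provider.
pub fn next_step(connection_addrs: &[&str], upgrade_timed_out: bool) -> NextStep {
    let has_direct = connection_addrs.iter().any(|addr| !is_relayed(addr));
    if has_direct {
        NextStep::RequestPiece
    } else if connection_addrs.is_empty() || upgrade_timed_out {
        NextStep::SkipPeer
    } else {
        NextStep::WaitForUpgrade
    }
}
```

The piece request is only issued in the `RequestPiece` case, so no piece traffic is sent over a relayed connection.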
I'm not sure I follow your 'port reuse' statements.
A default DCUTR upgrading case:
1) A dials relay peer R, and gets an observed_addr_of_A with a random port from the underlying outbound connection, let's say: /ip4/some_public/tcp/61234/p2p/some_peer_id. The observed_addr_of_A is added as an external candidate.
2) A announces a relayed listen addr, and then peer B connects to it through R.
3) A proposes an upgrade over the relayed conn, sends the observed_addr_of_A to B, and gets a candidate from B. Let's assume that the candidate of B is also not publicly accessible, to keep it simple.
4) Then A starts a tcp simultaneous open: it dials the candidate addr of B (opening a new tcp conn with another random port, which is obviously not 61234) while waiting for an incoming dial from B to 61234.
A and B will always fail dialing each other in this situation.
So, IMO, the main point is: A should tell B an addr for dialing in, which also means: A provides a port mapped by the router for B to send SYN to (the hole). If we disable the port_reuse flag, I think this will not happen.
Hope I made my thoughts clear. Correct me if I made any mistake.
if you set port_reuse flag for TCP transport it will affect the Dialer and autonat protocol will stop working.
Yes, port_reuse will confuse autonat when inbound direct connections are created by dcutr. There may be some workarounds:
- ... confidence_max,
For the hole-punching/port_reuse/autonat topic, I found another discussion in libp2p/specs: Consider only reusing TCP port when hole punching #389
1) Kademlia-lookup -> 2) Peer connection type check -> 3) request connection upgrade if relayed connection (with timeout) -> 4) Piece-request.
Maybe we can enforce the relay connections to upgrade to dcutr, regardless of what kind of requests they will send, so that we don't do any checks?
I think your workflow misses the address_translation call in the ToSwarm::NewExternalAddrCandidate event handler in the swarm code. The announced port won't be random it will be the listening port, at least it works like this now, I didn't test relayed configurations but I don't see a reason why it should work differently. Please, let me know if you spotted an error here.
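To illustrate the effect of that translation, here is a simplified sketch on plain multiaddr strings: it keeps the observed IP but swaps in the listening port. `translate` is a hypothetical helper written for this comment and only handles /ip4/.../tcp/... addresses; it is not the libp2p address_translation function itself.

```rust
/// Returns the value that follows the `name` component in a multiaddr string,
/// e.g. component("/ip4/1.2.3.4/tcp/80", "tcp") == Some("80").
fn component<'a>(addr: &'a str, name: &str) -> Option<&'a str> {
    let mut parts = addr.split('/').skip(1);
    while let Some(part) = parts.next() {
        if part == name {
            return parts.next();
        }
    }
    None
}

/// Combine the IP seen by the remote (`observed`) with the port we actually
/// listen on (`listen`), which is the address that gets announced.
pub fn translate(listen: &str, observed: &str) -> Option<String> {
    let ip = component(observed, "ip4")?;
    let port = component(listen, "tcp")?;
    Some(format!("/ip4/{}/tcp/{}", ip, port))
}
```

For a listener on /ip4/0.0.0.0/tcp/30333 observed as /ip4/203.0.113.7/tcp/61234, the candidate becomes /ip4/203.0.113.7/tcp/30333, so the random outbound port 61234 is never announced.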
Yes, port_reuse will make autonat confused when direct connections created by dcutr inbound.
Autonat will likely stop functioning correctly. Consider the following situation outside of the relay-dcutr case: a public peer A establishes a connection to another public peer B. Having received a new connection, peer B will try to confirm its observed address. The current autonat settings lead to choosing the server from the connected peer list. B will request A to connect again (autonat probe). If we set port reuse, then the probe will issue a second connection with a duplicate tuple (address A, port A, address B, port B) and fail for this reason. We encountered this error when we were adding autonat.
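The duplicate tuple problem can be modelled with a small sketch. `ConnectionTable` is a hypothetical stand-in for the OS view of open TCP connections (local address, remote address); the addresses are made-up documentation addresses.

```rust
use std::collections::HashSet;
use std::net::SocketAddr;

/// Hypothetical model of open TCP connections, keyed by (local, remote).
#[derive(Default)]
pub struct ConnectionTable {
    open: HashSet<(SocketAddr, SocketAddr)>,
}

impl ConnectionTable {
    /// Open a connection; fails if the same (local, remote) pair is already in use.
    pub fn connect(&mut self, local: SocketAddr, remote: SocketAddr) -> Result<(), String> {
        if self.open.insert((local, remote)) {
            Ok(())
        } else {
            Err(format!("connection failed: {} -> {} is not unique", local, remote))
        }
    }
}
```

With port reuse, A dials B from its listen port both for the original connection and for the autonat probe, so the second `connect` hits the same pair and fails. Without port reuse the probe leaves from a fresh ephemeral port and succeeds.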
I found a comment from DCUTR specification contributor without details in the discussion you noted: https://github.com/libp2p/specs/issues/389#issuecomment-1575532978
Maybe we can enforce the relay connections to upgrade to dcutr, regardless of what kind of requests they will send, so that we don't do any checks?
We likely can force the upgrade. But how will you prevent piece exchanges via relayed connection during the upgrade? When we issue a Kademlia request and establish a connection we immediately start getting a heavy piece from the remote peer. Did you mean another synchronization here?
I think your workflow misses the address_translation call in the ToSwarm::NewExternalAddrCandidate event handler in the swarm code. The announced port won't be random it will be the listening port, at least it works like this now, I didn't test relayed configurations but I don't see a reason why it should work differently. Please, let me know if you spotted an error here.
I think I know what we missed before:
- ... port_reuse to true, the actual point is to bind the local external candidate addr before connect. Otherwise, 1) will not be satisfied.
By default (based on the tcp transport), only enabling port_reuse makes the following the same:
That's why DCUTR only functions in this case, according to here.
Even if we write another address_translation method that lets the exact observed_addr (without replacing the port part) be added as an external candidate, the negotiation will fail because the binding part is missing.
So if we can find an easy way to make the bind & connect step happen as we wish, then we can forget about the port_reuse flag in the config. That means a custom (tcp) transport impl, & some config item like port_for_hole_punching (or a port randomly picked on startup).
Autonat will likely stop functioning correctly. Consider the following situation outside of the relay-dcutr case: a public peer A establishes a connection to another public peer B, having received a new connection - peer B will try to confirm its observed address. The current autonat settings lead to choosing the server from the connected peer list. B will request A to connect again (autonat probe). If we set port reuse then the probe will issue a second connection with a duplicate tuple (address A, port A, address B, port B) and fail with this reason. We encountered this error when were adding autonat.
Agreed, and that's why I suggest dropping the addrs already known from hole punching from the autonat probe.
We likely can force the upgrade. But how will you prevent piece exchanges via relayed connection during the upgrade? When we issue a Kademlia request and establish a connection we immediately start getting a heavy piece from the remote peer. Did you mean another synchronization here?
I have no idea about this part now 😂 I will look into it later.
When we issue a Kademlia request and establish a connection we immediately start getting a heavy piece from the remote peer.
Can we wrap the kad::Behaviour and ignore established connections with relayed endpoints?
AFAIK, kad manages its connected peers based on other_established in FromSwarm::ConnectionEstablished & remaining_established in FromSwarm::ConnectionClosed. But there seems to be no harm if we construct dummy handlers for the relayed connections while handling events from them by default.
There's also one concern: is the kad protocol the only one we don't want on relayed connections?
I started a discussion in the upstream repository with both our questions about DCUTR and autonat: https://github.com/libp2p/rust-libp2p/discussions/5291
When we issue a Kademlia request and establish a connection we immediately start getting a heavy piece from the remote peer.
Can we wrap the kad::Behaviour, ignore the established connection with relayed endpoints? AFAIK, kad manages its connected peers based on other_established in FromSwarm::ConnectionEstablished & remaining_established in FromSwarm::ConnectionClosed. But it seems no harm if we construct dummy handlers for the relayed connections while handling events from them by default. There's also one concern, is the kad protocol the only one we don't want on relayed connections?
Kademlia is not an issue here because it uses a rather small number of bytes. I meant this line: https://github.com/subspace/subspace/blob/443b30652f64da1d91fe630758d5ee4168b565b9/crates/subspace-networking/src/utils/piece_provider.rs#L72
Piece requests consume the main traffic.
When we issue a Kademlia request and establish a connection we immediately start getting a heavy piece from the remote peer.
Can we wrap the kad::Behaviour, ignore the established connection with relayed endpoints? AFAIK, kad manages its connected peers based on other_established in FromSwarm::ConnectionEstablished & remaining_established in FromSwarm::ConnectionClosed. But it seems no harm if we construct dummy handlers for the relayed connections while handling events from them by default. There's also one concern, is the kad protocol the only one we don't want on relayed connections?
Kademlia is not an issue here because it uses a rather small number of bytes. I meant this line:
Piece requests consume the main traffic.
Ok, it finally calls kademlia.get_record. I'll check it.
I started a discussion in the upstream repository with both our questions about DCUTR and autonat: libp2p/rust-libp2p#5291
To keep the other issue clean, I just comment here.
It seems that iroh-p2p uses port_reuse as well in the latest code, see build_transport.
Did you see any previous version that enables DCUTR without port_reuse? Please share the commit / tag / branch, so maybe I can test the code.
It seems that iroh-p2p uses port_reuse as well in the latest code, see build_transport.
Hmm, It's correct. My local iroh code turned out to be very old. That version had both dcutr and no port_reuse.
Ok, it finally calls kademlia.get_record. I'll check it.
FYI: We don't use get_record API - we use get_providers
Ok, it finally calls kademlia.get_record. I'll check it.
FYI: We don't use get_record API - we use get_providers
Got that.
I followed this line.
However, they all fell on the handle_command call, and I think there won't be much difference. I'll be more careful next time.
Hi, I wrote a demo to show one possible way of solving the holepunch vs autonat problem, and avoiding any kad traffic based on relayed connections.
The demo includes:
1) Protocol::P2pWebRtcDirect (see here), and a HolePunchTransport to handle it;
2) dcutr::Behaviour to only let addrs in 1) through;
3) autonat::Behaviour to ignore addrs in 1);
4) kad::Behaviour to ignore relayed connections.
A minimum showcase group of peers can be started by:
// 1. a peer to provide relay function
RUST_LOG=info ./libp2p-relay-demo --listen-port <port1> --seed 1 --relay-service
// 2. a peer helping autonat detection
RUST_LOG=info ./libp2p-relay-demo --listen-port <port2> --seed 2 --connect=/ip4/<public_ip_of_peer_1>/tcp/<port1>/p2p/12D3KooWPjceQrSwdWXPyLLeABRXmuqt69Rg3sBYbU1Nft9HyQ6X
// 3. a peer behind nat/firewall with kad enabled
RUST_LOG=info ./libp2p-relay-demo --listen-port <port3> --seed 3 --connect=/ip4/<public_ip_of_peer_1>/tcp/<port1>/p2p/12D3KooWPjceQrSwdWXPyLLeABRXmuqt69Rg3sBYbU1Nft9HyQ6X --listen-relayed --dcutr-port=<dcutr_port_of_peer3> --kad
// 4. a peer connected to 3) and send a kad put
RUST_LOG=info ./libp2p-relay-demo --listen-port <port4> --seed 4 --connect="/ip4/<public_ip_of_peer_1>/tcp/<port1>/p2p/12D3KooWPjceQrSwdWXPyLLeABRXmuqt69Rg3sBYbU1Nft9HyQ6X" --peer="/ip4/<public_ip_of_peer_1>/tcp/<port1>/p2p/12D3KooWPjceQrSwdWXPyLLeABRXmuqt69Rg3sBYbU1Nft9HyQ6X/p2p-circuit/p2p/12D3KooWQYhTNQdmr3ArTeUHRYzFg94BKyTkoWBDWez9kSCVe2Xo" --dcutr-port=<dcutr_port_of_peer4> --kad --kad-put="ka:va"
// 5. a peer connected to 3) and send a kad get
RUST_LOG=info ./libp2p-relay-demo --listen-port <port5> --seed 5 --connect="/ip4/<public_ip_of_peer_1>/tcp/<port1>/p2p/12D3KooWPjceQrSwdWXPyLLeABRXmuqt69Rg3sBYbU1Nft9HyQ6X" --peer="/ip4/<public_ip_of_peer_1>/tcp/<port1>/p2p/12D3KooWPjceQrSwdWXPyLLeABRXmuqt69Rg3sBYbU1Nft9HyQ6X/p2p-circuit/p2p/12D3KooWQYhTNQdmr3ArTeUHRYzFg94BKyTkoWBDWez9kSCVe2Xo" --dcutr-port=<dcutr_port_of_peer5> --kad --kad-get="ka"
First of all, thank you for the demo - I appreciate the effort.
Did you try it within a distributed cluster (i.e. AWS + home)?
I understood your solution as follows:
If there are other key pieces, please, feel free to add them.
1) Is this description of the process correct?
Peer A (private), peer B(any), peer R(public relay), other peers.
A listens on DCUTR port and "regular port".
We don't need to protect Kademlia routing table from relayed addresses using wrapper. We use manual Kademlia mode and add addresses with filtering already - we can just add another filter.
Did you try adding a custom string and custom protocol parsing instead of an obsolete protocol? There are many reasons to avoid using obsolete protocols (the obvious one: Protocol Labs will remove it upstream). There is an option to create a fork and add the new protocol manually, but it's better to avoid forks. Another question: do you think it's possible to implement this separation using an application level protocol like ping?
Did you try it within a distributed cluster (i.e. AWS + home)?
Yes, I myself started the showcase group of peers above on two machines for testing:
The peers worked as expected.
- Is this description of the process correct?
Most are correct except 2 points:
1) ... --dcutr-port=<some port>, and peer A listens on /ip4/0.0.0.0/tcp/<dcutr port>/p2p-webrtc-direct on start up; this addr is handled by HolePunchTransport.
2) ... --peer=<some peer addr>, but I think it's possible to find the peer from within the network (since the logic of all the protocols is not changed).
The main reason for 1) is: the code for tcp socket creation (connect & bind) is hidden deep inside the tcp transport's implementation, so I decided to reuse it (at least in this demo).
That is to say, for now, the HolePunchTransport is just a simple wrapper that receives self-defined direct addresses in both listening & dialing, translates them to normal addrs, and passes them to the underlying tcp::Transport.
The bad part of this decision is: all the peers with dcutr enabled in my demo actually listen on 2 ports (one regular & one for dcutr). However, if we finally decide not to listen on any ports except the regular one, we can achieve that quickly and simply by implementing a totally customized transport.
Did you try adding a custom string and custom protocol parsing instead of obsolete protocol?
Yes, I tried to do so, but all the protocols that can be used in a multiaddr are pre-defined as an enum, and there is no such thing as Protocol::Custom(String) or a trait MultiaddrParser.
Also, we can do nothing about the Multiaddr::iter part.
Another question - do you think its possible to implement this separation using application level protocol like ping.
Do you mean formatting a multiaddr with an application level protocol, to get something like /ip4/<some host ip>/tcp/<some port>/ping?
That's impossible. The two so-called protocols are not the same.
That's impossible. The two so called protocols are not the same.
Application level protocol can manage connections. Track existing and start new ones. Potentially, it can contain all the logic that you have spread between swarm and behavior events. I wonder, are there any drawbacks to using that. Maybe it's too inconvenient.
@nazar-pc What do you think about having an additional separate port for DCUTR?
Application level protocol can manage connections. Track existing and start new ones. Potentially, it can contain all the logic that you have spread between swarm and behavior events. I wonder, are there any drawbacks to using that. Maybe it's too inconvenient.
Ah, you mean implementing a new application level protocol (NetworkBehaviour) to manage all the hole punching stuff. That's possible.
The only problem is the cost & difficulty of maintaining our customized Transports & NetworkBehaviours. For now, I don't see much trouble.
The demo was aimed at finding the key points & edge cases, and it is obviously not the final solution.
@nazar-pc What do you think about having an additional separate port for DCUTR?
I don't like it to be honest and after skimming above it will not help with hole punching on the ports we want, it would basically be a separate port with 1 connection per peer on that port, is that correct? If so then this is not how it should work IMO.
I think I should list some conclusions here after all these long comments.
- ... tcp::Transport.
- ... NewExternalAddrCandidates as there are other protocols based on them.
In the above, the hole punching is not equal to the dcutr protocol provided by libp2p.
First of all, the basic progress:
If we don't want to change the progress too much, then we should:
- ... Transport that can bind a specific local addr to the socket before connecting.
- ... NetworkBehaviour that handles the hole punching progress.
I suggest i) or ii)
- ... Transport about what local port we want to bind in some specific dialing action?
- ... /ip4/<some ip>/tcp/<some port>/direct, use them in (virtual) listening on start up, and in dialing for hole punching.
- ... Transport about all the information for socket creation only when we call the dial method, then a format like /ip4/0.0.0.0/tcp/<local port>/direct/ip4/<remote ip>/tcp/<remote port> should be ok
If you guys agree with the above, or tell me what you prefer, I think I can complete a new demo in a couple of days. Then we can start another round of discussions about the details and see if we can integrate the code into subspace-networking
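As an illustration of the combined dial format, here is a sketch of how a custom transport might split such an address into the local socket to bind and the remote socket to connect to. The /direct component is the self-defined marker proposed in this thread, not an existing multiaddr protocol, and `parse_direct_dial` is a hypothetical helper that works on plain strings and covers only the dial form.

```rust
use std::net::{Ipv4Addr, SocketAddrV4};

/// Parse exactly "/ip4/<ip>/tcp/<port>" into a socket address.
fn parse_ip4_tcp(s: &str) -> Option<SocketAddrV4> {
    let parts: Vec<&str> = s.split('/').collect();
    match parts.as_slice() {
        ["", "ip4", ip, "tcp", port] => {
            let ip: Ipv4Addr = ip.parse().ok()?;
            let port: u16 = port.parse().ok()?;
            Some(SocketAddrV4::new(ip, port))
        }
        _ => None,
    }
}

/// Split "/ip4/<local ip>/tcp/<local port>/direct/ip4/<remote ip>/tcp/<remote port>"
/// into (local addr to bind, remote addr to connect to).
pub fn parse_direct_dial(addr: &str) -> Option<(SocketAddrV4, SocketAddrV4)> {
    let (local, remote) = addr.split_once("/direct")?;
    Some((parse_ip4_tcp(local)?, parse_ip4_tcp(remote)?))
}
```

The transport would then bind the socket to the local address before connecting to the remote one, which is the bind & connect step discussed earlier.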
After the research conducted with a major contribution from @dtynn we chose to not proceed with DCUTR at this moment.
Protocol Labs' global experiment with punchr reported a 60-80% DCUTR success rate with both TCP and QUIC protocols. The TCP success rate (25%) is much worse than QUIC's (75%).
Our last network measurement using random-walker (24h) found that 1/3 of peers are publicly accessible.
If we take the total TCP DCUTR success rate as 20%, we can potentially add 50% more publicly accessible peers, which would significantly improve our connectivity.
We can't switch to the QUIC protocol to improve the DCUTR success rate because it causes router problems in some setups. We disabled it recently: https://github.com/subspace/subspace/pull/2647
DCUTR relies on relayed connections, and we don't want relayers to route heavy traffic (pieces), so it changes our main data acquisition loop by adding a connection type check: a) get-providers -> b) detect connection type and wait for connection upgrade if relayed -> c) get piece.
The "connection detection" step b) will result in waiting in 66% of cases after adding DCUTR (because of the massive addition of possible relayed connections), and it will fail in 80% of cases because of the projected TCP DCUTR success rate. Multiple connection errors will require increasing our number of retries. The piece acquisition time will likely increase significantly, maybe even 2-3 times.
Even with a perfect implementation we would have a more robust network (+15% of accessible peers) but lower expected performance. We don't experience connectivity problems at the moment that would justify a significant performance degradation. Our use case (multiple short-lived connections) doesn't seem to benefit fully from the DCUTR protocol.
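The estimates above can be reproduced with a small calculation, assuming every private peer becomes reachable only through a relayed connection followed by a DCUTR upgrade. `project` and `Projection` are hypothetical names used for this sketch; the inputs are the 1/3 public share and the assumed 20% TCP success rate.

```rust
/// Projected effect of adding DCUTR, as fractions in the range 0.0..=1.0.
#[derive(Debug)]
pub struct Projection {
    /// Share of all peers that become directly reachable thanks to DCUTR.
    pub newly_reachable: f64,
    /// Gain relative to the currently public share.
    pub relative_gain: f64,
    /// Share of peers reached through a relay first (step b has to wait).
    pub wait_fraction: f64,
    /// Share of upgrade attempts expected to fail.
    pub upgrade_failure: f64,
}

pub fn project(public_fraction: f64, dcutr_success: f64) -> Projection {
    let private_fraction = 1.0 - public_fraction;
    let newly_reachable = private_fraction * dcutr_success;
    Projection {
        newly_reachable,
        relative_gain: newly_reachable / public_fraction,
        wait_fraction: private_fraction,
        upgrade_failure: 1.0 - dcutr_success,
    }
}
```

With `project(1.0 / 3.0, 0.2)` the waiting share is about 66% and the failure share is 80%, matching the figures above. The newly reachable share comes out at about 13% of all peers, close to the +15% estimate, and the relative gain is about 40%; a 25% success rate would give the 50% figure.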
Special thanks to @dtynn, who showed with the demo that a DCUTR solution is possible in our case.
Relates to https://github.com/subspace/subspace/issues/1022