dariusc93 / rust-ipfs

The InterPlanetary File System (IPFS), implemented in Rust.
Apache License 2.0
36 stars 7 forks source link

relay_manager.random_select() blocking forever #199

Closed ligustah closed 1 month ago

ligustah commented 1 month ago

Describe the bug Under some conditions the loop in random_select never exits. In my case, there appears to be only one connection for a given peer and that turns out to be on in the blacklist. Since this function is called from my event loop, it essentially kills the app.

To Reproduce I can't reliably reproduce, but I believe the issue happens when connections to a relay node drop (temporarily). I'm assuming that I am not correctly pushing in the right events and/or calling the method at the wrong time.

Expected behavior I would expect the function to not block, even when used potentially incorrectly.

Environment (please complete the following information)

dariusc93 commented 1 month ago

Thank you for reporting the bug. I do agree this does seem to happen at random and the select functionality could be better. I do plan on rewriting a large portion of the library though, however for this bug, we could do the following

diff --git a/packages/libp2p-relay-manager/src/lib.rs b/packages/libp2p-relay-manager/src/lib.rs
index 23266553..a33dd39e 100644
--- a/packages/libp2p-relay-manager/src/lib.rs
+++ b/packages/libp2p-relay-manager/src/lib.rs
@@ -46,7 +46,7 @@ pub enum Event {
     },
 }

-#[derive(Debug)]
+#[derive(Debug, Clone)]
 struct Connection {
     pub peer_id: PeerId,
     pub id: ConnectionId,
@@ -55,7 +55,7 @@ struct Connection {
     pub rtt: Option<[Duration; 3]>,
 }

-#[derive(Debug)]
+#[derive(Debug, Clone)]
 enum Candidate {
     Pending,
     Unsupported,
@@ -269,27 +269,41 @@ impl Behaviour {
             return;
         }

-        let mut blacklist = Vec::new();
+        let mut temp_connections = connections.clone();
         let mut rng = rand::thread_rng();
         let connection = loop {
-            let connection = connections
-                .choose_mut(&mut rng)
-                .expect("Connections to be available");
-
-            if blacklist.contains(&connection.id) {
-                continue;
+            if temp_connections.is_empty() {
+                self.events
+                    .push_back(ToSwarm::GenerateEvent(Event::ReservationFailure {
+                        peer_id,
+                        result: Box::new(std::io::Error::new(
+                            std::io::ErrorKind::Other,
+                            "no qualified connections available",
+                        )),
+                    }));
+                return;
             }

+            let connection = temp_connections
+                .choose(&mut rng)
+                .cloned()
+                .expect("Connection available");
+
             if let Candidate::Confirmed {
                 listener_id: Some(_),
                 ..
             } = connection.candidacy
             {
-                blacklist.push(connection.id);
+                // If the candidate is confirmed, remove it from the temporary connection
+                // and try another random connection, if any is available
+                temp_connections.retain(|c| c.id != connection.id);
                 continue;
             }

-            break connection;
+            break connections
+                .iter_mut()
+                .find(|c| c.id == connection.id)
+                .expect("Connection available");
         };

         if matches!(connection.candidacy, Candidate::Pending) {

That way if all possible relays are are depleted we emit an error to swarm. Long term, this function should return an error instead.

Thoughts?

ligustah commented 1 month ago

Thanks for the quick reply!

Having an escape hatch like that seems like the way to go. Thinking long term, if I may, I'd suggest getting rid of this functionality or at least making it optional. I've only skimmed the code so far, but I'm assuming the direction you are taking would be more towards fully automated relay selection, right?

Basically, include the behavior and whenever a relay is identified add it to the list and if autonat determines we are behind a router cycle through the relays as needed. Or if depending on autonat and identify is undesirable, reduce the API surface to basically: add_relay_address_candidate and enable_relay/disable_relay. From reading the code it seems like that's what you were aiming for and I think something like that would be great.

Again, thanks for being responsive and sharing your library : )