eclipse-zenoh / zenoh

zenoh unifies data in motion, data in-use, data at rest and computations. It carefully blends traditional pub/sub with geo-distributed storages, queries and computations, while retaining a level of time and space efficiency that is well beyond any of the mainstream stacks.
https://zenoh.io
Other
1.5k stars 160 forks source link

Closing Quic session is inconsistent #1019

Open oteffahi opened 6 months ago

oteffahi commented 6 months ago

Describe the bug

When closing a Zenoh session, it is expected that the socket is available after the session.close() call returns. This should be observed independently of which Link protocol is used by the session. This is not the case sometimes when using Quic.

Testing locally has shown this to be true on TCP, UDP and TLS. However, sequentially closing and opening a router/peer Quic session leads to panic due to the socket being unavailable. This only occurs when the session in question has been used (i.e. a client or peer has established a connection on that socket), this case is therefore not covered by this test which only opens one router session and alternates open/close of client sessions that connect to it.

To reproduce

#[tokio::main]
async fn main() {

  {...Initializations and config...}

  println!("Opening Router Session...");
  let router = zenoh::open(router_config.clone())
      .res_async()
      .await
      .unwrap();
  println!("Opening Client Session...");
  let client = zenoh::open(client_config.clone())
      .res_async()
      .await
      .unwrap();
  println!("Closing Client Session...");
  client.close().res_async().await.unwrap();
  println!("Closing Router Session...");
  router.close().res_async().await.unwrap();

  println!("Reopening Router Session...");
  // This will panic most of the time if the Transport is using Quic unicast Link
  // It does not panic when using TLS, TCP or UDP links (tested locally)
  let s2 = zenoh::open(router_config.clone())
      .res_async()
      .await
      .unwrap();
}

System info

MacBook Air, Mac OS, Apple Silicon M2 Zenoh 0.11.0-dev

Mallets commented 5 months ago

Can you try adding a task::sleep(Duration:from_secs(1) after the router.close()? It's not uncommon that the API returns but the system/library still takes a bit to complete the cleanup under the hood.

oteffahi commented 5 months ago

Can you try adding a task::sleep(Duration:from_secs(1) after the router.close()? It's not uncommon that the API returns but the system/library still takes a bit to complete the cleanup under the hood.

Even with a sleep, the time it takes to free the socket is inconsistent and can take more than five seconds in some unlucky runs.

Mallets commented 5 months ago

This test verifies that links can be open/closed in a loop. I.e. very similar to the scenario you are describing. Do you observe the same behaviour if you run the test manually?

oteffahi commented 5 months ago

This test verifies that links can be open/closed in a loop. I.e. very similar to the scenario you are describing. Do you observe the same behaviour if you run the test manually?

Adding and removing the listener, or opening and closing the listener's session alone is not enough to trigger this issue. In my tests I am able to replicate this only when some other session is created and connected to the listener endpoint, as demonstrated in the example I gave above. EDIT: I realize I didn't give the configs in my example. The router is the listener, and the client connects to it.