`rch::mpsc::Distributor` leads to easy loss of messages

robinhundt commented 2 years ago

First, thanks for this great crate! It seems like it is exactly the abstraction I needed.

I'm currently refactoring some code to using remoc. However, one of my test cases was spuriously failing. I tracked it down to my usage of the Distributor API. A very simplified example looks like this:

  use remoc::rch::mpsc::{self, channel};

  pub type Sender<T> = mpsc::Sender<T, remoc::codec::Bincode>;
  pub type Receiver<T> = mpsc::Receiver<T, remoc::codec::Bincode>;

  async fn send_task(sender: Sender<()>) {
      for _ in 0..16 {
          sender.send(()).await.unwrap();
      }
  }

  async fn receive_task(receiver: Receiver<()>) {
      let distributor = receiver.distribute(false);
      for _ in 0..16 {
          let (mut receiver, handle) = distributor.subscribe().await.unwrap();
          receiver.recv().await.unwrap().unwrap();
          dbg!("Received value");
          handle.remove();
      }
  }

  #[cfg(test)]
  mod tests {
      use super::*;

      #[tokio::test]
      async fn lost_messages() {
          let (sender, receiver) = channel(64);
          tokio::join!(send_task(sender), receive_task(receiver));
      }
  }

The test case fails with a panic on the second unwrap of receiver.recv().await.unwrap().unwrap(); in the receive_task after a few iterations. I've looked at the source code and it seems, that the distributor continuously tries to get a permit for one of the DistributedReceivers and then recvs a message on the original receiver which is sent to the permit.

I expected the Distributor to effectively turn the mpsc channel into a mpmc channel. But the behaviour is different in a crucial way. With an mpmc channel like the one offered by async_channel items can't be lost by receivers for which recv is not called.

Example of using async_channel

```rust use async_channel::{Receiver, Sender}; async fn send_task(sender: Sender<()>) { for _ in 0..16 { sender.send(()).await.unwrap(); } } async fn receive_task(receiver: Receiver<()>) { for _ in 0..16 { let receiver_clone = receiver.clone(); receiver_clone.recv().await.unwrap(); dbg!("Received value"); } } #[cfg(test)] mod tests { use super::*; #[tokio::test] async fn no_lost_messages() { let (sender, receiver) = async_channel::bounded(64); tokio::join!(send_task(sender), receive_task(receiver)); } } ```

I'm not sure how or if this behaviour could be changed. If this is the expected behaviour of the distributor, it should probably be clarified in the docs.

I've created a repo here with the code samples to reproduce.

surban commented 2 years ago

Yes, your observation is right. The distributor does not take any measures to ensure that a sent message is indeed received.

We could provide that guarantee in Remoc: for example, the distributor could expect a confirmation message from the receiver and resend the original message to another receiver if it was lost. However, in a distributed system there would still be no guarantee that the received message has been handled (by the program using Remoc); indeed the process receiving the message might crash immediately after receiving and confirming reception of the message.

Thus my position on this issue is that message confirmation must involve some user code, i.e. the message receiver should confirm reception of the message after it has received and handled it. One possibility to do so is to put a oneshot sender of unit data type into each message you send and use it to confirm handling of the received message.

Since async_channel is only handling communication within a single process, they don't have to deal with this issue. I.e. their implicit assumption is that a crash (panic, process termination, etc.) in the receiver will take the sender along with it.

If you have a clever solution for this issue I'd love to hear it. Otherwise I will document this as intended behavior.

BTW, what is your intended use case? Sounds like you need a distributed work queue of some sort. We could think about introducing some functionality for that if it is sufficiently general.

surban commented 1 year ago

Closing since the observed behavior is by-design and intended.

ENQT-GmbH / remoc

`rch::mpsc::Distributor` leads to easy loss of messages #13