dtn7 / dtn7-rs

Rust implementation of a DTN based on RFC 9171

Unprocessed bundles and connection errors in 3-node scenario #45

Closed: teschmitt closed this issue 10 months ago

teschmitt commented 1 year ago

I am using coreemu-lab to simulate a three-node scenario in which one node ferries bundles between the other two. All nodes run dtnd with the same arguments:

dtnd --cla mtcp --nodeid $(hostname) --endpoint "dtn://this-is-our/~group" --interval 2s --janitor 3s --peer-timeout 4s

Assume there are three nodes n1 (IP: 10.0.0.1), n2 (10.0.0.2), and n3 (10.0.0.3), of which n1 and n3 are pre-loaded with a certain number of messages (M) addressed to the group endpoint shared by all nodes. Initially, none of the nodes have a connection to each other:

[n1]         [n3]

[n2]
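
The pre-loading itself is just a matter of handing M payloads to dtnd for the group endpoint. A minimal sketch of how that could look on n1 and n3, assuming the dtnsend from this revision accepts -r for the destination endpoint and reads the payload from stdin (flags may differ between versions):

M=1000
for i in $(seq 1 "$M"); do
    # placeholder payload; in the scenario, M is set via NUMMSGS in experiment.conf
    echo "message $i from $(hostname)" | dtnsend -r "dtn://this-is-our/~group"
done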

After 15 seconds, n2 moves into range of n1 and receives M bundles:

[n1]         [n3]
 |
[n2]

After a further 15 seconds, n2 moves into range of n3, where it stays until the end of the simulation at T+120s. Here, it should receive M bundles from n3 and forward the M bundles originating from n1 to n3:

[n1]   [n2]--[n3]

Depending on M, n2 exhibits faulty behavior. For example, for M=1000, we get the following bundle transfer stats:

node | sent | recvd
  n1 | 1000 |     0
  n2 | ---- |  1067
  n3 | 1000 |  1000

A look at the dtnd logs from n2 and n3 shows that, after neighbor discovery at about T+33s, n2 sends all bundles originating from n1 to n3 but only processes 67 bundles from n3 and then just idles until the end of the experiment. The logs on n3 show that all of its bundles have been sent to n2.
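
As a rough cross-check while a node is still running, the local bundle store can also be inspected with dtnquery (assuming this dtn7-rs revision ships it with bundles and peers subcommands; it queries the local dtnd instance):

dtnquery bundles    # list the bundle IDs currently in the local store
dtnquery peers      # list the peers this node currently knows about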

With M=5000, connection errors start popping up because n2 keeps trying to forward bundles to n1 at 10.0.0.1 long after it has gone out of range. This causes long stalls of about 35 to 50 seconds in the sending process, during which dtnd freezes. Here are two consecutive log entries whose timestamps show the stall duration:

...
2022-12-28T22:25:16.315Z INFO  dtn7::core::processing          > [inconspicuous log entry]
2022-12-28T22:26:24.574Z ERROR dtn7::cla::mtcp                 > Error connecting to remote 10.0.0.1:16162
...
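
The stalls are easy to spot automatically by comparing consecutive log timestamps. A small sketch, assuming GNU awk and that every log line starts with an ISO 8601 timestamp as in the excerpt above (n2_dtnd.log is a placeholder for the actual log file from connerr.zip):

grep -c "Error connecting" n2_dtnd.log    # count the mtcp connection errors

awk '{
    ts = substr($1, 1, 19)                # "2022-12-28T22:25:16"
    gsub(/[-T:]/, " ", ts)                # -> "2022 12 28 22 25 16"
    t = mktime(ts)                        # needs GNU awk
    if (prev != "" && t - prev > 30) printf "stall of %ds before: %s\n", t - prev, $0
    prev = t
}' n2_dtnd.log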

So the idling until the end of the experiment in the M=1000 case could actually just be a stall that gets cut off because the experiment runs out of time. Weirdly enough, the M=5000 setup actually sees all bundles transferred completely, even in the presence of the errors and stalls.

I've attached the scenario setup. M can be regulated through NUMMSGS in experiment.conf. Also included are the logs referenced in this issue: connerr.zip

gh0st42 commented 1 year ago

Thank you for taking the time to report this issue and to provide an easy-to-use minimal scenario.

But I'm not sure how to reproduce your errors. When I run this scenario on my machine with M=1000, I get the following output:

--------------------------------------------------------------------
SIMULATION RESULTS
--------------------------------------------------------------------
Found connection errors on n2: 0

Message stats:
node | sent | recvd
  n1 | 1000 |  1000
  n2 | ---- |  2000
  n3 | 1000 |  1000

I did have to add cp /shared/bin/* /usr/local/bin to the pre hook in your experiment.conf, but without this, dtnsend should never work at all, as the binary would be missing from the coreemu-lab image.

For M=5000 I get these results:

--------------------------------------------------------------------
SIMULATION RESULTS
--------------------------------------------------------------------
Found connection errors on n2: 0

Message stats:
node | sent | recvd
  n1 | 5000 |  5000
  n2 | ---- | 10000
  n3 | 5000 |  5000

There might be an issue on your local machine running the docker container. Did you try this on multiple machines?

teschmitt commented 1 year ago

OK, I've cross-checked this issue on another machine (a Linux VM running on an M1 MacBook), and I could not reproduce the error until I loaded the ebtables kernel module. As a matter of fact, this module was also loaded on the machine I originally encountered the error on.

Often, but not always, all bundles are transmitted successfully, yet there are always connection errors in the dtnd logs. Taking a look at the logs, I can see that these errors cause long pauses in transmission.

Might this be some sort of feud between ebtables and dtn7-rs?
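
For anyone trying to narrow this down: the suspected trigger can be toggled by hand between runs (requires root; unloading only works while nothing like ebtable_filter still references the module):

sudo modprobe ebtables       # load the module before a run that is expected to fail
sudo modprobe -r ebtables    # unload it again
lsmod | grep ebtables        # verify the current state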

Here is the Dockerfile I used to run these experiments:

FROM rust:1.62.1 as builder
WORKDIR /root
RUN cargo install --locked --bins --examples --root /usr/local --git https://github.com/dtn7/dtn7-rs --rev 0bd550ce dtn7

FROM gh0st42/coreemu-lab:1.0.0
COPY --from=builder /usr/local/bin/* /usr/local/bin/
RUN echo "export USER=root" >> /root/.bashrc
ENV USER root
EXPOSE 22
EXPOSE 5901
EXPOSE 50051
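
Building and starting the image is nothing special; a sketch (the image tag is arbitrary, and the --privileged flag plus the /shared volume mount are assumptions about how coreemu-lab expects to be run and where the scenario files end up, so adjust as needed):

docker build -t dtn7-clab .
docker run -it --rm --privileged -v "$(pwd)/shared:/shared" dtn7-clab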

EDIT: both ebtables and sch_netem are loaded when the errors crop up:

$ lsmod | grep -E "ebtables|sch_netem"
ebtables               45056  1 ebtable_filter
sch_netem              20480  0

gh0st42 commented 10 months ago

As I cannot easily reproduce the problem and it does not happen on other machines, I will close the issue now. If you get new insights or can reproduce the bug, please reopen the issue.