Closed: teschmitt closed this issue 10 months ago
Thank you for taking the time to report this issue and to provide an easy-to-use minimal scenario. But I'm not sure how to reproduce your errors. When I run this scenario on my machine with M=1000, I get the following output:
--------------------------------------------------------------------
SIMULATION RESULTS
--------------------------------------------------------------------
Found connection errors on n2: 0
Message stats:
node | sent | recvd
n1 | 1000 | 1000
n2 | ---- | 2000
n3 | 1000 | 1000
I did have to add cp /shared/bin/* /usr/local/bin to the pre hook in your experiment.conf; without this, running dtnsend should never even be possible, as the binary would be missing from the coreemu-lab image.
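Concretely, that amounts to something like the following in experiment.conf. The hook function name below is only a sketch of the config format, not verbatim; the cp line itself is the actual change:

pre() {
    # make the dtn7 binaries from the shared volume available on PATH
    cp /shared/bin/* /usr/local/bin
}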
For M=5000 I get these results:
--------------------------------------------------------------------
SIMULATION RESULTS
--------------------------------------------------------------------
Found connection errors on n2: 0
Message stats:
node | sent | recvd
n1 | 5000 | 5000
n2 | ---- | 10000
n3 | 5000 | 5000
There might be an issue on your local machine running the Docker container. Did you try this on multiple machines?
OK, I've cross-checked this issue on another machine (a Linux VM running on an M1 MacBook) and I could not reproduce the error until I loaded the ebtables kernel module. As a matter of fact, this module was also loaded on the machine I originally encountered this error on. Often, but not always, all bundles are transmitted successfully, yet there are always connection errors in the dtnd logs. Taking a look at the logs, I can see that these errors cause long pauses in transmission. Might this be some sort of feud between ebtables and dtn7-rs?
Here is the Dockerfile I used to run these experiments:
# stage 1: build the dtn7-rs binaries from a pinned revision
FROM rust:1.62.1 as builder
WORKDIR /root
RUN cargo install --locked --bins --examples --root /usr/local --git https://github.com/dtn7/dtn7-rs --rev 0bd550ce dtn7

# stage 2: copy the binaries into the coreemu-lab image
FROM gh0st42/coreemu-lab:1.0.0
COPY --from=builder /usr/local/bin/* /usr/local/bin/
RUN echo "export USER=root" >> /root/.bashrc
ENV USER root
# ssh, VNC, and CORE gRPC ports
EXPOSE 22
EXPOSE 5901
EXPOSE 50051
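Building and running it is unremarkable; something along these lines, with CORE needing extended privileges inside Docker (the image tag and volume layout here are assumptions, based only on the /shared path used above):

docker build -t dtn7-clab .
docker run --rm -it --privileged -v "$(pwd):/shared" -p 5901:5901 -p 50051:50051 dtn7-clab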
EDIT: both ebtables and sch_netem are loaded when the errors crop up:
$ lsmod | grep -E "ebtables|sch_netem"
ebtables 45056 1 ebtable_filter
sch_netem 20480 0
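To test the ebtables hypothesis, the module can be unloaded temporarily. Since ebtable_filter holds a reference to ebtables (see the lsmod output above), it has to be removed first:

sudo modprobe -r ebtable_filter
sudo modprobe -r ebtables
lsmod | grep -E "ebtables|sch_netem"   # ebtables should no longer be listed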
As I cannot easily reproduce the problem and it does not happen on other machines, I will close the issue now. If you gain new insights or can reproduce the bug, please reopen the issue.
I am using the coreemu-lab to simulate a scenario involving three nodes, using one node to ferry bundles between the other two. All nodes run dtnd with the same arguments.
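The exact invocation ships with the attached scenario setup; roughly, it follows the pattern from the dtn7-rs README. The node ID, routing strategy, CLA, and endpoint name below are placeholders, not my actual arguments:

# per-node daemon: epidemic routing over the minimal TCP CLA (placeholders)
dtnd -n n1 -r epidemic -C mtcp -e mygroup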
Assume there are three nodes n1 (IP: 10.0.0.1), n2 (10.0.0.2), and n3 (10.0.0.3), of which n1 and n3 are pre-loaded with a certain number of messages (M) for the group endpoint that is shared by all nodes.
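The pre-loading boils down to a loop around dtnsend; the actual script is part of the attached setup. As a sketch, with the group address and flags below being placeholders:

# queue M bundles for the shared group endpoint before the run starts
for i in $(seq 1 "$NUMMSGS"); do
    echo "message $i" | dtnsend -r dtn://global/~mygroup
done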
Initially, all nodes have no connection between them. After 15 seconds, n2 moves into range of n1 and receives M bundles. After a further 15 seconds, n2 moves into range of n3, where it stays until the end of the simulation at T+120s. There, it should receive M bundles from n3 and forward the M bundles originating from n1 to n3.
Depending on M, n2 will exhibit faulty behavior. E.g. with M=1000, the bundle transfer stats come up short: a look at the dtnd logs from n2 and n3 shows that after neighbor discovery at about T+33s, n2 sends all bundles originating from n1 to n3, but only processes 67 bundles from n3 and then just idles until the end of the experiment. Logs on n3 show that all bundles have been sent to n2.

With M=5000, connection errors start popping up as n2 tries to forward bundles to n1 at 10.0.0.1 long after it has gone out of range. This causes long stalls of about 35 to 50 seconds in the sending process, during which dtnd freezes; the stall duration can be read off the timestamps of two consecutive log entries in the attached logs. So the idling until the end of the experiment with M=1000 could actually just be a stall that gets interrupted because the experiment has run out of time. Weirdly enough, the M=5000 setup actually sees all bundles transferred completely, even in the presence of the errors and stalls.
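For anyone digging into the logs, the stalls are easiest to spot by pulling the timestamps of the error lines and looking for large gaps between them; schematically (the log file name and match pattern are placeholders):

# print timestamps of connection-error lines; big gaps between rows mark stalls
grep -i "error" n2_dtnd.log | awk '{ print $1 }'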
I've attached the scenario setup; M can be regulated through NUMMSGS in experiment.conf. Also included are the logs referenced in this issue: connerr.zip
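In experiment.conf that knob is just a variable assignment, assuming the shell-style config format (the value here is only an example):

NUMMSGS=1000   # number of messages pre-loaded on n1 and n3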