Closed FlorianReimold closed 1 year ago
Current (workaround?) branch is here: https://github.com/eclipse-ecal/ecal/tree/hotfix/memfile_ack_timeout_5_11
Currently, this branch only increases the internal timeout making it more unlikely to miss it.
No idea why, but this is what has happened. Could be the testing tool ecal_perftool
, not sure, yet. But for sure, the ack_timeout
also doesn not what it is supposed to do.
As always, the issue created itself after some time.
The test was performed with the current HEAD of https://github.com/eclipse-ecal/ecal/tree/hotfix/shm_data_drop This shows, that there still is an issue. We haven't fixed it, but we have relieved the effect.
1 x publisher
./ecal_perftool pub topic 25 1024000
[ 4355.032] | cnt: 9 | mean_loop_dt (ms): 101.478 | loop_freq (Hz): 9.9 | mean_snd_dt (ms): 101.401
[ 4356.033] | cnt: 10 | mean_loop_dt (ms): 101.202 | loop_freq (Hz): 9.9 | mean_snd_dt (ms): 101.153
[ 4357.033] | cnt: 10 | mean_loop_dt (ms): 101.278 | loop_freq (Hz): 9.9 | mean_snd_dt (ms): 101.333
[ 4358.034] | cnt: 10 | mean_loop_dt (ms): 101.408 | loop_freq (Hz): 9.9 | mean_snd_dt (ms): 101.473
[ 4359.122] | cnt: 5 | mean_loop_dt (ms): 101.102 | loop_freq (Hz): 9.9 | mean_snd_dt (ms): 101.182
[ 4360.123] | cnt: 14 | mean_loop_dt (ms): 121.973 | loop_freq (Hz): 8.2 | mean_snd_dt (ms): 92.370
[ 4361.124] | cnt: 13 | mean_loop_dt (ms): 78.496 | loop_freq (Hz): 12.7 | mean_snd_dt (ms): 49.202
[ 4362.124] | cnt: 14 | mean_loop_dt (ms): 75.455 | loop_freq (Hz): 13.3 | mean_snd_dt (ms): 46.485
[ 4363.124] | cnt: 13 | mean_loop_dt (ms): 78.439 | loop_freq (Hz): 12.7 | mean_snd_dt (ms): 49.053
[ 4364.125] | cnt: 14 | mean_loop_dt (ms): 75.445 | loop_freq (Hz): 13.3 | mean_snd_dt (ms): 46.402
4 x 10ms subscriber:
./ecal_perftool sub topic 10
[ 4355.032] | cnt: 10 | lost: 0 | msg_dt_mean (ms): 101.383 | msg_freq (Hz): 9.9
[ 4356.033] | cnt: 10 | lost: 0 | msg_dt_mean (ms): 101.252 | msg_freq (Hz): 9.9
[ 4357.033] | cnt: 10 | lost: 0 | msg_dt_mean (ms): 101.317 | msg_freq (Hz): 9.9
[ 4358.034] | cnt: 10 | lost: 0 | msg_dt_mean (ms): 101.439 | msg_freq (Hz): 9.9
[ 4359.124] | cnt: 5 | lost: 0 | msg_dt_mean (ms): 101.091 | msg_freq (Hz): 9.9
[ 4360.124] | cnt: 14 | lost: 0 | msg_dt_mean (ms): 75.396 | msg_freq (Hz): 13.3
[ 4361.124] | cnt: 13 | lost: 0 | msg_dt_mean (ms): 78.415 | msg_freq (Hz): 12.8
[ 4362.125] | cnt: 14 | lost: 0 | msg_dt_mean (ms): 75.511 | msg_freq (Hz): 13.2
[ 4363.125] | cnt: 13 | lost: 0 | msg_dt_mean (ms): 78.357 | msg_freq (Hz): 12.8
[ 4364.125] | cnt: 14 | lost: 0 | msg_dt_mean (ms): 75.406 | msg_freq (Hz): 13.3
1 x 10ms subscriber that behaves substantially different and suddenly received more than what is published!
./ecal_perftool sub topic 10
[ 4355.365] | cnt: 10 | lost: 0 | msg_dt_mean (ms): 101.324 | msg_freq (Hz): 9.9
[ 4356.365] | cnt: 10 | lost: 0 | msg_dt_mean (ms): 101.035 | msg_freq (Hz): 9.9
[ 4357.366] | cnt: 9 | lost: 0 | msg_dt_mean (ms): 101.414 | msg_freq (Hz): 9.9
[ 4358.367] | cnt: 10 | lost: 0 | msg_dt_mean (ms): 101.375 | msg_freq (Hz): 9.9
[ 4359.367] | cnt: 4 | lost: 0 | msg_dt_mean (ms): 262.172 | msg_freq (Hz): 3.8
[ 4360.368] | cnt: 13 | lost: 0 | msg_dt_mean (ms): 40.071 | msg_freq (Hz): 25.0
[ 4361.368] | cnt: 13 | lost: 0 | msg_dt_mean (ms): 40.004 | msg_freq (Hz): 25.0
[ 4362.369] | cnt: 14 | lost: 0 | msg_dt_mean (ms): 40.013 | msg_freq (Hz): 25.0
[ 4363.370] | cnt: 13 | lost: 0 | msg_dt_mean (ms): 40.063 | msg_freq (Hz): 25.0
[ 4364.370] | cnt: 14 | lost: 0 | msg_dt_mean (ms): 40.060 | msg_freq (Hz): 25.0
1 x 100ms subscriber
./ecal_perftool sub topic 100
[ 4355.032] | cnt: 9 | lost: 0 | msg_dt_mean (ms): 101.366 | msg_freq (Hz): 9.9
[ 4356.033] | cnt: 10 | lost: 0 | msg_dt_mean (ms): 101.189 | msg_freq (Hz): 9.9
[ 4357.033] | cnt: 10 | lost: 0 | msg_dt_mean (ms): 101.325 | msg_freq (Hz): 9.9
[ 4358.034] | cnt: 10 | lost: 0 | msg_dt_mean (ms): 101.391 | msg_freq (Hz): 9.9
[ 4359.124] | cnt: 6 | lost: 0 | msg_dt_mean (ms): 101.162 | msg_freq (Hz): 9.9
[ 4360.124] | cnt: 6 | lost: 4 | msg_dt_mean (ms): 168.551 | msg_freq (Hz): 5.9
[ 4361.124] | cnt: 6 | lost: 5 | msg_dt_mean (ms): 176.479 | msg_freq (Hz): 5.7
[ 4362.125] | cnt: 6 | lost: 5 | msg_dt_mean (ms): 175.789 | msg_freq (Hz): 5.7
[ 4363.125] | cnt: 7 | lost: 5 | msg_dt_mean (ms): 153.267 | msg_freq (Hz): 6.5
[ 4364.125] | cnt: 6 | lost: 5 | msg_dt_mean (ms): 176.047 | msg_freq (Hz): 5.7
eCAL Mon shows a little different picture:
processes {
rclock: 5483
hname: "florian-ubuntu20"
pid: 5279
pname: "/home/florian/Projects/_build/release/ecal_perftool"
uname: "ecal-perftool"
pparam: "./ecal_perftool pub topic 25 1024000"
pmemory: 435613696
usrptime: -1
datawrite: 14336000
state {
}
tsync_state: tsync_realtime
tsync_mod_name: "\"ecaltime-localtime\""
component_init_state: 183
component_init_info: "pub|sub|srv|log|time"
ecal_runtime_version: "v5.12.0-nightly2-71-gac103991e"
hgname: "florian-ubuntu20"
}
processes {
rclock: 5483
hname: "florian-ubuntu20"
pid: 5290
pname: "/home/florian/Projects/_build/release/ecal_perftool"
uname: "ecal-perftool"
pparam: "./ecal_perftool sub topic 10"
pmemory: 585547776
usrptime: -1
dataread: 14336000
state {
}
tsync_state: tsync_realtime
tsync_mod_name: "\"ecaltime-localtime\""
component_init_state: 183
component_init_info: "pub|sub|srv|log|time"
ecal_runtime_version: "v5.12.0-nightly2-71-gac103991e"
hgname: "florian-ubuntu20"
}
processes {
rclock: 5483
hname: "florian-ubuntu20"
pid: 5303
pname: "/home/florian/Projects/_build/release/ecal_perftool"
uname: "ecal-perftool"
pparam: "./ecal_perftool sub topic 10"
pmemory: 510046208
usrptime: -1
dataread: 14336000
state {
}
tsync_state: tsync_realtime
tsync_mod_name: "\"ecaltime-localtime\""
component_init_state: 183
component_init_info: "pub|sub|srv|log|time"
ecal_runtime_version: "v5.12.0-nightly2-71-gac103991e"
hgname: "florian-ubuntu20"
}
processes {
rclock: 5483
hname: "florian-ubuntu20"
pid: 5315
pname: "/home/florian/Projects/_build/release/ecal_perftool"
uname: "ecal-perftool"
pparam: "./ecal_perftool sub topic 10"
pmemory: 510046208
usrptime: -1
dataread: 14336000
state {
}
tsync_state: tsync_realtime
tsync_mod_name: "\"ecaltime-localtime\""
component_init_state: 183
component_init_info: "pub|sub|srv|log|time"
ecal_runtime_version: "v5.12.0-nightly2-71-gac103991e"
hgname: "florian-ubuntu20"
}
processes {
rclock: 5483
hname: "florian-ubuntu20"
pid: 5327
pname: "/home/florian/Projects/_build/release/ecal_perftool"
uname: "ecal-perftool"
pparam: "./ecal_perftool sub topic 10"
pmemory: 510046208
usrptime: -1
dataread: 14336000
state {
}
tsync_state: tsync_realtime
tsync_mod_name: "\"ecaltime-localtime\""
component_init_state: 183
component_init_info: "pub|sub|srv|log|time"
ecal_runtime_version: "v5.12.0-nightly2-71-gac103991e"
hgname: "florian-ubuntu20"
}
processes {
rclock: 5482
hname: "florian-ubuntu20"
pid: 5353
pname: "/home/florian/Projects/_build/release/ecal_perftool"
uname: "ecal-perftool"
pparam: "./ecal_perftool sub topic 10"
pmemory: 510046208
usrptime: -1
dataread: 14336000
state {
}
tsync_state: tsync_realtime
tsync_mod_name: "\"ecaltime-localtime\""
component_init_state: 183
component_init_info: "pub|sub|srv|log|time"
ecal_runtime_version: "v5.12.0-nightly2-71-gac103991e"
hgname: "florian-ubuntu20"
}
processes {
rclock: 5483
hname: "florian-ubuntu20"
pid: 5373
pname: "/home/florian/Projects/_build/release/ecal_perftool"
uname: "ecal-perftool"
pparam: "./ecal_perftool sub topic 100"
pmemory: 510046208
usrptime: -1
dataread: 6144000
state {
}
tsync_state: tsync_realtime
tsync_mod_name: "\"ecaltime-localtime\""
component_init_state: 183
component_init_info: "pub|sub|srv|log|time"
ecal_runtime_version: "v5.12.0-nightly2-71-gac103991e"
hgname: "florian-ubuntu20"
}
processes {
rclock: 5483
hname: "florian-ubuntu20"
pid: 5436
pname: "/usr/bin/ecal_mon_gui-5.12.0"
uname: "eCALMon"
pparam: "/usr/bin/ecal_mon_gui"
pmemory: 934182912
usrptime: -1
state {
severity: proc_sev_healthy
info: "Running"
}
tsync_state: tsync_realtime
tsync_mod_name: "\"ecaltime-localtime\""
component_init_state: 191
component_init_info: "pub|sub|srv|mon|log|time"
ecal_runtime_version: "v5.12.0-nightly2-71-gac103991e"
hgname: "florian-ubuntu20"
}
topics {
rclock: 5483
hname: "florian-ubuntu20"
pid: 5279
pname: "/home/florian/Projects/_build/release/ecal_perftool"
uname: "ecal-perftool"
tid: "1624945319207"
tname: "topic"
direction: "publisher"
tlayer {
type: tl_ecal_shm
confirmed: true
}
tsize: 1024000
connections_loc: 6
dclock: 65106
dfreq: 14000
hgname: "florian-ubuntu20"
tdatatype {
}
}
topics {
rclock: 5483
hname: "florian-ubuntu20"
pid: 5290
pname: "/home/florian/Projects/_build/release/ecal_perftool"
uname: "ecal-perftool"
tid: "1628043986590"
tname: "topic"
direction: "subscriber"
tlayer {
type: tl_ecal_shm
confirmed: true
}
tsize: 1024000
connections_loc: 1
dclock: 65003
dfreq: 14000
hgname: "florian-ubuntu20"
tdatatype {
}
}
topics {
rclock: 5483
hname: "florian-ubuntu20"
pid: 5303
pname: "/home/florian/Projects/_build/release/ecal_perftool"
uname: "ecal-perftool"
tid: "1629142832302"
tname: "topic"
direction: "subscriber"
tlayer {
type: tl_ecal_shm
confirmed: true
}
tsize: 1024000
connections_loc: 1
dclock: 64976
dfreq: 14000
hgname: "florian-ubuntu20"
tdatatype {
}
}
topics {
rclock: 5483
hname: "florian-ubuntu20"
pid: 5315
pname: "/home/florian/Projects/_build/release/ecal_perftool"
uname: "ecal-perftool"
tid: "1630083352274"
tname: "topic"
direction: "subscriber"
tlayer {
type: tl_ecal_shm
confirmed: true
}
tsize: 1024000
connections_loc: 1
dclock: 64952
dfreq: 14000
hgname: "florian-ubuntu20"
tdatatype {
}
}
topics {
rclock: 5483
hname: "florian-ubuntu20"
pid: 5327
pname: "/home/florian/Projects/_build/release/ecal_perftool"
uname: "ecal-perftool"
tid: "1631091796732"
tname: "topic"
direction: "subscriber"
tlayer {
type: tl_ecal_shm
confirmed: true
}
tsize: 1024000
connections_loc: 1
dclock: 64943
dfreq: 14000
hgname: "florian-ubuntu20"
tdatatype {
}
}
topics {
rclock: 5482
hname: "florian-ubuntu20"
pid: 5353
pname: "/home/florian/Projects/_build/release/ecal_perftool"
uname: "ecal-perftool"
tid: "1645633128470"
tname: "topic"
direction: "subscriber"
tlayer {
type: tl_ecal_shm
confirmed: true
}
tsize: 1024000
connections_loc: 1
dclock: 64574
dfreq: 14000
hgname: "florian-ubuntu20"
tdatatype {
}
}
topics {
rclock: 5483
hname: "florian-ubuntu20"
pid: 5373
pname: "/home/florian/Projects/_build/release/ecal_perftool"
uname: "ecal-perftool"
tid: "1654095462610"
tname: "topic"
direction: "subscriber"
tlayer {
type: tl_ecal_shm
confirmed: true
}
tsize: 1024000
connections_loc: 1
message_drops: 33478
dclock: 44831
dfreq: 6000
hgname: "florian-ubuntu20"
tdatatype {
}
}
So I now have a way of immediately triggering the issue. The setup is as follows:
--hickup
support https://github.com/FlorianReimold/ecal-perftool/commit/66bfce4273e59d30be1175edab576010342a9e93ecal.ini
with ack timeout of 500The only thing we have to do is make the subscriber miss the ack timeout once. After missing it once, the entire ack feature breaks down entirely and never recovers.
ecal_perftool pub topic 25 1024
ecal_perftool sub topic 100 --hickup 1500 550
Now watch the eCAL ack feature breaking. I think we can say, that SHM acks are pretty much useless on Ubuntu at the moment, as only 1 missed ack destroys the entire feature. I don't have a solution, yet, but on Windows it doesn't show the same effect, so there must be some solution.
Problem Description
eCAL seems to not handle SHM Acknowledgements properly. Even in a system that has enough resources, everything seems to degrade over time and the publisher will run into SHM ack timouts.
A first analysis shows, that if the Subscriber is not able to get memfile access in 5ms (hardcoded!) It will omit sending the shm ack and let the publisher wait for the timeout.
https://github.com/eclipse-ecal/ecal/blob/95480add0f802fa9d8d89bb39eddb7cd184f7f7b/ecal/core/src/io/ecal_memfile_pool.cpp#L166
Also, the list of events hold by the publisher is never cleaned up, so when subscribers go away, the publisher will still try to receive their events.
How to reproduce
ecal.ini
The publishing freq should be 10 Hz (due to the 100ms sub, that slows everything down), but after some time it will go down to 2Hz, which is the ack timeout:
How did you get eCAL?
Ubuntu PPA (apt-get)
Environment
eCAL 5.11.5 Ubuntu 20.04
eCAL System Information