eclipse-ecal / ecal

📦 eCAL - enhanced Communication Abstraction Layer. A high performance publish-subscribe, client-server cross-plattform middleware.
https://ecal.io
Apache License 2.0
842 stars 174 forks source link

SHM Acks degrade over time #1194

Closed FlorianReimold closed 1 year ago

FlorianReimold commented 1 year ago

Problem Description

eCAL seems to not handle SHM Acknowledgements properly. Even in a system that has enough resources, everything seems to degrade over time and the publisher will run into SHM ack timouts.

How to reproduce

  1. Set Ack timeout to 500 in ecal.ini
  2. Compile the ecal-perftool
  3. Start 1 publisher
    ecal_perftool pub topic 25 1024000
  4. Start 4 susbscribers
    ecal_perftool sub topic 10
  5. Start 5 subscribers that slow down the system a little bit, but not too much
    ecal_perftool sub topic 100

The publishing freq should be 10 Hz (due to the 100ms sub, that slows everything down), but after some time it will go down to 2Hz, which is the ack timeout: image

How did you get eCAL?

Ubuntu PPA (apt-get)

Environment

eCAL 5.11.5 Ubuntu 20.04

eCAL System Information

------------------------- SYSTEM ---------------------------------
Version                  : v5.11.5 (2023-07-24 12:19:00 +0200)
Platform                 : linux

------------------------- CONFIGURATION --------------------------
Default INI              : /etc/ecal/ecal.ini

------------------------- NETWORK --------------------------------
Host name                : florian-ubuntu20
Network mode             : local
Network ttl              : 2
Network sndbuf           : 5 MByte
Network rcvbuf           : 5 MByte
Multicast group          : 239.0.0.1
Multicast mask           : 0.0.0.15
Multicast ports          : 14000 - 14010
Multicast join all IFs   : off
Bandwidth limit (udp)    : not limited

------------------------- TIME -----------------------------------
Synchronization realtime : "ecaltime-localtime"
Synchronization replay   :
State                    :  synchronized
Master / Slave           :  Master
Status (Code)            : "everything is fine." (0)

------------------------- PUBLISHER LAYER DEFAULTS ---------------
Layer Mode INPROC        : off
Layer Mode SHM           : auto
Layer Mode TCP           : off
Layer Mode UDP MC        : auto

------------------------- SUBSCRIPTION LAYER DEFAULTS ------------
Layer Mode INPROC        : on
Layer Mode SHM           : on
Layer Mode TCP           : on
Layer Mode UDP MC        : on
Npcap UDP Reciever       : off
FlorianReimold commented 1 year ago

Current (workaround?) branch is here: https://github.com/eclipse-ecal/ecal/tree/hotfix/memfile_ack_timeout_5_11

Currently, this branch only increases the internal timeout making it more unlikely to miss it.

FlorianReimold commented 1 year ago

No idea why, but this is what has happened. Could be the testing tool ecal_perftool, not sure, yet. But for sure, the ack_timeout also doesn not what it is supposed to do.

As always, the issue created itself after some time.

The test was performed with the current HEAD of https://github.com/eclipse-ecal/ecal/tree/hotfix/shm_data_drop This shows, that there still is an issue. We haven't fixed it, but we have relieved the effect.

1 x publisher

./ecal_perftool pub topic 25 1024000

[  4355.032] | cnt:      9 | mean_loop_dt (ms): 101.478 | loop_freq (Hz):     9.9 | mean_snd_dt (ms): 101.401
[  4356.033] | cnt:     10 | mean_loop_dt (ms): 101.202 | loop_freq (Hz):     9.9 | mean_snd_dt (ms): 101.153
[  4357.033] | cnt:     10 | mean_loop_dt (ms): 101.278 | loop_freq (Hz):     9.9 | mean_snd_dt (ms): 101.333
[  4358.034] | cnt:     10 | mean_loop_dt (ms): 101.408 | loop_freq (Hz):     9.9 | mean_snd_dt (ms): 101.473
[  4359.122] | cnt:      5 | mean_loop_dt (ms): 101.102 | loop_freq (Hz):     9.9 | mean_snd_dt (ms): 101.182
[  4360.123] | cnt:     14 | mean_loop_dt (ms): 121.973 | loop_freq (Hz):     8.2 | mean_snd_dt (ms):  92.370
[  4361.124] | cnt:     13 | mean_loop_dt (ms):  78.496 | loop_freq (Hz):    12.7 | mean_snd_dt (ms):  49.202
[  4362.124] | cnt:     14 | mean_loop_dt (ms):  75.455 | loop_freq (Hz):    13.3 | mean_snd_dt (ms):  46.485
[  4363.124] | cnt:     13 | mean_loop_dt (ms):  78.439 | loop_freq (Hz):    12.7 | mean_snd_dt (ms):  49.053
[  4364.125] | cnt:     14 | mean_loop_dt (ms):  75.445 | loop_freq (Hz):    13.3 | mean_snd_dt (ms):  46.402

4 x 10ms subscriber:

./ecal_perftool sub topic 10

[  4355.032] | cnt:     10 | lost:      0 | msg_dt_mean (ms): 101.383 | msg_freq (Hz):     9.9
[  4356.033] | cnt:     10 | lost:      0 | msg_dt_mean (ms): 101.252 | msg_freq (Hz):     9.9
[  4357.033] | cnt:     10 | lost:      0 | msg_dt_mean (ms): 101.317 | msg_freq (Hz):     9.9
[  4358.034] | cnt:     10 | lost:      0 | msg_dt_mean (ms): 101.439 | msg_freq (Hz):     9.9
[  4359.124] | cnt:      5 | lost:      0 | msg_dt_mean (ms): 101.091 | msg_freq (Hz):     9.9
[  4360.124] | cnt:     14 | lost:      0 | msg_dt_mean (ms):  75.396 | msg_freq (Hz):    13.3
[  4361.124] | cnt:     13 | lost:      0 | msg_dt_mean (ms):  78.415 | msg_freq (Hz):    12.8
[  4362.125] | cnt:     14 | lost:      0 | msg_dt_mean (ms):  75.511 | msg_freq (Hz):    13.2
[  4363.125] | cnt:     13 | lost:      0 | msg_dt_mean (ms):  78.357 | msg_freq (Hz):    12.8
[  4364.125] | cnt:     14 | lost:      0 | msg_dt_mean (ms):  75.406 | msg_freq (Hz):    13.3

1 x 10ms subscriber that behaves substantially different and suddenly received more than what is published!

./ecal_perftool sub topic 10

[  4355.365] | cnt:     10 | lost:      0 | msg_dt_mean (ms): 101.324 | msg_freq (Hz):     9.9
[  4356.365] | cnt:     10 | lost:      0 | msg_dt_mean (ms): 101.035 | msg_freq (Hz):     9.9
[  4357.366] | cnt:      9 | lost:      0 | msg_dt_mean (ms): 101.414 | msg_freq (Hz):     9.9
[  4358.367] | cnt:     10 | lost:      0 | msg_dt_mean (ms): 101.375 | msg_freq (Hz):     9.9
[  4359.367] | cnt:      4 | lost:      0 | msg_dt_mean (ms): 262.172 | msg_freq (Hz):     3.8
[  4360.368] | cnt:     13 | lost:      0 | msg_dt_mean (ms):  40.071 | msg_freq (Hz):    25.0
[  4361.368] | cnt:     13 | lost:      0 | msg_dt_mean (ms):  40.004 | msg_freq (Hz):    25.0
[  4362.369] | cnt:     14 | lost:      0 | msg_dt_mean (ms):  40.013 | msg_freq (Hz):    25.0
[  4363.370] | cnt:     13 | lost:      0 | msg_dt_mean (ms):  40.063 | msg_freq (Hz):    25.0
[  4364.370] | cnt:     14 | lost:      0 | msg_dt_mean (ms):  40.060 | msg_freq (Hz):    25.0

1 x 100ms subscriber

./ecal_perftool sub topic 100

[  4355.032] | cnt:      9 | lost:      0 | msg_dt_mean (ms): 101.366 | msg_freq (Hz):     9.9
[  4356.033] | cnt:     10 | lost:      0 | msg_dt_mean (ms): 101.189 | msg_freq (Hz):     9.9
[  4357.033] | cnt:     10 | lost:      0 | msg_dt_mean (ms): 101.325 | msg_freq (Hz):     9.9
[  4358.034] | cnt:     10 | lost:      0 | msg_dt_mean (ms): 101.391 | msg_freq (Hz):     9.9
[  4359.124] | cnt:      6 | lost:      0 | msg_dt_mean (ms): 101.162 | msg_freq (Hz):     9.9
[  4360.124] | cnt:      6 | lost:      4 | msg_dt_mean (ms): 168.551 | msg_freq (Hz):     5.9
[  4361.124] | cnt:      6 | lost:      5 | msg_dt_mean (ms): 176.479 | msg_freq (Hz):     5.7
[  4362.125] | cnt:      6 | lost:      5 | msg_dt_mean (ms): 175.789 | msg_freq (Hz):     5.7
[  4363.125] | cnt:      7 | lost:      5 | msg_dt_mean (ms): 153.267 | msg_freq (Hz):     6.5
[  4364.125] | cnt:      6 | lost:      5 | msg_dt_mean (ms): 176.047 | msg_freq (Hz):     5.7

eCAL Mon shows a little different picture:

image

image

processes {
  rclock: 5483
  hname: "florian-ubuntu20"
  pid: 5279
  pname: "/home/florian/Projects/_build/release/ecal_perftool"
  uname: "ecal-perftool"
  pparam: "./ecal_perftool pub topic 25 1024000"
  pmemory: 435613696
  usrptime: -1
  datawrite: 14336000
  state {
  }
  tsync_state: tsync_realtime
  tsync_mod_name: "\"ecaltime-localtime\""
  component_init_state: 183
  component_init_info: "pub|sub|srv|log|time"
  ecal_runtime_version: "v5.12.0-nightly2-71-gac103991e"
  hgname: "florian-ubuntu20"
}
processes {
  rclock: 5483
  hname: "florian-ubuntu20"
  pid: 5290
  pname: "/home/florian/Projects/_build/release/ecal_perftool"
  uname: "ecal-perftool"
  pparam: "./ecal_perftool sub topic 10"
  pmemory: 585547776
  usrptime: -1
  dataread: 14336000
  state {
  }
  tsync_state: tsync_realtime
  tsync_mod_name: "\"ecaltime-localtime\""
  component_init_state: 183
  component_init_info: "pub|sub|srv|log|time"
  ecal_runtime_version: "v5.12.0-nightly2-71-gac103991e"
  hgname: "florian-ubuntu20"
}
processes {
  rclock: 5483
  hname: "florian-ubuntu20"
  pid: 5303
  pname: "/home/florian/Projects/_build/release/ecal_perftool"
  uname: "ecal-perftool"
  pparam: "./ecal_perftool sub topic 10"
  pmemory: 510046208
  usrptime: -1
  dataread: 14336000
  state {
  }
  tsync_state: tsync_realtime
  tsync_mod_name: "\"ecaltime-localtime\""
  component_init_state: 183
  component_init_info: "pub|sub|srv|log|time"
  ecal_runtime_version: "v5.12.0-nightly2-71-gac103991e"
  hgname: "florian-ubuntu20"
}
processes {
  rclock: 5483
  hname: "florian-ubuntu20"
  pid: 5315
  pname: "/home/florian/Projects/_build/release/ecal_perftool"
  uname: "ecal-perftool"
  pparam: "./ecal_perftool sub topic 10"
  pmemory: 510046208
  usrptime: -1
  dataread: 14336000
  state {
  }
  tsync_state: tsync_realtime
  tsync_mod_name: "\"ecaltime-localtime\""
  component_init_state: 183
  component_init_info: "pub|sub|srv|log|time"
  ecal_runtime_version: "v5.12.0-nightly2-71-gac103991e"
  hgname: "florian-ubuntu20"
}
processes {
  rclock: 5483
  hname: "florian-ubuntu20"
  pid: 5327
  pname: "/home/florian/Projects/_build/release/ecal_perftool"
  uname: "ecal-perftool"
  pparam: "./ecal_perftool sub topic 10"
  pmemory: 510046208
  usrptime: -1
  dataread: 14336000
  state {
  }
  tsync_state: tsync_realtime
  tsync_mod_name: "\"ecaltime-localtime\""
  component_init_state: 183
  component_init_info: "pub|sub|srv|log|time"
  ecal_runtime_version: "v5.12.0-nightly2-71-gac103991e"
  hgname: "florian-ubuntu20"
}
processes {
  rclock: 5482
  hname: "florian-ubuntu20"
  pid: 5353
  pname: "/home/florian/Projects/_build/release/ecal_perftool"
  uname: "ecal-perftool"
  pparam: "./ecal_perftool sub topic 10"
  pmemory: 510046208
  usrptime: -1
  dataread: 14336000
  state {
  }
  tsync_state: tsync_realtime
  tsync_mod_name: "\"ecaltime-localtime\""
  component_init_state: 183
  component_init_info: "pub|sub|srv|log|time"
  ecal_runtime_version: "v5.12.0-nightly2-71-gac103991e"
  hgname: "florian-ubuntu20"
}
processes {
  rclock: 5483
  hname: "florian-ubuntu20"
  pid: 5373
  pname: "/home/florian/Projects/_build/release/ecal_perftool"
  uname: "ecal-perftool"
  pparam: "./ecal_perftool sub topic 100"
  pmemory: 510046208
  usrptime: -1
  dataread: 6144000
  state {
  }
  tsync_state: tsync_realtime
  tsync_mod_name: "\"ecaltime-localtime\""
  component_init_state: 183
  component_init_info: "pub|sub|srv|log|time"
  ecal_runtime_version: "v5.12.0-nightly2-71-gac103991e"
  hgname: "florian-ubuntu20"
}
processes {
  rclock: 5483
  hname: "florian-ubuntu20"
  pid: 5436
  pname: "/usr/bin/ecal_mon_gui-5.12.0"
  uname: "eCALMon"
  pparam: "/usr/bin/ecal_mon_gui"
  pmemory: 934182912
  usrptime: -1
  state {
    severity: proc_sev_healthy
    info: "Running"
  }
  tsync_state: tsync_realtime
  tsync_mod_name: "\"ecaltime-localtime\""
  component_init_state: 191
  component_init_info: "pub|sub|srv|mon|log|time"
  ecal_runtime_version: "v5.12.0-nightly2-71-gac103991e"
  hgname: "florian-ubuntu20"
}
topics {
  rclock: 5483
  hname: "florian-ubuntu20"
  pid: 5279
  pname: "/home/florian/Projects/_build/release/ecal_perftool"
  uname: "ecal-perftool"
  tid: "1624945319207"
  tname: "topic"
  direction: "publisher"
  tlayer {
    type: tl_ecal_shm
    confirmed: true
  }
  tsize: 1024000
  connections_loc: 6
  dclock: 65106
  dfreq: 14000
  hgname: "florian-ubuntu20"
  tdatatype {
  }
}
topics {
  rclock: 5483
  hname: "florian-ubuntu20"
  pid: 5290
  pname: "/home/florian/Projects/_build/release/ecal_perftool"
  uname: "ecal-perftool"
  tid: "1628043986590"
  tname: "topic"
  direction: "subscriber"
  tlayer {
    type: tl_ecal_shm
    confirmed: true
  }
  tsize: 1024000
  connections_loc: 1
  dclock: 65003
  dfreq: 14000
  hgname: "florian-ubuntu20"
  tdatatype {
  }
}
topics {
  rclock: 5483
  hname: "florian-ubuntu20"
  pid: 5303
  pname: "/home/florian/Projects/_build/release/ecal_perftool"
  uname: "ecal-perftool"
  tid: "1629142832302"
  tname: "topic"
  direction: "subscriber"
  tlayer {
    type: tl_ecal_shm
    confirmed: true
  }
  tsize: 1024000
  connections_loc: 1
  dclock: 64976
  dfreq: 14000
  hgname: "florian-ubuntu20"
  tdatatype {
  }
}
topics {
  rclock: 5483
  hname: "florian-ubuntu20"
  pid: 5315
  pname: "/home/florian/Projects/_build/release/ecal_perftool"
  uname: "ecal-perftool"
  tid: "1630083352274"
  tname: "topic"
  direction: "subscriber"
  tlayer {
    type: tl_ecal_shm
    confirmed: true
  }
  tsize: 1024000
  connections_loc: 1
  dclock: 64952
  dfreq: 14000
  hgname: "florian-ubuntu20"
  tdatatype {
  }
}
topics {
  rclock: 5483
  hname: "florian-ubuntu20"
  pid: 5327
  pname: "/home/florian/Projects/_build/release/ecal_perftool"
  uname: "ecal-perftool"
  tid: "1631091796732"
  tname: "topic"
  direction: "subscriber"
  tlayer {
    type: tl_ecal_shm
    confirmed: true
  }
  tsize: 1024000
  connections_loc: 1
  dclock: 64943
  dfreq: 14000
  hgname: "florian-ubuntu20"
  tdatatype {
  }
}
topics {
  rclock: 5482
  hname: "florian-ubuntu20"
  pid: 5353
  pname: "/home/florian/Projects/_build/release/ecal_perftool"
  uname: "ecal-perftool"
  tid: "1645633128470"
  tname: "topic"
  direction: "subscriber"
  tlayer {
    type: tl_ecal_shm
    confirmed: true
  }
  tsize: 1024000
  connections_loc: 1
  dclock: 64574
  dfreq: 14000
  hgname: "florian-ubuntu20"
  tdatatype {
  }
}
topics {
  rclock: 5483
  hname: "florian-ubuntu20"
  pid: 5373
  pname: "/home/florian/Projects/_build/release/ecal_perftool"
  uname: "ecal-perftool"
  tid: "1654095462610"
  tname: "topic"
  direction: "subscriber"
  tlayer {
    type: tl_ecal_shm
    confirmed: true
  }
  tsize: 1024000
  connections_loc: 1
  message_drops: 33478
  dclock: 44831
  dfreq: 6000
  hgname: "florian-ubuntu20"
  tdatatype {
  }
}
FlorianReimold commented 1 year ago

So I now have a way of immediately triggering the issue. The setup is as follows:

The only thing we have to do is make the subscriber miss the ack timeout once. After missing it once, the entire ack feature breaks down entirely and never recovers.

  1. ecal_perftool pub topic 25 1024
  2. ecal_perftool sub topic 100 --hickup 1500 550

Now watch the eCAL ack feature breaking. I think we can say, that SHM acks are pretty much useless on Ubuntu at the moment, as only 1 missed ack destroys the entire feature. I don't have a solution, yet, but on Windows it doesn't show the same effect, so there must be some solution.