eclipse-ecal / ecal

📦 eCAL - enhanced Communication Abstraction Layer. A high-performance publish-subscribe, client-server cross-platform middleware.
https://ecal.io
Apache License 2.0

CSubscriber create deadlock #1628

Open kubbo opened 5 months ago

kubbo commented 5 months ago

Problem Description

Sometimes creating a CSubscriber gets stuck. The relevant stack traces are:

Thread 20 (Thread 0xffff5cff5900 (LWP 4756)):
#0  futex_abstimed_wait (private=0, abstime=0x0, clockid=0, expected=2, futex_word=<optimized out>) at ../sysdeps/nptl/futex-internal.h:284
#1  __pthread_rwlock_wrlock_full (abstime=0x0, clockid=0, rwlock=0xffff74001d30) at pthread_rwlock_common.c:830
#2  __GI___pthread_rwlock_wrlock (rwlock=0xffff74001d30) at pthread_rwlock_wrlock.c:27
#3  0x0000ffffa8303da4 in eCAL::CSubGate::Register(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::shared_ptr<eCAL::CDataReader> const&) () from /lib/aarch64-linux-gnu/libecal_core.so.5
#4  0x0000ffffa8306b80 in eCAL::CSubscriber::Create(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, eCAL::SDataTypeInformation const&) () from /lib/aarch64-linux-gnu/libecal_core.so.5
#5  0x0000ffffa8307540 in eCAL::CSubscriber::CSubscriber(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, eCAL::SDataTypeInformation const&) () from /lib/aarch64-linux-gnu/libecal_core.so.5
#6  0x0000ffffa83075b8 in eCAL::CSubscriber::CSubscriber(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /lib/aarch64-linux-gnu/libecal_core.so.5
#7  0x0000ffffaa94bf28 in ?? ()
#8  0x0000ffff50001030 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

Thread 13 (Thread 0xffff90f8b900 (LWP 4749)):
#0  __lll_lock_wait (futex=futex@entry=0xffff984b2000, private=128) at lowlevellock.c:52
#1  0x0000ffffa9fd9cd8 in __GI___pthread_mutex_lock (mutex=0xffff984b2000) at pthread_mutex_lock.c:80
#2  0x0000ffffa82b59e4 in eCAL::CNamedMutexImpl::Lock(long) () from /lib/aarch64-linux-gnu/libecal_core.so.5
#3  0x0000ffffa82b69b0 in eCAL::CMemoryFile::Create(char const*, bool, unsigned long, bool) () from /lib/aarch64-linux-gnu/libecal_core.so.5
#4  0x0000ffffa82bacc8 in eCAL::CMemFileObserver::Create(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /lib/aarch64-linux-gnu/libecal_core.so.5
#5  0x0000ffffa82bbf10 in eCAL::CMemFileThreadPool::ObserveFile(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int, std::function<unsigned long (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, char const*, unsigned long, long long, long long, long long, unsigned long)> const&) () from /lib/aarch64-linux-gnu/libecal_core.so.5
#6  0x0000ffffa831d888 in eCAL::CSHMReaderLayer::SetConnectionParameter(eCAL::SReaderLayerPar&) () from /lib/aarch64-linux-gnu/libecal_core.so.5
#7  0x0000ffffa830b608 in eCAL::CDataReader::ApplyLocLayerParameter(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, eCAL::pb::eTLayerType, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /lib/aarch64-linux-gnu/libecal_core.so.5
#8  0x0000ffffa8302d40 in eCAL::CSubGate::ApplyLocPubRegistration(eCAL::pb::Sample const&) () from /lib/aarch64-linux-gnu/libecal_core.so.5
#9  0x0000ffffa832f9f0 in eCAL::CRegistrationReceiver::ApplySample(eCAL::pb::Sample const&) () from /lib/aarch64-linux-gnu/libecal_core.so.5
#10 0x0000ffffa82d153c in eCAL::UDP::CSampleReceiver::Process(char const*, unsigned long) () from /lib/aarch64-linux-gnu/libecal_core.so.5
#11 0x0000ffffa82d05a0 in void eCAL::CCallbackThread::callbackFunction<std::chrono::duration<long, std::ratio<1l, 1000l> > >(std::chrono::duration<long, std::ratio<1l, 1000l> >) () from /lib/aarch64-linux-gnu/libecal_core.so.5
#12 0x0000ffffa86d8f9c in ?? () from /lib/aarch64-linux-gnu/libstdc++.so.6
#13 0x0000ffffa9fd7624 in start_thread (arg=0xffffa86d8f80) at pthread_create.c:477
#14 0x0000ffffa854662c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78

How to reproduce

I have no way to manually reproduce it.

How did you get eCAL?

Ubuntu PPA (apt-get)

Environment

eCAL Version: 5.13.0

eCAL System Information

; --------------------------------------------------
; NETWORK SETTINGS
; --------------------------------------------------
; network_enabled                  = true / false                  true  = all eCAL components communicate over network boundaries
;                                                                  false = local host only communication
;
; multicast_config_version         = v1 / v2                       UDP configuration version (Since eCAL 5.12.)
;                                                                    v1: default behavior
;                                                                    v2: new behavior, comes with a bit more intuitive handling regarding masking of the groups
; multicast_group                  = 239.0.0.1                     UDP multicast group base
;                                                                  All registration and logging is sent on this address
; multicast_mask                   = 0.0.0.1-0.0.0.255             v1: Mask maximum number of dynamic multicast group
;                                    255.0.0.0-255.255.255.255     v2: masks are now considered like routes masking
;
; multicast_port                   = 14000 + x                     UDP multicast port number (eCAL will use at least the 2 following port
;                                                                    numbers too, so please modify in steps of 10 (e.g. 1010, 1020 ...)
;
; multicast_ttl                    = 0 + x                         UDP ttl value, also known as hop limit, is used in determining 
;                                                                    the intermediate routers being traversed towards the destination
;
; multicast_sndbuf                 = 1024 * x                      UDP send buffer in bytes
;  
; multicast_rcvbuf                 = 1024 * x                      UDP receive buffer in bytes
;
; multicast_join_all_if            = false                         Linux specific setting to enable joining multicast groups on all network interfaces
;                                                                    independent of their link state. Enabling this makes sure that eCAL processes
;                                                                    receive data if they are started before network devices are up and running.
;  
; bandwidth_max_udp                = -1                            UDP bandwidth limit for eCAL udp layer (-1 == unlimited)
;  
; inproc_rec_enabled               = true                          Enable to receive on eCAL inner process layer
; shm_rec_enabled                  = true                          Enable to receive on eCAL shared memory layer
; udp_mc_rec_enabled               = true                          Enable to receive on eCAL udp multicast layer
;
; npcap_enabled                    = false                         Enable to receive UDP traffic with the Npcap based receiver
;
; tcp_pubsub_num_executor_reader   = 4                             Tcp_pubsub reader amount of threads that shall execute workload
; tcp_pubsub_num_executor_writer   = 4                             Tcp_pubsub writer amount of threads that shall execute workload
; tcp_pubsub_max_reconnections     = 5                             Tcp_pubsub number of reconnection attempts the session will make in
;                                                                    case of an issue (a negative value means infinite reconnection attempts)
;
; host_group_name                  =                               Common host group name that enables interprocess mechanisms across 
;                                                                    (virtual) host borders (e.g, Docker); by default equivalent to local host name
; --------------------------------------------------

[network]
network_enabled                    = false
multicast_config_version           = v1
multicast_group                    = 239.0.0.1
multicast_mask                     = 0.0.0.15
multicast_port                     = 14000
multicast_ttl                      = 2
multicast_sndbuf                   = 5242880
multicast_rcvbuf                   = 5242880

multicast_join_all_if              = false

bandwidth_max_udp                  = -1

inproc_rec_enabled                 = true
shm_rec_enabled                    = true
tcp_rec_enabled                    = true
udp_mc_rec_enabled                 = true

npcap_enabled                      = false

tcp_pubsub_num_executor_reader     = 4
tcp_pubsub_num_executor_writer     = 4
tcp_pubsub_max_reconnections       = 5

host_group_name                    =

; --------------------------------------------------
; COMMON SETTINGS
; --------------------------------------------------
; registration_timeout             = 60000                         Timeout for topic registration in ms (internal)
; registration_refresh             = 1000                          Topic registration refresh cycle (has to be smaller than registration timeout!)

; --------------------------------------------------
[common]
registration_timeout               = 60000
registration_refresh               = 10

; --------------------------------------------------
; TIME SETTINGS
; --------------------------------------------------
; timesync_module_rt               = "ecaltime-localtime"          Time synchronisation interface name (dynamic library)
;                                                                  The name will be extended with platform suffix (32|64), debug suffix (d) and platform extension (.dll|.so)
;
;                                                                  Available modules are:
;                                                                    - ecaltime-localtime    local system time without synchronization        
;                                                                    - ecaltime-linuxptp     For PTP / gPTP synchronization over ethernet on Linux
;                                                                                            (device configuration in ecaltime.ini)
;                                                                    - ecaltime-simtime      Simulation time as published by the eCAL Player.
; --------------------------------------------------
[time]
timesync_module_rt                 = "ecaltime-localtime"

; ---------------------------------------------
; PROCESS SETTINGS
; ---------------------------------------------
;
; terminal_emulator                = /usr/bin/x-terminal-emulator -e    command for starting applications with an external terminal emulator. If empty, the command will be ignored. Ignored on Windows.
;                                                                       e.g.  /usr/bin/x-terminal-emulator -e
;                                                                             /usr/bin/gnome-terminal -x
;                                                                             /usr/bin/xterm -e
;
; ---------------------------------------------
[process]
terminal_emulator                  = 

; --------------------------------------------------
; PUBLISHER SETTINGS
; --------------------------------------------------
; use_inproc                       = 0, 1, 2                       Use inner process transport layer (0 = off, 1 = on, 2 = auto, default = 0)
; use_shm                          = 0, 1, 2                       Use shared memory transport layer (0 = off, 1 = on, 2 = auto, default = 2)
; use_tcp                          = 0, 1, 2                       Use tcp transport layer           (0 = off, 1 = on, 2 = auto, default = 0)
; use_udp_mc                       = 0, 1, 2                       Use udp multicast transport layer (0 = off, 1 = on, 2 = auto, default = 2)
;
; memfile_minsize                  = x * 4096 kB                   Default memory file size for new publisher
;
; memfile_reserve                  = 50 .. x %                     Dynamic file size reserve before recreating memory file if topic size changes
;
; memfile_ack_timeout              = 0 .. x ms                     Publisher timeout for ack event from subscriber that memory file content is processed
;
; memfile_buffer_count             = 1 .. x                        Number of parallel used memory file buffers for 1:n publish/subscribe ipc connections (default = 1)
; memfile_zero_copy                = 0, 1                          Allow matching subscriber to access memory file without copying its content in advance (blocking mode)
;
; share_ttype                      = 0, 1                          Share topic type via registration layer
; share_tdesc                      = 0, 1                          Share topic description via registration layer (switch off to disable reflection)
; --------------------------------------------------
[publisher]
use_inproc                         = 0
use_shm                            = 2
use_tcp                            = 0
use_udp_mc                         = 2

memfile_minsize                    = 4096
memfile_reserve                    = 50
memfile_ack_timeout                = 0
memfile_buffer_count               = 1
memfile_zero_copy                  = 0

share_ttype                        = 1
share_tdesc                        = 1

; --------------------------------------------------
; SERVICE SETTINGS
; --------------------------------------------------
; protocol_v0                      = 0, 1                          Support service protocol v0, eCAL 5.11 and older (0 = off, 1 = on)
; protocol_v1                      = 0, 1                          Support service protocol v1, eCAL 5.12 and newer (0 = off, 1 = on)
; --------------------------------------------------
[service]
protocol_v0                        = 1
protocol_v1                        = 1

; --------------------------------------------------
; MONITORING SETTINGS
; --------------------------------------------------
; timeout                          = 1000 + (x * 1000)             Timeout for topic monitoring in ms
; filter_excl                      = ^__.*$                        Topics blacklist as regular expression (will not be monitored)
; filter_incl                      =                               Topics whitelist as regular expression (will be monitored only)
; filter_log_con                   = info, warning, error, fatal   Log messages logged to console (all, info, warning, error, fatal, debug1, debug2, debug3, debug4)
; filter_log_file                  =                               Log messages logged to the file system
; filter_log_udp                   = info, warning, error, fatal   Log messages logged via udp network
; --------------------------------------------------
[monitoring]
timeout                            = 1000
filter_excl                        = ^__.*$
filter_incl                        =
filter_log_con                     = info, warning, error, fatal
filter_log_file                    =
filter_log_udp                     = info, warning, error, fatal

; --------------------------------------------------
; SYS SETTINGS
; --------------------------------------------------
; filter_excl                      = App1,App2                     Apps blacklist to be excluded when importing tasks from cloud
; --------------------------------------------------
[sys]
filter_excl                        = ^eCALSysClient$|^eCALSysGUI$|^eCALSys$

; --------------------------------------------------
; EXPERIMENTAL SETTINGS
; --------------------------------------------------
; shm_monitoring_enabled           = false                         Enable distribution of monitoring/registration information via shared memory
; shm_monitoring_domain            = ecal_monitoring               Domain name for shared memory based monitoring/registration
; shm_monitoring_queue_size        = 1024                          Queue size of monitoring/registration events
; network_monitoring_disabled      = false                         Disable distribution of monitoring/registration information via network
;
; drop_out_of_order_messages       = false                         Enable dropping of payload messages that arrive out of order
; --------------------------------------------------
[experimental]
shm_monitoring_enabled             = false
shm_monitoring_domain              = ecal_mon
shm_monitoring_queue_size          = 1024
network_monitoring_disabled        = false
drop_out_of_order_messages         = false

KerstinKeller commented 5 months ago

Hi @kubbo, sorry, somehow we missed your issue. Which program are you running exactly? Do you have a reproducible sample? Which eCAL version? We have recently released eCAL 5.12.6 and 5.13.2.