ChimeraTK / ControlSystemAdapter-OPC-UA-Adapter

The OPC-UA implementation for the ControlSystemAdapter
GNU Lesser General Public License v3.0

OPC-UA-Adapter crash possibly caused by network problems #42

Closed: zenker closed this issue 5 months ago

zenker commented 5 years ago

Yesterday two of the servers crashed. We had problems with one of our UALinks, which manages the communication between ProfiNet and OPC UA, so the crashes may have been triggered by that. Two different types of ChimeraTK servers crashed, so the cause was definitely the OPC-UA adapter. On the other hand, only 4 out of 7 servers crashed.

#0  0x00007fe8eaaa1428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
#1  0x00007fe8eaaa302a in __GI_abort () at abort.c:89
#2  0x00007fe8eaae37ea in __libc_message (do_abort=2, fmt=fmt@entry=0x7fe8eabfced8 "*** Error in '%s': %s: 0x%s ***\n") at ../sysdeps/posix/libc_fatal.c:175
#3  0x00007fe8eaaea9dc in malloc_printerr (ar_ptr=0x7fe8c0000020, ptr=0x7fe8c0054000, str=0x7fe8eabf9c75 "corrupted size vs. prev_size", action=<optimized out>) at malloc.c:5006
#4  malloc_consolidate (av=av@entry=0x7fe8c0000020) at malloc.c:4183
#5  0x00007fe8eaaedcde in _int_malloc (av=av@entry=0x7fe8c0000020, bytes=bytes@entry=65535) at malloc.c:3450
#6  0x00007fe8eaaf0184 in __GI___libc_malloc (bytes=bytes@entry=65535) at malloc.c:2913
#7  0x00007fe8ee43dcb4 in UA_ByteString_allocBuffer (bs=0x7fe8d9335bd0, length=65535) at /build/libchimeratk-controlsystemadapter-opcuaadapter-01.09xenial1.00/src/open62541.c:4946
#8  0x00007fe8ee4438c1 in UA_SecureChannel_sendBinaryMessage (channel=0x7fe8c001ae90, requestId=1600325, content=content@entry=0x7fe8c026c900, contentType=contentType@entry=0x7fe8ee682a00 )
    at /build/libchimeratk-controlsystemadapter-opcuaadapter-01.09xenial1.00/src/open62541.c:15356
#9  0x00007fe8ee45281a in UA_Subscription_publishCallback (server=0x864f150, sub=0x7fe8c0013c00) at /build/libchimeratk-controlsystemadapter-opcuaadapter-01.09xenial1.00/src/open62541.c:24020
#10 0x00007fe8ee45e00b in processRepeatedJobs (dispatched=<optimized out>, current=12812888767592, server=0x864f150) at /build/libchimeratk-controlsystemadapter-opcuaadapter-01.09xenial1.00/src/open62541.c:18175
#11 UA_Server_run_iterate (server=0x864f150, waitInternal=waitInternal@entry=true) at /build/libchimeratk-controlsystemadapter-opcuaadapter-01.09xenial1.00/src/open62541.c:18521
#12 0x00007fe8ee430322 in ua_uaadapter::workerThread (this=0x8644f00) at /build/libchimeratk-controlsystemadapter-opcuaadapter-01.09xenial1.00/src/ua_adapter.cpp:207
#13 0x00007fe8ee410bd6 in ipc_managed_object_callWorker (theClass=<optimized out>) at /build/libchimeratk-controlsystemadapter-opcuaadapter-01.09xenial1.00/src/ipc_managed_object.cpp:27
#14 0x00007fe8eb40dc80 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#15 0x00007fe8ecd656ba in start_thread (arg=0x7fe8d9336700) at pthread_create.c:333
#16 0x00007fe8eab7341d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

I'm not 100% sure, but the crash might have happened when the UALink was disconnected, which would mean a client disappeared without closing the session or cleaning anything up.
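
If someone wants to test that hypothesis, a client that "vanishes" can be scripted. The sketch below is an assumption on my part, not part of this report: it uses the client and subscription API of current open62541 releases (the adapter bundles an older open62541, where e.g. UA_Client_new(UA_ClientConfig_default) would be needed instead), connects to a placeholder endpoint, creates a subscription, and then exits without UA_Client_disconnect, so the session is never closed cleanly:

/* vanish_client.c -- hypothetical repro helper, not part of the adapter. */
#include "open62541.h"   /* single-file distribution; v1.x splits this into <open62541/client*.h> */
#include <unistd.h>

int main(void) {
    UA_Client *client = UA_Client_new();
    UA_ClientConfig_setDefault(UA_Client_getConfig(client));

    /* Placeholder endpoint for the ChimeraTK server under test. */
    if(UA_Client_connect(client, "opc.tcp://localhost:4840") != UA_STATUSCODE_GOOD)
        return 1;

    /* Create a subscription so the server's publish callback runs for this
       session (frame #9 of the backtrace is UA_Subscription_publishCallback). */
    UA_CreateSubscriptionRequest request = UA_CreateSubscriptionRequest_default();
    UA_Client_Subscriptions_create(client, request, NULL, NULL, NULL);

    /* Disappear without UA_Client_disconnect/UA_Client_delete: the session and
       subscription are left behind on the server until they time out. */
    _exit(0);
}

Killing the process with SIGKILL, or pulling the network cable, would model the disappearance even more closely, since _exit still closes the TCP socket in an orderly way.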

zenker commented 5 years ago

I realised that the UALink was only a client to one of the server types. So the problem might be caused by network trouble created by the UALink, but not directly by a client that disconnected.

jpfr commented 5 years ago

The error message "corrupted size vs. prev_size" sounds bad. Usually malloc behaves in one of two ways: it returns the requested memory, or it fails cleanly and returns NULL. An abort with this message means glibc's consistency checks found the heap's internal bookkeeping overwritten.

So the error message hints at a more serious memory corruption elsewhere in the process. Can the problem be reproduced? Then we can run with Valgrind or AddressSanitizer instrumentation to get more insight.
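For context, "corrupted size vs. prev_size" is the classic signature of a heap-buffer overflow: some write ran past the end of an allocation and trampled the bookkeeping of the neighbouring chunk. The following stand-alone snippet is an illustration of that bug class (not the adapter's actual defect) and shows how the suggested instrumentation pinpoints it:

/* overflow.c -- hypothetical illustration of the bug class glibc is reporting.
 *
 *   gcc -g overflow.c && ./a.out
 *       -> glibc abort with a heap-corruption message of this family
 *   gcc -g -fsanitize=address overflow.c && ./a.out
 *       -> AddressSanitizer report naming the file/line of the bad write
 *   valgrind ./a.out
 *       -> "Invalid write" reported at the memset
 */
#include <stdlib.h>
#include <string.h>

int main(void) {
    char *a = malloc(32);
    char *b = malloc(32);   /* b's chunk header sits right behind a's usable space */
    memset(a, 'X', 64);     /* overflow: tramples b's size/prev_size bookkeeping */
    free(a);                /* glibc's integrity checks abort here or on the next free */
    free(b);
    return 0;
}

Valgrind and ASan catch the overflow at the moment of the bad write, which is why they give much more insight than the glibc abort, which only fires later when malloc stumbles over the damaged metadata.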

zenker commented 5 years ago

After checking more carefully, only servers of one type crashed. The UALink was a client of all of these servers. One server crashed 30 min before the others. About 4 min before the first server crashed, the system load started to rise rapidly (from 10 to about 250) until the crash; afterwards the load returned to normal values. I could not see anything special in the network traffic during that period, and the hard drives had no problems such as a full disk. The backtrace above is from that first crash. The crashes 30 min later might have had a different cause (e.g. #44).

zenker commented 5 months ago

Not observed any more.