Closed by zenker 5 months ago
I realised that the UALink was a client of only one of the server types. So the problem might have been caused by network problems created by the UALink, rather than directly by a client that was disconnected.
The error message "corrupted size vs. prev_size" sounds bad. Usually malloc behaves in two ways:
The error message hints at a more serious memory corruption.
Can the problem be reproduced?
Then we can run with valgrind or address sanitizer instrumentation to get more insight.
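To illustrate what such a run would look for, here is a minimal sketch of the kind of bug that can produce this class of glibc abort: a write past the end of a heap block. The function `fillBuffer` is a hypothetical helper made up for this example. It is not taken from ChimeraTK or the OPC-UA adapter, and nothing here claims that the actual crash has this cause.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical helper for illustration only (not ChimeraTK code).
// Allocates `capacity` bytes and writes `count` bytes into the block.
// If count > capacity, the write runs past the end of the allocation and
// may overwrite the allocator's bookkeeping for the neighbouring chunk.
// Returns the number of bytes written, or 0 if nothing was written.
std::size_t fillBuffer(std::size_t capacity, std::size_t count) {
  char* buffer = static_cast<char*>(std::malloc(capacity));
  if (buffer == nullptr) {
    return 0;
  }
  std::memset(buffer, 'A', count); // heap-buffer-overflow when count > capacity
  // Read back one byte so the write is actually used.
  std::size_t written = (count > 0 && buffer[0] == 'A') ? count : 0;
  std::free(buffer);
  return written;
}
```

A call such as `fillBuffer(24, 24)` stays within bounds. A call such as `fillBuffer(24, 40)` is undefined behaviour: an uninstrumented build may appear to work or may abort much later inside `malloc`/`free`. Built with `g++ -g -fsanitize=address -fno-omit-frame-pointer`, the address sanitizer should instead stop at the `memset` and report a heap-buffer-overflow with a stack trace. Running the uninstrumented binary under `valgrind` should flag the same invalid write.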
After checking more carefully, I found that only servers of one type crashed. The UALink was a client of all of these servers. One server crashed 30 min before the others. About 4 min before the first server crashed, the system load started rising rapidly (from 10 to about 250) and kept rising until the server crashed. Afterwards the system load went back to normal values. I could not see anything unusual in the network traffic during that time period. The hard drives also showed no problems such as a full disk. The backtrace is from that crash. The crashes 30 min later might have had a different cause (e.g. #44).
Not observed any more.
Yesterday two of the servers crashed. We had problems with one of our UALinks, which manages communication between ProfiNet and OPC-UA. Maybe the crashes were caused by that problem. Two different types of ChimeraTK servers crashed, so the cause was definitely the OPC-UA adapter. On the other hand, only 4 out of 7 crashed.
I'm not 100% sure, but the crash might have happened when the UALink was disconnected, which would mean a client disappeared without closing the session or cleaning up anything.