Recently we have started seeing rather frequent p4p gateway crashes or lockups at ESS. The gateway was very stable for a long period, but apparently something has changed. We are running version 4.1.7. In most cases the gateway seems to hang rather than crash: in that state it replies to e.g. pvlist by reporting its GUID, but does not serve any PVs. Attached is a stack trace. pva-gateway-gdb-threads.txt
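For completeness, a minimal way to observe this from a client (a sketch using the p4p client API; the PV name and gateway address are placeholders for site-specific values): a GET through the gateway times out even though discovery still answers.

# Liveness probe through the gateway, using the p4p client API.
import os

# Point the client only at the gateway under test (placeholder address).
os.environ['EPICS_PVA_ADDR_LIST'] = '<ip-of-downstream-gateway>'
os.environ['EPICS_PVA_AUTO_ADDR_LIST'] = 'NO'

from p4p.client.thread import Context

ctxt = Context('pva')
try:
    # In the hung state this GET times out even though the server
    # still answers discovery (pvlist reports its GUID).
    value = ctxt.get('SOME:PV:BEHIND:GATEWAY', timeout=5.0)
    print('gateway is serving PVs:', value)
except TimeoutError:
    print('gateway answers discovery but does not serve PVs (hung)')
finally:
    ctxt.close()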
I can add that before moving to 4.1.7 (pvxs 1.2.0) we were running 4.1.0 (pvxs 0.3.0) with the same issues. We hoped that the upgrade to 4.1.7 would solve the issue, but no such luck. The logs give no clear hints as to what happens: by the time it reaches the locked/hanging state, no more log entries are created.
To give further information, this is a user-facing chaining gateway with the following configuration:
{
  "version": 2,
  "readOnly": true,
  "clients": [
    {
      "addrlist": "<ip-to-upstream-gateway-1>",
      "autoaddrlist": false,
      "name": "upstream-gateway-1"
    },
    {
      "addrlist": "<ip-to-upstream-gateway-2>",
      "autoaddrlist": false,
      "name": "upstream-gateway-2"
    }
  ],
  "servers": [
    {
      "addrlist": "",
      "autoaddrlist": false,
      "clients": [
        "upstream-gateway-1",
        "upstream-gateway-2"
      ],
      "interface": [
        "<ip-for-interface-to-listen-on>"
      ],
      "name": "downstream-gateway",
      "statusprefix": "downstream-gateway:"
    }
  ]
}
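As an aside, a small consistency check of such a config can catch server sections that reference undefined client names (a sketch, not something the gateway itself provides; it simply re-reads the JSON above):

# Sketch: check that every client name referenced by a server section
# is defined in the top-level "clients" list of the gateway config.
import json
import sys

with open(sys.argv[1]) as f:        # path to the gateway JSON config file
    cfg = json.load(f)

defined = {c['name'] for c in cfg.get('clients', [])}
for srv in cfg.get('servers', []):
    missing = set(srv.get('clients', [])) - defined
    if missing:
        print(f"server {srv['name']!r} references undefined clients: {sorted(missing)}")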
Attached is a stack trace.
Thank you. From this I think I understand what went wrong; 78c734e81a63d86133238a42827cd9d3ceabc374 should at least mitigate, if not resolve, this issue.
92283e0a500323d99addac59d79eb3e33e539fc8 addresses several similar situations, which could also result in deadlock.
The bug is that a mutex was being held by the PVXCTCP (GW client) thread while it awaited processing of a request on the PVXTCP (GW server) thread. However, that request was queued behind another job which was trying to lock that same mutex. So effectively a deadlock, but not a situation which e.g. lock-order checking would detect.
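To illustrate the pattern in isolation (a simplified sketch in Python, not the actual pvxs code): one thread holds a lock while waiting for a job it posted to a worker's queue, but the queue already holds an earlier job that needs the same lock. The timeout below exists only so the sketch exits; in the gateway there is no timeout and both threads wait forever.

# Simplified illustration of the deadlock (not the pvxs implementation).
import threading, queue

mutex = threading.Lock()
work = queue.Queue()

def worker():                       # stands in for the PVXTCP (GW server) thread
    while True:
        job = work.get()
        if job is None:
            break
        job()

threading.Thread(target=worker, daemon=True).start()

def earlier_job():                  # already queued ahead of our request
    with mutex:                     # blocks: the "client" thread holds the mutex
        pass

done = threading.Event()

# Stands in for the PVXCTCP (GW client) thread.
with mutex:                         # hold the mutex ...
    work.put(earlier_job)
    work.put(done.set)              # our request, queued behind earlier_job
    # ... while waiting for the worker to process our request.
    if not done.wait(timeout=2):
        print('deadlock: the worker is stuck on earlier_job, '
              'which waits for the mutex we are holding')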
In the hope of facilitating testing, I have uploaded 4.1.8a2 with these two fixes.
Thank you, @mdavidsaver. I have installed 4.1.8a2 and it has been running fine under load for an hour now. I will report back if we see any further issues.
We had another issue with a locked state, this time running p4p 4.1.8a2. Attached are the thread backtraces.
This looks like another very similar issue. e3fc2107fd73b4dc9dc61c4827e76172ed99e1e4 in 4.1.8a3 is another attempt.
Thank you, @mdavidsaver. To report back: we have been running 4.1.8a3 on our troublesome gateway since last Thursday (2023-07-06) and we have seen no issues yet. It may be too early to write off the problem completely, but it has definitely had a good run over the last few days. It should be mentioned that we have also mitigated the problem by scaling out to 3 parallel gateway instances, but we saw problems with that setup too before going to 4.1.8a3.
I'd say we wait a few more days, and if it's still running fine we can close the issue.
I have pushed out 4.1.8.
Thank you, @mdavidsaver. 4.1.8a3 has been running solidly at ESS for almost 2 weeks now. You can close this ticket.