epics-base / p4p

Python bindings for the PVAccess network client and server.
BSD 3-Clause "New" or "Revised" License

p4p gateway crashes #112

Closed. ttkorhonen closed this issue 1 year ago

ttkorhonen commented 1 year ago

Recently we have started seeing rather frequent p4p gateway crashes or lockups at ESS. The gateway was very stable for a long period, but apparently something has changed. We are running version 4.1.7. In most cases the gateway seems to hang rather than crash: in that state it still replies to e.g. pvlist by reporting its GUID, but it does not serve any PVs. Attached is a stack trace: pva-gateway-gdb-threads.txt

aharrisson commented 1 year ago

I can add that before moving to 4.1.7 (pvxs 1.2.0) we were running 4.1.0 (pvxs 0.3.0) with the same issues. We hoped that the upgrade to 4.1.7 would solve the problem, but no such luck. The logs give no clear hints as to what happens: by the time the gateway reaches the locked/hanging state, no more log entries are created.

aharrisson commented 1 year ago

To give further information, this is a user-facing chaining gateway with the following configuration:

{
  "version": 2,
  "readOnly": true,
  "clients": [
    {
      "addrlist": "<ip-to-upstream-gateway-1>",
      "autoaddrlist": false,
      "name": "upstream-gateway-1"
    },
    {
      "addrlist": "<ip-to-upstream-gateway-2>",
      "autoaddrlist": false,
      "name": "upstream-gateway-2"
    }
  ],
  "servers": [
    {
      "addrlist": "",
      "autoaddrlist": false,
      "clients": [
        "upstream-gateway-1",
        "upstream-gateway-2"
      ],
      "interface": [
        "<ip-for-interface-to-listen-on>"
      ],
      "name": "downstream-gateway",
      "statusprefix": "downstream-gateway:"
    }
  ]
}
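
For reference, a quick way to sanity-check a config like this before deploying it is to cross-check that every name listed under servers[].clients is defined in the top-level clients list. The snippet below is a minimal sketch, not part of p4p itself, and assumes the file above is saved under the hypothetical name gateway.json:

import json

# Load the gateway config shown above (hypothetical filename).
with open("gateway.json") as f:
    cfg = json.load(f)

# Every server's "clients" entry must name a client defined at the top level.
client_names = {c["name"] for c in cfg["clients"]}
for srv in cfg["servers"]:
    missing = [name for name in srv["clients"] if name not in client_names]
    if missing:
        raise SystemExit(f"server {srv['name']!r} references undefined clients: {missing}")

print("config OK, clients:", ", ".join(sorted(client_names)))

The gateway itself is then normally started by pointing the pvagw entry point installed with p4p at this file.
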
mdavidsaver commented 1 year ago

Attached is a stack trace.

Thank you. From this I think I understand what went wrong: 78c734e81a63d86133238a42827cd9d3ceabc374 should at least mitigate, if not resolve, this issue.

92283e0a500323d99addac59d79eb3e33e539fc8 addresses several similar situations, which could result in deadlock as well.

The bug is that a mutex is held by the PVXCTCP (GW client) thread while it waits for a request to be processed on the PVXTCP (GW server) thread. However, that request sits in the processing queue behind another job which is trying to lock that same mutex. So it is effectively a deadlock, but not a situation which, e.g., lock-order checking would detect.
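
The pattern reduces to one lock plus a single-threaded work queue. The sketch below (illustrative Python, not the actual pvxs code) reproduces it: the "client" side takes the lock and then waits for its queued request, but the worker draining the queue is stuck on an earlier job that wants the same lock.

import threading, queue

mutex = threading.Lock()
work_queue = queue.Queue()   # drained in order by a single worker thread

def worker():
    while True:
        work_queue.get()()   # run queued jobs one at a time, in FIFO order

def earlier_job():
    # Already queued ahead of the client's request; needs the same mutex.
    with mutex:
        pass

done = threading.Event()

mutex.acquire()              # the "client" thread takes the mutex...
work_queue.put(earlier_job)  # ...a job wanting that mutex is queued ahead...
work_queue.put(done.set)     # ...and the client's own request lands behind it.
threading.Thread(target=worker, daemon=True).start()

# The worker blocks inside earlier_job() and never reaches done.set,
# so waiting here while still holding the mutex would hang forever.
print("request completed:", done.wait(timeout=2))   # -> False
mutex.release()

Only one lock is ever involved, so a lock-order checker sees nothing wrong; the cycle runs through the work queue rather than through a second mutex.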

mdavidsaver commented 1 year ago

In the hope of facilitating testing, I have uploaded 4.1.8a2 with these two fixes.

aharrisson commented 1 year ago

Thank you, @mdavidsaver. I have installed 4.1.8a2 and it has been running fine under load for an hour now. I will report back if we see any further issues.

aharrisson commented 1 year ago

We had another issue with a locked state, this time running p4p 4.1.8a2. Attached are the thread backtraces.

mdavidsaver commented 1 year ago

This looks like another, very similar issue. e3fc2107fd73b4dc9dc61c4827e76172ed99e1e4 in 4.1.8a3 is another attempt at a fix.

aharrisson commented 1 year ago

Thank you, @mdavidsaver. To report back: we have been running 4.1.8a3 on our troublesome gateway since last Thursday (2023-07-06) and we have seen no issues yet. It is perhaps too early to write off the problem completely, but the gateway has definitely had a good run over the last few days. It should be mentioned that we have also mitigated the problem by scaling out to 3 parallel gateway instances, but we saw problems with that setup too before going to 4.1.8a3.

I'd say we wait a few more days, and if it's still running fine we can close the issue.

mdavidsaver commented 1 year ago

I have pushed out 4.1.8.

aharrisson commented 1 year ago

Thank you, @mdavidsaver. 4.1.8a3 has been running solidly at ESS for almost 2 weeks now. You can close this ticket.