CodeConstruct / dbus-sensors

D-Bus configurable sensor scanning applications
Apache License 2.0

MCTP recoverability chokes the main thread #14

Open drakedog2008 opened 3 months ago

drakedog2008 commented 3 months ago

When the MCTP layer of the device is working but the NVMe MI layer is not, nvmed gets stuck in the MCTP recovery loop and consumes most of the CPU slices on the main thread. D-Bus clients will time out on:

- querying the health status of the bad device
- querying the health status of the other, good devices

root@mvbce5-nfd11:~# time busctl tree xyz.openbmc_project.NVMe
└─ /xyz
  └─ /xyz/openbmc_project
    ├─ /xyz/openbmc_project/inventory
    │ └─ /xyz/openbmc_project/inventory/system
    │   └─ /xyz/openbmc_project/inventory/system/board
    │     ├─ /xyz/openbmc_project/inventory/system/board/Chassis_1
    │     │ ├─ /xyz/openbmc_project/inventory/system/board/Chassis_1/Chassis_1_NVMe_1
    │     │ ├─ /xyz/openbmc_project/inventory/system/board/Chassis_1/Chassis_1_NVMe_2
    │     │ │ ├─ /xyz/openbmc_project/inventory/system/board/Chassis_1/Chassis_1_NVMe_2/controllers
    │     │ │ │ ├─ /xyz/openbmc_project/inventory/system/board/Chassis_1/Chassis_1_NVMe_2/controllers/0
    │     │ │ │ ├─ /xyz/openbmc_project/inventory/system/board/Chassis_1/Chassis_1_NVMe_2/controllers/1
    │     │ │ │ └─ /xyz/openbmc_project/inventory/system/board/Chassis_1/Chassis_1_NVMe_2/controllers/2
    │     │ │ └─ /xyz/openbmc_project/inventory/system/board/Chassis_1/Chassis_1_NVMe_2/volumes
    │     │ │   └─ /xyz/openbmc_project/inventory/system/board/Chassis_1/Chassis_1_NVMe_2/volumes/1
 ...
    └─ /xyz/openbmc_project/sensors
      └─ /xyz/openbmc_project/sensors/temperature
        ├─ /xyz/openbmc_project/sensors/temperature/NVMe_2_NVMe_1
...
real    0m19.398s
user    0m0.080s
sys 0m0.021s
drakedog2008 commented 3 months ago

The nvmed log:

Apr 25 20:41:00 machineA nvmesensor[16144]: [network: 1, EID: 18 | bus: 8, address: 48]: Available
Apr 25 20:41:06 machineA nvmesensor[16144]: mctp: net 1 eid 18, Admin op:0x06, cdw(1:0x0, 10:0x1, 11:0x0, 12:0x0 13:0x0, 14:0x0, 15:0x0) T:5000ms, elapsed:6003ms
Apr 25 20:41:06 machineA nvmesensor[16144]: Identify Controller failed, no quirks applied
Apr 25 20:41:07 machineA nvmesensor[16144]: Failed reading subsystem info failing: Success
Apr 25 20:41:07 machineA nvmesensor[16144]: [network: 1, EID: 18 | bus: 8, address: 48]Failed setting up MTU for the MCTP endpoint.
Apr 25 20:41:07 machineA nvmesensor[16144]: [network: 1, EID: 18 | bus: 8, address: 48]: Degraded
Apr 25 20:41:07 machineA nvmesensor[16144]: [network: 1, EID: 18 | bus: 8, address: 48]start MCTP closure
Apr 25 20:41:07 machineA nvmesensor[16144]: [bus: 8, address: 48] end MCTP closure
Apr 25 20:41:07 machineA nvmesensor[16144]: [network: 1, EID: 18 | bus: 8, address: 48]: Available
Apr 25 20:41:13 machineA nvmesensor[16144]: mctp: net 1 eid 18, Admin op:0x06, cdw(1:0x0, 10:0x1, 11:0x0, 12:0x0 13:0x0, 14:0x0, 15:0x0) T:5000ms, elapsed:6003ms
Apr 25 20:41:13 machineA nvmesensor[16144]: Identify Controller failed, no quirks applied
Apr 25 20:41:13 machineA nvmesensor[16144]: Failed reading subsystem info failing: Success
Apr 25 20:41:13 machineA nvmesensor[16144]: [network: 1, EID: 18 | bus: 8, address: 48]Failed setting up MTU for the MCTP endpoint.
Apr 25 20:41:13 machineA nvmesensor[16144]: [network: 1, EID: 18 | bus: 8, address: 48]: Degraded
Apr 25 20:41:13 machineA nvmesensor[16144]: [network: 1, EID: 18 | bus: 8, address: 48]start MCTP closure
Apr 25 20:41:13 machineA nvmesensor[16144]: [bus: 8, address: 48] end MCTP closure
Apr 25 20:41:13 machineA nvmesensor[16144]: [network: 1, EID: 18 | bus: 8, address: 48]: Available
Apr 25 20:41:19 machineA nvmesensor[16144]: mctp: net 1 eid 18, Admin op:0x06, cdw(1:0x0, 10:0x1, 11:0x0, 12:0x0 13:0x0, 14:0x0, 15:0x0) T:5000ms, elapsed:6010ms
Apr 25 20:41:19 machineA nvmesensor[16144]: Identify Controller failed, no quirks applied
Apr 25 20:41:20 machineA nvmesensor[16144]: Failed reading subsystem info failing: Success
Apr 25 20:41:20 machineA nvmesensor[16144]: [network: 1, EID: 18 | bus: 8, address: 48]Failed setting up MTU for the MCTP endpoint.
Apr 25 20:41:20 machineA nvmesensor[16144]: [network: 1, EID: 18 | bus: 8, address: 48]: Degraded
Apr 25 20:41:20 machineA nvmesensor[16144]: [network: 1, EID: 18 | bus: 8, address: 48]start MCTP closure
Apr 25 20:41:20 machineA nvmesensor[16144]: [bus: 8, address: 48] end MCTP closure
Apr 25 20:41:20 machineA nvmesensor[16144]: [network: 1, EID: 18 | bus: 8, address: 48]: Available
Apr 25 20:41:26 machineA nvmesensor[16144]: mctp: net 1 eid 18, Admin op:0x06, cdw(1:0x0, 10:0x1, 11:0x0, 12:0x0 13:0x0, 14:0x0, 15:0x0) T:5000ms, elapsed:6007ms
Apr 25 20:41:26 machineA nvmesensor[16144]: Identify Controller failed, no quirks applied
Apr 25 20:41:26 machineA nvmesensor[16144]: Failed reading subsystem info failing: Success
Apr 25 20:41:26 machineA nvmesensor[16144]: [network: 1, EID: 18 | bus: 8, address: 48]Failed setting up MTU for the MCTP endpoint.
Apr 25 20:41:26 machineA nvmesensor[16144]: [network: 1, EID: 18 | bus: 8, address: 48]: Degraded
Apr 25 20:41:26 machineA nvmesensor[16144]: [network: 1, EID: 18 | bus: 8, address: 48]start MCTP closure
Apr 25 20:41:26 machineA nvmesensor[16144]: [bus: 8, address: 48] end MCTP closure
Apr 25 20:41:26 machineA nvmesensor[16144]: [network: 1, EID: 18 | bus: 8, address: 48]: Available
Apr 25 20:41:32 machineA nvmesensor[16144]: mctp: net 1 eid 18, Admin op:0x06, cdw(1:0x0, 10:0x1, 11:0x0, 12:0x0 13:0x0, 14:0x0, 15:0x0) T:5000ms, elapsed:6009ms
Apr 25 20:41:32 machineA nvmesensor[16144]: Identify Controller failed, no quirks applied
Apr 25 20:41:33 machineA nvmesensor[16144]: Failed reading subsystem info failing: Success
Apr 25 20:41:33 machineA nvmesensor[16144]: [network: 1, EID: 18 | bus: 8, address: 48]Failed setting up MTU for the MCTP endpoint.
Apr 25 20:41:33 machineA nvmesensor[16144]: [network: 1, EID: 18 | bus: 8, address: 48]: Degraded
Apr 25 20:41:33 machineA nvmesensor[16144]: [network: 1, EID: 18 | bus: 8, address: 48]start MCTP closure
Apr 25 20:41:33 machineA nvmesensor[16144]: [bus: 8, address: 48] end MCTP closure
Apr 25 20:41:33 machineA nvmesensor[16144]: [network: 1, EID: 18 | bus: 8, address: 48]: Available
drakedog2008 commented 3 months ago

Apr 25 20:41:33 machineA nvmesensor[16144]: Failed reading subsystem info failing: Success

This MI command runs on the worker thread, so the async code (on the main thread) for the device will yield, during which the D-Bus call can be handled. So the D-Bus call should take at most ~5 seconds.
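To illustrate the threading model being described, here is a minimal sketch assuming boost::asio; the function and variable names are invented and this is not the nvmesensor code:

#include <boost/asio/io_context.hpp>
#include <boost/asio/post.hpp>

#include <functional>

// Sketch: run a blocking MI command on a worker io_context so the main
// io_context stays free to answer D-Bus queries, then resume on the main
// thread with the result.
void submitMiCommand(boost::asio::io_context& mainCtx,
                     boost::asio::io_context& workerCtx,
                     std::function<int()> blockingMiCall,
                     std::function<void(int)> completion)
{
    boost::asio::post(workerCtx, [&mainCtx, blockingMiCall, completion]() {
        // This may block for ~5s (the MI timeout) on a dead NVMe layer.
        int rc = blockingMiCall();
        // Hop back to the main thread to publish the result; while we were
        // blocked, the main thread could service busctl/D-Bus requests.
        boost::asio::post(mainCtx, [rc, completion]() { completion(rc); });
    });
}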

Why did the D-Bus tree call take ~20 seconds?

drakedog2008 commented 3 months ago

OK, that 20 seconds may be because we have multiple devices stuck in the recovery loop.

When I reduce the number of broken devices to 1, busctl tree takes somewhere in the 2 to 5 second range.

drakedog2008 commented 3 months ago

So it seems that the MI worker thread returns too quickly, so the gap is not long enough to service all of the async tasks queued on the main thread.

amboar commented 2 months ago

I'm working to understand this. My current approach is to try to capture the problem in an integration test case so we have something that we can reliably reproduce. I may have to change tack if that's not completely feasible, but I want to explore it.

amboar commented 2 months ago
Apr 25 20:41:26 machineA nvmesensor[16144]: mctp: net 1 eid 18, Admin op:0x06, cdw(1:0x0, 10:0x1, 11:0x0, 12:0x0 13:0x0, 14:0x0, 15:0x0) T:5000ms, elapsed:6007ms

What is emitting this log message? And what's its relationship to the main thread?

However, regardless of the answer to that, we probably want to back off and limit the attempts to retry the endpoint optimisation. A tight-ish infinite loop of it failing isn't helpful to anyone.
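A minimal sketch of that kind of back-off, assuming boost::asio and with invented names and intervals (not code from the repo):

#include <boost/asio/io_context.hpp>
#include <boost/asio/steady_timer.hpp>

#include <algorithm>
#include <chrono>
#include <functional>
#include <memory>

// Sketch: delay the next endpoint-optimisation retry with exponential
// back-off, capped at 60s, instead of retrying immediately in a tight loop.
void scheduleRecoveryRetry(boost::asio::io_context& io, unsigned attempt,
                           std::function<void(unsigned)> retry)
{
    std::chrono::seconds delay{1u << std::min(attempt, 6u)};
    delay = std::min(delay, std::chrono::seconds{60});

    auto timer = std::make_shared<boost::asio::steady_timer>(io, delay);
    timer->async_wait(
        [timer, attempt, retry](const boost::system::error_code& ec) {
        if (!ec)
        {
            retry(attempt + 1); // try the optimisation again, later
        }
    });
}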

drakedog2008 commented 2 months ago

I'm working to understand this. My current approach is to try to capture the problem in an integration test case so we have something that we can reliably reproduce. I may have to change tack if that's not completely feasible, but I want to explore it.

Thanks.

Mostly, these software issues are caused by device malfunction or partial functionality, which is not easy to capture in integration tests without error injection.

For this one in particular, the device's MCTP layer is serving but the NVMe layer is dead. So nvmed is stuck at

https://gbmc.googlesource.com/dbus-sensors/+/refs/heads/google-dev/src/NVMeMi.cpp#192

drakedog2008 commented 2 months ago

What is emitting this log message?

That is printed via a libnvme patch: we log the elapsed time for any libnvme call that exceeds a threshold. It shows that the quirk id_cntrl takes ~6s and times out.

And what's its relationship to the main thread?

The libnvme quirk is handled within libnvme as part of endpoint setup. Endpoint setup is supposed to perform only local operations (i.e. non-IO), so it is executed on the main thread instead of the worker thread.

https://gbmc.googlesource.com/dbus-sensors/+/refs/heads/google-dev/src/NVMeMi.cpp#113
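For context, the slow-call timing log mentioned above could be implemented roughly like this (a sketch only, not the actual libnvme patch; names and the log format are illustrative):

#include <chrono>
#include <cstdio>
#include <functional>

// Sketch: time a blocking libnvme call and log it when the elapsed time
// exceeds a threshold, similar in spirit to the log lines shown earlier.
int timedMiCall(const char* what, std::chrono::milliseconds threshold,
                const std::function<int()>& call)
{
    auto start = std::chrono::steady_clock::now();
    int rc = call();
    auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(
        std::chrono::steady_clock::now() - start);
    if (elapsed > threshold)
    {
        std::fprintf(stderr, "%s T:%lldms, elapsed:%lldms\n", what,
                     static_cast<long long>(threshold.count()),
                     static_cast<long long>(elapsed.count()));
    }
    return rc;
}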

amboar commented 2 months ago

@drakedog2008 sorry for the delay. I've put together a series of fixes that you can find in the dev/issue-14 branch. Here's the series view.

I mentioned above that I wanted to develop an integration test to capture the problem, and then use that to demonstrate that it is resolved. The first part of the series does some leg-work that allows the test cases to run both outside and inside of the OpenBMC CI container, which made it easier for me to develop the test. I expect this makes it easier for others to write and debug tests as well.

There are a few necessary patches for the OpenBMC CI container that go with the change above:

https://gerrit.openbmc.org/q/topic:%22test-setups%22+repo:openbmc/openbmc-build-scripts

You will need to include these in your test environment as well if you rely on openbmc-build-scripts.

After dealing with the tests, the second phase of the series performs some refactoring to make what is currently setupMctpDevice() testable and maintainable in the face of adding further functionality. It does this by reworking the implementation into a collection of methods on the NVMeDevice type.

Beyond that, the series adds a test that exercises the problem in NVMeMi, and then resolves the problem by inserting a grace period on endpoint recovery. The current implementation sets that grace period at 5 seconds, but I figured we could discuss changes to that here if you think another value would be better suited.
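As a rough illustration of the grace-period idea (a sketch with assumed names, not the actual code from dev/issue-14):

#include <boost/asio/steady_timer.hpp>

#include <chrono>
#include <functional>

// Sketch: after a failed recovery, wait a fixed grace period before the
// endpoint is set up again, rather than immediately going back to Available.
static constexpr std::chrono::seconds mctpRecoveryGracePeriod{5};

void deferRecovery(boost::asio::steady_timer& timer,
                   std::function<void()> restartSetup)
{
    timer.expires_after(mctpRecoveryGracePeriod);
    timer.async_wait([restartSetup](const boost::system::error_code& ec) {
        if (!ec)
        {
            restartSetup(); // re-run MCTP endpoint setup after the delay
        }
    });
}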

As for the timeout on the probe IO generated by libnvme, for now I applied the following patch:

diff --git a/src/nvme/mi-mctp.c b/src/nvme/mi-mctp.c
index 0c5972a5677c..0886e67368b8 100644
--- a/src/nvme/mi-mctp.c
+++ b/src/nvme/mi-mctp.c
@@ -501,12 +501,7 @@ nvme_mi_ep_t nvme_mi_open_mctp(nvme_root_t root, unsigned int netid, __u8 eid)
        ep->transport = &nvme_mi_transport_mctp;
        ep->transport_data = mctp;

-       /* Assuming an i2c transport at 100kHz, smallest MTU (64+4). Given
-        * a worst-case clock stretch, and largest-sized packets, we can
-        * expect up to 1.6s per command/response pair. Allowing for a
-        * retry or two (handled by lower layers), 5s is a reasonable timeout.
-        */
-       ep->timeout = 5000;
+       ep->timeout = 1000;

        nvme_mi_ep_probe(ep);

I intend to start a discussion upstream about making the probe process explicit (i.e. adding an API to invoke it). With that we could move the IO onto the worker thread, avoiding the blocking in nvme_mi_open_mctp().
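For what it's worth, with such an API the call site could look something like this purely hypothetical sketch; it assumes nvme_mi_open_mctp() no longer probes and that the probe entry point is exposed to applications, which is not the case today:

#include <boost/asio/io_context.hpp>
#include <boost/asio/post.hpp>

#include <libnvme-mi.h>

// Hypothetical: probe the endpoint on the worker thread so the blocking IO
// no longer happens inside nvme_mi_open_mctp() on the main thread.
void probeEndpointAsync(boost::asio::io_context& mainCtx,
                        boost::asio::io_context& workerCtx, nvme_mi_ep_t ep)
{
    boost::asio::post(workerCtx, [&mainCtx, ep]() {
        nvme_mi_ep_probe(ep); // blocking probe IO, now off the main thread
        boost::asio::post(mainCtx, [ep]() {
            // continue endpoint setup (MTU, controller enumeration) here
            (void)ep;
        });
    });
}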