CodeConstruct / dbus-sensors

D-Bus configurable sensor scanning applications
Apache License 2.0

Failed to delete/recreate subsystem #1

Closed drakedog2008 closed 8 months ago

drakedog2008 commented 11 months ago

Hello,

When the NVMe device is disconnected, the subsystem should transition into a disabled state with all of its substructure deleted. However, the shared_ptr to the primary controller is held by the subsystem and thus cannot be released:

https://github.com/CodeConstruct/dbus-sensors/blob/dev/nvme-redfish/include/NVMeSubsys.hpp#L124C44-L124C61

drakedog2008 commented 11 months ago

Ideally, we should not assume the subsystem has only one primary controller. Instead, we should scan the controller list, and the secondary controller list should be the indicator for the secondary controller (SController). https://github.com/CodeConstruct/dbus-sensors/blob/dev/nvme-redfish/include/NVMeController.hpp#L105C30-L105C50

drakedog2008 commented 11 months ago

Besides, I added a unit test targeting resource release.

https://gbmc.googlesource.com/meta-gbmc-staging/+/refs/heads/gbmc-release-23.18.x/recipes-phosphor/sensors/dbus-sensors/0030-nvmesensor-add-unit-test.patch

That unit test should be able to catch issues like this one.

Creating another ticket for the enforcement.

https://github.com/CodeConstruct/dbus-sensors/issues/2

amboar commented 11 months ago

So there are a couple of ways to represent the NVMe device being disconnected:

  1. Surprise unplug
  2. Making the FRU VPD disappear

It's unclear to me whether the current integration makes the FRU VPD disappear from e.g. fru-device's hosted objects on device disconnection. However, if it does, we can approximate the impact of this event by issuing systemctl stop xyz.openbmc_project.FruDevice. As it turns out, this appears to work fine; the controller, volume and sensor objects are all destroyed and recreated as we stop and start the fru-device service.

However, surprise unplug has symptoms that align with what you're describing. Prior to the surprise unplug event, the subsystem, controller, drive and sensor objects have all been instantiated and are monitoring/managing the drive. After the surprise unplug, the periodic poll via the ctempTimer ends in a communication failure as expected. The ctempTimer callback marks the subsystem and the sensor as non-functional (markFunctional(false)):

https://github.com/CodeConstruct/dbus-sensors/blob/dev/nvme-redfish-fixes/src/NVMeSubsys.cpp#L465-L467

However, this doesn't result in the controller, volume and sensor objects being removed from D-Bus as markFunctional(false) fails to clean up the primaryController, volumes and attached members of NVMeSubsystem.

With the following patch we get closer to the desired outcome:

diff --git a/src/NVMeSubsys.cpp b/src/NVMeSubsys.cpp
index 9768f725ea24..59cc031b4b2a 100644
--- a/src/NVMeSubsys.cpp
+++ b/src/NVMeSubsys.cpp
@@ -138,6 +138,9 @@ void NVMeSubsystem::markFunctional(bool toggle)
         // TODO: the controller should be stopped after controller level polling
         // is enabled

+        attached.clear();
+        volumes.clear();
+        primaryController.reset();
         controllers.clear();
         plugin.reset();
         return;

With the current implementation we need the NVMeSubsystem instance to stay alive to continue the periodic polling for the device in the event that it once again becomes functional. It seems reasonable that the NVMeSubsystem instance should only be destructed if the FRU VPD is removed from D-Bus. However, the current implementation of the ctempTimer callback does not address re-establishing MCTP connectivity to the drive if it has been power-cycled. As such, while we continue to poll for the drive's recovery, it will not succeed without some other intervention.

I think this amounts to some refactoring to address the connectivity problem. A concept that might be worthwhile is introducing an NVMeEndpoint class that handles establishing connectivity and device probing, constructing and destructing an instance of NVMeSubsystem as required. But, I feel I need to think about that some more.

Some collateral damage from the surprise unplug is that the I2C bus may lock up. This is likely resolved by the following patch:

https://lore.kernel.org/openbmc/ZRZ%2FObZmntMLw2r+@ninjato/T/#med947449c015eee678c3173bc63df1e5286b2cfc

jinliangw commented 11 months ago

Besides the primaryController.reset(), I think we still need to check whether primaryController is valid before using it, for example primaryController->getMiCtrl() in https://github.com/CodeConstruct/dbus-sensors/blob/dev/nvme-redfish/src/NVMeSubsys.cpp#L1027. Since the Erase method belongs to the subsystem, it may be called even when there are no controllers.

mkj commented 11 months ago

There's a change in the nvme-redfish-fixes branch to access it through a method https://github.com/CodeConstruct/dbus-sensors/commit/794103cd4d978d4f2a5209c3d45655a06dea804d

https://github.com/CodeConstruct/dbus-sensors/blob/dev/nvme-redfish-fixes/src/NVMeSubsys.cpp#L1029

drakedog2008 commented 11 months ago

Re the comment:

That is a good observation.

Actually, there are different levels of reset that trigger different disconnections and result in different behaviors in nvmed:

| Reset level | Possible causes | nvmed symptom |
| --- | --- | --- |
| Bus reset | kernel bus crash, I2C clock stretch | FRU dismissed; NVMeMi and NVMeSubsystem instances destroyed |
| Device/MCTP reset | device panic | NVMeMi::isMCTPconnect() == false |
| Subsystem reset | mi_subsystem_shutdown/reset | NVMeSubsystem not functional |
| Controller offline/reset | SR-IOV disable, CC.EN = 0 or CC.SHT = 1 | Controller transitions from NVMeControllerEnabled to NVMeController (not implemented yet) |

However, this doesn't result in the controller, volume and sensor objects being removed from D-Bus as markFunctional(false) fails to clean up the primaryController, volumes and attached members of NVMeSubsystem.

Yes, all substructure inside NVMeSubsystem should be reclaimed in the same way as the controllers, including namespaces (NS) and anything else such as Endurance Groups. The only exemption is the Drive D-Bus interface: it should continue to exist, but its Health status should transition to absent (functional == false and available == false).

I think this amounts to some refactoring to address the connectivity problem. A concept that might be worthwhile is introducing an NVMeEndpoint class that handles establishing connectivity and device probing, constructing and destructing an instance of NVMeSubsystem as required. But, I feel I need to think about that some more.

I think what you're asking for here is similar to the NVMeMi class? It is an abstraction over the MI-over-MCTP API (including message routing and scheduling), and MCTP connectivity polling has been planned inside that class, as I discussed with @jk-ozlabs.

drakedog2008 commented 11 months ago

Besides the primaryController.reset(), I think we still need to check whether primaryController is valid before using it, for example primaryController->getMiCtrl() in https://github.com/CodeConstruct/dbus-sensors/blob/dev/nvme-redfish/src/NVMeSubsys.cpp#L1027. Since the Erase method belongs to the subsystem, it may be called even when there are no controllers.

I would suppose all Admin interfaces (including NS management) should be removed after the subsystem is marked non-functional. Ongoing tasks should be gracefully drained, and then the resources they rely on released (the NS management task relies on the PF in this case).

amboar commented 11 months ago

I think this amounts to some refactoring to address the connectivity problem. A concept that might be worthwhile is introducing an NVMeEndpoint class that handles establishing connectivity and device probing, constructing and destructing an instance of NVMeSubsystem as required. But, I feel I need to think about that some more.

I think what you're asking for here is similar to the NVMeMi class? It is an abstraction over the MI-over-MCTP API (including message routing and scheduling), and MCTP connectivity polling has been planned inside that class, as I discussed with @jk-ozlabs.

Okay, I'll chat with Jeremy about that when I have the opportunity. I'm still building my understanding of the code, and the NVMeMi abstraction is something I hadn't yet focused on. I will see whether it aligns as you suggest.

amboar commented 11 months ago

Yes, all substructure inside NVMeSubsystem should be reclaimed in the same way as the controllers, including namespaces (NS) and anything else such as Endurance Groups. The only exemption is the Drive D-Bus interface: it should continue to exist, but its Health status should transition to absent (functional == false and available == false).

Okay. So short of tackling the work to deal with re-establishing MCTP connectivity, I think your requirements are met by the three line patch I outlined in my comment above.

  1. Controllers are collected via controllers.clear(), with the addition of primaryController.reset().
  2. Namespaces are collected in the volumes map, which is now emptied with the addition of volumes.clear().
  3. git grep turns up no mention of support for Endurance Groups, so I assume this is taken care of by its absence.
  4. The drive member is not reset in the markFunctional(false) path, maintaining its existence on D-Bus. Health status is covered by the existing dataProcessor error handling.

Are you happy to take that patch? If so I can push that to one or both of the dev/nvme-redfish or dev/nvme-redfish-fixes branches.

drakedog2008 commented 11 months ago

I think the bigger problem is that the async cleanup task holds a bare controller id instead of the controller instance (e.g.). So when the callback handler is triggered, the controller instance may already have been released.

Think about the following scene:

  1. The subsystem starts createVolume.
  2. createNamespace finishes on the NVMeMi worker thread.
  3. finished_cb is scheduled and sits in the immediate_exec queue.
  4. mi_health_poll fails and the PF controller is disabled and destroyed.
  5. The volume list is also cleared by markFunctional(false).
  6. finished_cb with createVolumeFinished() moves into the exec queue.
  7. An additional volume remains on a disabled subsystem.

A possible fix is posted in the comment: https://github.com/openbmc/dbus-sensors/commit/92ec46692152250aa91435a531d1695517fd4b31#commitcomment-129274827

amboar commented 11 months ago

Okay, let me continue to improve my familiarity with the code so I can derive that insight for myself. Thanks for identifying the concerns.

amboar commented 10 months ago

Firstly, apologies that I have taken some time to reply. I've become more familiar with the existing code and believe I understand your concerns above @drakedog2008. I've put together this series and am working to verify it on the hardware I have at hand. It would be great if you could take a look at the patches and provide feedback.

https://github.com/amboar/dbus-sensors/compare/07415ea918be817deb6f36be3c98e2929940b815...mctp-mi

drakedog2008 commented 9 months ago

I was able to make the unit test pass.

First, this CL fixed the worker IO issue.

Now the 3rd test, TestDriveAbsent, throws a FileExists exception instead of hanging there. That is good, since it tells us something is not being reclaimed.

jianghao@jianghao:/workspaces/dbus-sensors$ git cherry  a69263a0a7998673d24d469206e79111017d0564 9b6bc12e115b8d6cffcef4ac36541621bd629d0c  -v 
+ d8d79bcb4b969dd8748d91a19a31b8058eaca9da Revert "nvme: Simplify Worker::post"
+ 271d2ddb4c9c804e17a8cac2a9496011bc911a81 nvmed: fix the IO worker issue
- 534c54fd010f03b4115cd43aba0ef0124fb351b6 nvme: Add getPrimaryController() helper
- 471bb259755a2294529b8b2a535c6822358041c6 NVMeSubsys: Uphold safety properties in markFunctional()
- be8caeb27e4d13fa51c96fd2fac0d27060a4e8ce NVMeController: Expose disable() for desstruction
- 205f141f363e4bf0ccb07123121ffccc50a266b9 NVMeProgress: Expose abort()
- 0f87fd006da201d44132c457d9109f7d392cf23d NVMeMiIntf: Expose flushOperations()
- 40872f9bda131edcdec1c3706fef546870c7a4ed NVMeSubsys: Clean-up invalidated controllers, volumes and associations
- ca722e0b402670c4c302ea359245fb6e7791ec63 nvmesensors: add print info for unit test
- 9b6bc12e115b8d6cffcef4ac36541621bd629d0c nvmesensor: setup the environment for unit-test
drakedog2008 commented 9 months ago

W/o the change:


jianghao@jianghao:/workspaces/dbus-sensors$ git log -1 --oneline
a69263a (HEAD -> tmp) nvmesensor: setup the environment for unit-test
jianghao@jianghao:/workspaces/dbus-sensors$ ./builddir/tests/test_nvme 
[==========] Running 3 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 3 tests from NVMeTest
...
[       OK ] NVMeTest.TestDriveFunctional (6007 ms)
[ RUN      ] NVMeTest.TestDriveAbsent
NVMeMiFake constructor
NVMeMiFake worker thread started: 0
NVMeMiFake destroyer
job done
job done
^C
drakedog2008 commented 9 months ago

w/ the changes in comment#


jianghao@jianghao:/workspaces/dbus-sensors$ git log -1 --oneline
9b6bc12 (HEAD -> unit-test) nvmesensor: setup the environment for unit-test
jianghao@jianghao:/workspaces/dbus-sensors$ ./builddir/tests/test_nvme 
[==========] Running 3 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 3 tests from NVMeTest
...
[       OK ] NVMeTest.TestDriveFunctional (6009 ms)
[ RUN      ] NVMeTest.TestDriveAbsent
NVMeMiFake constructor
NVMeMiFake worker thread started: 0
NVMeMiFake destroyer
job done
job done
NVMeMi worker this line should not be reached
unknown file: Failure
C++ exception with description "sd_bus_add_object_vtable: org.freedesktop.DBus.Error.FileExists: File exists" thrown in the test fixture's constructor.
[  FAILED  ] NVMeTest.TestDriveAbsent (100 ms)
[----------] 3 tests from NVMeTest (9120 ms total)

[----------] Global test environment tear-down
[==========] 3 tests from 1 test suite ran. (9126 ms total)
[  PASSED  ] 2 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] NVMeTest.TestDriveAbsent
amboar commented 9 months ago

My testing suggested that if you invoke ./builddir/tests/test_nvme with --gtest_filter=NVMeTest.TestDriveAbsent, it should pass with my patches applied (and not pass without them). Given that the option causes the test to be run on its own, this indicates a problem with the test fixture. I made a few attempts at addressing it but didn't find anything that immediately worked.

drakedog2008 commented 9 months ago

Second, we need this patch:

https://gbmc-review.googlesource.com/c/dbus-sensors/+/15014/1

The reason is detailed in the commit message.

drakedog2008 commented 9 months ago

The NS-related changes definitely made resource destruction more asynchronous than before (i.e. pushing more async tasks for destruction after the original io.post()).

That is why the unit test failed after these NS changes.

drakedog2008 commented 9 months ago

The unit test is quite happy now:

ninja: Entering directory `/workspaces/dbus-sensors/builddir'
[8/8] Linking target tests/test_nvme
1/1 test_nvme        OK             17.19s

Ok:                 1   
Expected Fail:      0   
Fail:               0   
Unexpected Pass:    0   
Skipped:            0   
Timeout:            0   

Full log written to /workspaces/dbus-sensors/builddir/meson-logs/testlog.txt

With a small tweak of course (will fix):

--- a/src/NVMeSubsys.hpp
+++ b/src/NVMeSubsys.hpp
@@ -174,7 +174,7 @@ class NVMeSubsystem :

     // a counter to skip health poll when NVMe subsystem becomes Unavailable
     unsigned UnavailableCount = 0;
-    static constexpr unsigned UnavailableMaxCount = 60;
+    static constexpr unsigned UnavailableMaxCount = 1;

     // process Secondary controller and start controllers and the associated
     // Plugin
drakedog2008 commented 9 months ago

My testing suggested that if you invoke ./builddir/tests/test_nvme with --gtest_filter=NVMeTest.TestDriveAbsent, it should pass with my patches applied (and not pass without them). Given that the option causes the test to be run on its own, this indicates a problem with the test fixture. I made a few attempts at addressing it but didn't find anything that immediately worked.

W/o the fix, the unit test still complains about a mock leak when running a single test case:

jianghao@jianghao:/workspaces/dbus-sensors$ MALLOC_PERTURB_=40 LD_LIBRARY_PATH=/workspaces/dbus-sensors/builddir/subprojects/pdi:/workspaces/dbus-sensors/builddir/tests:/usr/local/lib /workspaces/dbus-sensors/builddir/tests/test_nvme --gtest_filter=NVMeTest.TestDriveAbsent
Note: Google Test filter = NVMeTest.TestDriveAbsent
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from NVMeTest

...
[  PASSED  ] 1 test.

../tests/test_nvme_mi.cpp:24: ERROR: this mock object (used in test NVMeTest.TestDriveAbsent) should be deleted but never is. Its address is @0x557a584f3d70.
ERROR: 1 leaked mock object found at program exit. Expectations on a mock object are verified when the object is destructed. Leaking a mock means that its expectations aren't verified, which is usually a test bug. If you really intend to leak a mock, you can suppress this error using testing::Mock::AllowLeak(mock_object), or you may use a fake or stub instead of a mock.
drakedog2008 commented 9 months ago

In a normal environment beyond the unit test, the asynchronous nature of the destruction should not be a problem. The recreation should be triggered by either: 1) a successful health poll, which has a one-second polling interval; or 2) an E-M re-enumeration, which is handled by this CL: https://gbmc-review.googlesource.com/c/dbus-sensors/+/12146

drakedog2008 commented 8 months ago

With @amboar's fixes^1 together with these two changes^2, all unit tests passed, including subsystem recreation.

Further improvements will be tracked by the new issue.

amboar commented 8 months ago

@drakedog2008 I think collectively that addresses #2 as well?

drakedog2008 commented 8 months ago

I think they are different. #2 is trying to remove the dependency on O-M and use a D-Bus matcher as the observer instead.

The main concern is that the current unit test takes ~20 seconds, which is really close to the default Meson timeout for unit tests.
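If the ~20 s runtime is the concern, note that Meson's default per-test timeout is 30 seconds and can be raised per test. A sketch, assuming the test executable target is named test_nvme_exe in meson.build:

```meson
# Raise the timeout for the slow NVMe test above Meson's 30 s default.
test('test_nvme', test_nvme_exe, timeout: 120)
```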