Per the thermal design, when only one fan rotor fails, the inlet sensor PWM should be calculated from "failUpTable" or "failDownTable". However, when one fan rotor is set to failed (not present), the code always uses "normalUpTable" and "normalDownTable" instead. This happens because "numFanFailed_" is always 0 during the sensor PWM calculation. To fix this, I moved the "STEP 1" and "STEP 2" code sections to after "STEP 3", so the failed-fan count is up to date before the sensor PWM is calculated. When the number of failed fan rotors is >= 2, fan-fail boost mode is enabled. Accordingly, I changed the value of "pwmBoostOnNumDeadFan" in fan_service.json to 2 for testing.
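The ordering problem can be illustrated with a minimal sketch. This is not the fan_service code: `FanReading`, `Controller`, `processFans`, `selectTable`, `fanFailBoost`, `tableSeenBySensorPass` and `boostFor` are hypothetical names, and the table selection rule is simplified to the single-failure case described above. It only shows that a sensor pass which runs before the fan pass sees a failure count of 0.

```cpp
#include <cassert>
#include <iostream>
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration; not the real fan_service types.
struct FanReading {
  std::string name;
  bool present;
};

struct Controller {
  int numFanFailed = 0;          // plays the role of numFanFailed_
  int pwmBoostOnNumDeadFan = 2;  // test value used in fan_service.json

  // Fan pass: recount the failed (absent) rotors.
  void processFans(const std::vector<FanReading>& fans) {
    numFanFailed = 0;
    for (const auto& fan : fans) {
      if (!fan.present) {
        ++numFanFailed;
      }
    }
  }

  // Sensor pass: choose a table from the current failure count.
  // Simplified assumption: any failure selects the fail table.
  std::string selectTable() const {
    return numFanFailed > 0 ? "failUpTable" : "normalUpTable";
  }

  // Boost applies once the failure count reaches the configured threshold.
  bool fanFailBoost() const {
    return pwmBoostOnNumDeadFan > 0 && numFanFailed >= pwmBoostOnNumDeadFan;
  }
};

// Returns the table the sensor pass would use for a given pass order.
std::string tableSeenBySensorPass(const std::vector<FanReading>& fans,
                                  bool countFansFirst) {
  Controller c;
  if (countFansFirst) {
    c.processFans(fans);  // new order: the count is known before selection
  }
  std::string table = c.selectTable();
  if (!countFansFirst) {
    c.processFans(fans);  // old order: the count arrives too late
  }
  return table;
}

// Returns whether fan-fail boost would be enabled after the fan pass.
bool boostFor(const std::vector<FanReading>& fans) {
  Controller c;
  c.processFans(fans);
  return c.fanFailBoost();
}

// Small self-check that also prints both orderings.
void demo() {
  std::vector<FanReading> oneFailed = {{"FANTRAY1_FAN1", false},
                                       {"FANTRAY1_FAN2", true}};
  assert(tableSeenBySensorPass(oneFailed, false) == "normalUpTable");
  assert(tableSeenBySensorPass(oneFailed, true) == "failUpTable");
  std::cout << "old order: " << tableSeenBySensorPass(oneFailed, false) << "\n"
            << "new order: " << tableSeenBySensorPass(oneFailed, true) << "\n";
}
```

With one absent rotor, the old order selects the normal table and the new order selects the fail table. With the threshold set to 2, `boostFor` is false for one absent rotor and true for two.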
Test log
When one fan rotor fails:
I0912 22:38:40.053123 18101 ControlLogic.cpp:97] Successfully fetched sensor data.
E0912 22:38:40.053373 18101 Bsp.cpp:227] Failed to read optics data from Qsfp for qsfp_group_1, exception: apache::thrift::transport::TTransportException: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused): Connection refused
I0912 22:38:40.053384 18101 ControlLogic.cpp:107] Successfully fetched optics data.
I0912 22:38:40.053389 18101 ControlLogic.cpp:552] Processing Fans ...
I0912 22:38:40.053415 18101 ControlLogic.cpp:365] FANTRAY1_FAN1: is absent in the host
I0912 22:38:40.065823 18101 ControlLogic.cpp:166] FANTRAY1_FAN2: RPM read is 9830
I0912 22:38:40.065845 18101 ControlLogic.cpp:166] FANTRAY1_FAN3: RPM read is 7992
I0912 22:38:40.065864 18101 ControlLogic.cpp:166] FANTRAY1_FAN4: RPM read is 9830
I0912 22:38:40.065882 18101 ControlLogic.cpp:166] FANTRAY1_FAN5: RPM read is 7992
I0912 22:38:40.065902 18101 ControlLogic.cpp:166] FANTRAY1_FAN6: RPM read is 9830
I0912 22:38:40.065921 18101 ControlLogic.cpp:166] FANTRAY1_FAN7: RPM read is 7992
I0912 22:38:40.065940 18101 ControlLogic.cpp:166] FANTRAY1_FAN8: RPM read is 9637
I0912 22:38:40.078335 18101 ControlLogic.cpp:166] FANTRAY2_FAN1: RPM read is 7992
I0912 22:38:40.078355 18101 ControlLogic.cpp:166] FANTRAY2_FAN2: RPM read is 9830
I0912 22:38:40.078372 18101 ControlLogic.cpp:166] FANTRAY2_FAN3: RPM read is 7992
I0912 22:38:40.078388 18101 ControlLogic.cpp:166] FANTRAY2_FAN4: RPM read is 9830
I0912 22:38:40.078406 18101 ControlLogic.cpp:166] FANTRAY2_FAN5: RPM read is 7992
I0912 22:38:40.078425 18101 ControlLogic.cpp:166] FANTRAY2_FAN6: RPM read is 9830
I0912 22:38:40.078445 18101 ControlLogic.cpp:166] FANTRAY2_FAN7: RPM read is 7992
I0912 22:38:40.078464 18101 ControlLogic.cpp:166] FANTRAY2_FAN8: RPM read is 9637
I0912 22:38:40.078468 18101 ControlLogic.cpp:598] Processing Sensors ...
E0912 22:38:40.078474 18101 ControlLogic.cpp:241] SMB_U77_INLET_LEFT_BOT_LM75TEMP: Sensor read value (after scaling) is 29
I0912 22:38:40.078480 18101 ControlLogic.cpp:181] CLS deadFanExists: true
I0912 22:38:40.078485 18101 ControlLogic.cpp:192] **CLS failUpTable, numFanFailed: 1
I0912 22:38:40.078490 18101 ControlLogic.cpp:223] SMB_U77_INLET_LEFT_BOT_LM75_TEMP: Calculated PWM is 50**
E0912 22:38:40.078495 18101 ControlLogic.cpp:241] CPU_UNCORE_TEMP: Sensor read value (after scaling) is 45
V0912 22:38:40.078506 18101 PidLogic.cpp:40] Measurement: 45, Error: -45, Last PWM: 50, New PWM: 0
I0912 22:38:40.078510 18101 ControlLogic.cpp:223] CPU_UNCORE_TEMP: Calculated PWM is 0
I0912 22:38:40.078514 18101 ControlLogic.cpp:602] Processing Optics ...
I0912 22:38:40.078519 18101 ControlLogic.cpp:612] Boost mode enabled for optics update missing for 1726151920s
I0912 22:38:40.078525 18101 ControlLogic.cpp:481] zone1: Components: SMB_U77_INLET_LEFT_BOT_LM75_TEMP,CPU_UNCORE_TEMP,qsfp_group_1. Aggregation Type: ZONE_TYPE_MAX. Aggregate PWM is 75.
When 2 fan rotors fail:
I0912 22:42:04.775775 18214 ControlLogic.cpp:97] Successfully fetched sensor data.
E0912 22:42:04.776026 18214 Bsp.cpp:227] Failed to read optics data from Qsfp for qsfp_group_1, exception: apache::thrift::transport::TTransportException: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused): Connection refused
I0912 22:42:04.776037 18214 ControlLogic.cpp:107] Successfully fetched optics data.
I0912 22:42:04.776043 18214 ControlLogic.cpp:552] Processing Fans ...
I0912 22:42:04.776070 18214 ControlLogic.cpp:365] FANTRAY1_FAN1: is absent in the host
I0912 22:42:04.776084 18214 ControlLogic.cpp:365] FANTRAY1_FAN2: is absent in the host
I0912 22:42:04.776118 18214 ControlLogic.cpp:166] FANTRAY1_FAN3: RPM read is 7992
I0912 22:42:04.776141 18214 ControlLogic.cpp:166] FANTRAY1_FAN4: RPM read is 9830
I0912 22:42:04.776163 18214 ControlLogic.cpp:166] FANTRAY1_FAN5: RPM read is 7992
I0912 22:42:04.776185 18214 ControlLogic.cpp:166] FANTRAY1_FAN6: RPM read is 9637
I0912 22:42:04.776206 18214 ControlLogic.cpp:166] FANTRAY1_FAN7: RPM read is 7992
I0912 22:42:04.776227 18214 ControlLogic.cpp:166] FANTRAY1_FAN8: RPM read is 9637
I0912 22:42:04.776250 18214 ControlLogic.cpp:166] FANTRAY2_FAN1: RPM read is 7992
I0912 22:42:04.776271 18214 ControlLogic.cpp:166] FANTRAY2_FAN2: RPM read is 9637
I0912 22:42:04.776292 18214 ControlLogic.cpp:166] FANTRAY2_FAN3: RPM read is 7992
I0912 22:42:04.776318 18214 ControlLogic.cpp:166] FANTRAY2_FAN4: RPM read is 9830
I0912 22:42:04.776340 18214 ControlLogic.cpp:166] FANTRAY2_FAN5: RPM read is 7992
I0912 22:42:04.776363 18214 ControlLogic.cpp:166] FANTRAY2_FAN6: RPM read is 9637
I0912 22:42:04.776384 18214 ControlLogic.cpp:166] FANTRAY2_FAN7: RPM read is 7992
I0912 22:42:04.776405 18214 ControlLogic.cpp:166] FANTRAY2_FAN8: RPM read is 9637
I0912 22:42:04.776410 18214 ControlLogic.cpp:598] Processing Sensors ...
E0912 22:42:04.776417 18214 ControlLogic.cpp:241] SMB_U77_INLET_LEFT_BOT_LM75TEMP: Sensor read value (after scaling) is 28.75
I0912 22:42:04.776424 18214 ControlLogic.cpp:181] CLS deadFanExists: false
I0912 22:42:04.776429 18214 ControlLogic.cpp:186] **CLS normalUpTable, numFanFailed: 2
I0912 22:42:04.776434 18214 ControlLogic.cpp:223] SMB_U77_INLET_LEFT_BOT_LM75_TEMP: Calculated PWM is 45
E0912 22:42:04.776439 18214 ControlLogic.cpp:241] CPU_UNCORE_TEMP: Sensor read value (after scaling) is 49
V0912 22:42:04.776452 18214 PidLogic.cpp:40] Measurement: 49, Error: -49, Last PWM: 50, New PWM: 0
I0912 22:42:04.776456 18214 ControlLogic.cpp:223] CPU_UNCORE_TEMP: Calculated PWM is 0
I0912 22:42:04.776461 18214 ControlLogic.cpp:602] Processing Optics ...
I0912 22:42:04.776466 18214 ControlLogic.cpp:612] Boost mode enabled for optics update missing for 1726152124s
I0912 22:42:04.776470 18214 ControlLogic.cpp:619] Boost mode enabled for 2 fan failures**
I0912 22:42:04.776476 18214 ControlLogic.cpp:481] zone1: Components: SMB_U77_INLET_LEFT_BOT_LM75_TEMP,CPU_UNCORE_TEMP,qsfp_group_1. Aggregation Type: ZONE_TYPE_MAX. Aggregate PWM is 75.
Attached are the detailed logs and the test fan_service.json file: 1fanfail_testlog.txt, 2fanfail_testlog.txt, fan_service.json