Open SalDaniele opened 1 year ago
We also see this behavior with the current master_devel branch of the upcoming 4.26 release. Do you plan to fix this in 4.26?
We use mstflint together with a mainline kernel 6.5 at the moment.
Update: with kernel 6.6 and a more recent rdma-core version we were able to trigger a reset successfully from BF2, but we ran into the 60s timeout. dmesg showed that the reset worked.
Had a conversation with the owner from our side. Direction was: please use mstflint-4.28 (just released) and the latest available driver to flash the latest published firmware. Please also query the device with "mstfwreset" (mstfwreset -d DEVICE q). It will list "sync"-capabilities for you.
Something like:
mstfwreset -d 81:00.0 q
<some output omitted>
Reset-sync (relevant only for reset-level 3):
0: Tool is the owner -Not supported
1: Driver is the owner -Supported (default)
For "sync 0", the tool is the owner of the reset flow and the reset command should be issued from both the host and the ARM side. For "sync 1", the driver is the owner.
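To check the sync capabilities from a script rather than by eye, the Reset-sync section of the query output can be parsed. The following is a minimal sketch, assuming the output has the line layout quoted above; the function name `parse_reset_sync` and the regular expression are illustrative and not part of mstflint, and other mstflint versions may format the output differently.

```python
import re

# Sample copied from the query output quoted in this thread.
# Real output may differ between mstflint versions (assumption).
SAMPLE = """\
Reset-sync (relevant only for reset-level 3):
0: Tool is the owner -Not supported
1: Driver is the owner -Supported (default)
"""

# One entry line: "<id>: <name> -Supported|-Not supported [(default)]"
ENTRY = re.compile(
    r"^\s*(\d+):\s*(.*?)\s*-(Not [Ss]upported|Supported)(\s*\(default\))?\s*$"
)


def parse_reset_sync(text):
    """Return {sync_id: {"name", "supported", "default"}} for the Reset-sync section."""
    result = {}
    in_section = False
    for line in text.splitlines():
        if line.startswith("Reset-sync"):
            in_section = True
            continue
        if not in_section:
            continue
        match = ENTRY.match(line)
        if match:
            result[int(match.group(1))] = {
                "name": match.group(2),
                "supported": match.group(3) == "Supported",
                "default": match.group(4) is not None,
            }
        elif line.strip():
            break  # a non-empty, non-entry line starts the next section
    return result


if __name__ == "__main__":
    for sync_id, info in parse_reset_sync(SAMPLE).items():
        print(sync_id, info)
```

On the sample above this reports sync 0 as unsupported and sync 1 as supported and default, which matches the case where the driver has to own the reset flow.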
Hi @ogalbxela ,
We also hit the same error, "Synchronization by driver is not supported in the current state of this device.", during mstfwreset. The device we are using is shown below:
~]$ sudo lshw -class network -businfo | grep BlueField-2
pci@0000:17:00.0 ens2f0np0 network MT42822 BlueField-2 integrated ConnectX-6 Dx network controller
pci@0000:17:00.1 ens2f1np1 network MT42822 BlueField-2 integrated ConnectX-6 Dx network controller
As suggested in the thread above, we updated the firmware to the latest version (24.42.1000) and also installed the latest mstflint, v4.29, but we still hit the issue during mstfwreset (v4.28 gives the same error as v4.29).
~]$ ethtool -i ens2f0np0
driver: mlx5_core
version: 5.14.0-427.42.1.el9_4.x86_64
firmware-version: 24.42.1000 (MT_0000000765)
expansion-rom-version:
bus-info: 0000:17:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: yes
[cloud-admin@compute-1 ~]$
~]$ sudo mstflint --version
mstflint, mstflint 4.29.0, Git SHA Hash: 37af981
~]$ sudo mstfwreset -d 17:00.0 q
Reset-levels:
0: Driver, PCI link, network link will remain up ("live-Patch") -Not Supported
1: Only ARM side will not remain up ("Immediate reset"). -Not Supported
3: Driver restart and PCI reset -Supported (default)
4: Warm Reboot -Supported
Reset-types (relevant only for reset-levels 1,3,4):
0: Full chip reset -Supported (default)
1: Phy-less reset (keep network port active during reset) -Not Supported
2: NIC only reset (for SoC devices) -Not Supported
3: ARM only reset -Not Supported
4: ARM OS shut down -Not Supported
Reset-sync (relevant only for reset-level 3):
0: Tool is the owner -Not supported
1: Driver is the owner -Supported (default)
Reset-reason: Warm reset
Timestamp (number of clock cycles) since last cold reset: 1308350112
Note that Reset-sync (sync 1) is shown as supported, but the reset still does not work:
~]$ sudo mstfwreset --device 0000:17:00.0 --level 3 -y r
The reset level for device, 0000:17:00.0 is:
3: Driver restart and PCI reset
Please be aware that resetting the Bluefield may take several minutes. Exiting the process in the middle of the waiting period will not halt the reset. The ARM side will be restarted, and it will be unavailable for a while.
Continue with reset?[y/N] y
-I- Sending Reset Command To Fw -
-E- The BF reset flow encountered a failure due to a reset state error of negotiation dis-acknowledgment.
[cloud-admin@compute-1 ~]$
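When collecting failures like the one above from several hosts, it can help to wrap the reset call and pull out only the error messages. The sketch below assumes error messages are marked with "-E-" as in the output quoted in this thread, and uses only the flags that appear there (`--device`, `--level`, `-y`, `r`). The helper names are hypothetical, and whether the tool writes errors to stdout or stderr is not confirmed here, so both streams are scanned.

```python
import re
import subprocess

# Sample adapted from the failing run quoted in this thread.
SAMPLE_OUTPUT = """\
Continue with reset?[y/N] y
-I- Sending Reset Command To Fw -
-E- The BF reset flow encountered a failure due to a reset state error of negotiation dis-acknowledgment.
"""


def build_reset_command(device, level=3):
    """Build the argument list used in the thread: mstfwreset --device D --level N -y r."""
    return ["mstfwreset", "--device", device, "--level", str(level), "-y", "r"]


def extract_errors(output):
    """Return the text of every message marked with '-E-' (assumed error marker)."""
    return [msg.strip() for msg in re.findall(r"-E- (.*)", output)]


def run_reset(device, level=3):
    """Run the reset and return (exit code, list of error messages).

    Requires mstfwreset to be installed and sufficient privileges.
    """
    proc = subprocess.run(
        build_reset_command(device, level),
        capture_output=True,
        text=True,
    )
    # Scan both streams, since the stream used for errors is an unknown here.
    return proc.returncode, extract_errors(proc.stdout + proc.stderr)


if __name__ == "__main__":
    print(build_reset_command("0000:17:00.0"))
    print(extract_errors(SAMPLE_OUTPUT))
```

Only `build_reset_command` and `extract_errors` run without hardware; `run_reset` triggers a real reset and should be called deliberately.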
Using mstflint compiled from source code:
On a Bluefield-2 w/ BMC
I am trying to update the fw to the latest version. After running mstflint -d -i <.bin> burn, this is the state of the BlueField:
mstfwreset fails with the following error:
If I disable sync and run this again, it hangs waiting for other hosts and then times out.
I can skip the FSM sync, but this results in the fw reset failing without a particular error message.
Note that I tried rebooting the host machine at this point; however, the fw update had not been applied after the reboot.
The only way I have found to apply the updated firmware is to switch the device to "NIC mode", after which fwreset is able to successfully apply the pending configurations, as well as switch to the updated fw version.