With extensive testing it is found that the qemu-zcu102 configuration still has a 1% boot failure rate.
This is similar to the issue in v2022.12 that was worked around by using PREMPT_NONE and many other hacks. The time for such hacks is over so I did not repeat them in v2024.05.
The failure appears to be an RCU timeout while doing some sort of power domain operation. See test-logs/2024-07-23-ec2 in this repo for multiple logs and test run documentation.
The ZCU102 HW platform should be tested for 500 cycles to ensure this is not a common issue with v6.6 kernel and v2024.1 boot firmware. If the HW does not have the problem, then the qemu-zcu102 firmware upgrade (Issue #4 ) may solve the issue. Otherwise AMD will need to investigate.
With extensive testing it is found that the qemu-zcu102 configuration still has a 1% boot failure rate.
This is similar to the issue in v2022.12 that was worked around by using PREMPT_NONE and many other hacks. The time for such hacks is over so I did not repeat them in v2024.05.
The failure appears to be an RCU timeout while doing some sort of power domain operation. See test-logs/2024-07-23-ec2 in this repo for multiple logs and test run documentation.
The ZCU102 HW platform should be tested for 500 cycles to ensure this is not a common issue with v6.6 kernel and v2024.1 boot firmware. If the HW does not have the problem, then the qemu-zcu102 firmware upgrade (Issue #4 ) may solve the issue. Otherwise AMD will need to investigate.