chipsalliance / caliptra-sw

Caliptra software (ROM, FMC, runtime firmware), and libraries/tools needed to build and test

Intermittent test failure on FPGA: ECC verify-after-sign failure #1201

Open korran opened 6 months ago

korran commented 6 months ago

From https://github.com/chipsalliance/caliptra-sw/actions/runs/7202132618/attempts/1 (running on caliptra-fpga-svl-0)

        186 UART: Running Caliptra FMC ...
        186 UART: 
        186 UART: [state] CFI Enabled
<snip>
        355 UART: [alias rt] Derive Key Pair - Done
        355 UART: [alias rt] Signing Cert with AUTHO
        355 UART:             RITY.KEYID = 7
        470 UART: [FMC] CFI Panic code=0x01040055
        470 UART: Fatal Error: 0x01040055
        470 >>> mbox cmd response: failed
test test_image_validation::test_preamble_vendor_lms_optional_no_pubkey_revocation_check ... FAILED

failures:

failures:
    test_image_validation::test_preamble_vendor_lms_optional_no_pubkey_revocation_check

On a second attempt, the test passed.

The 0x01040055 error code is emitted here:

https://github.com/chipsalliance/caliptra-sw/blob/68bbf77bc0c4190177d46d20003db5e5e1551b46/cfi/lib/src/cfi.rs#L207-L214

Inferring from the logs, this looks to be called here:

https://github.com/chipsalliance/caliptra-sw/blob/68bbf77bc0c4190177d46d20003db5e5e1551b46/drivers/src/ecc384.rs#L382-L383

which is called from here:

https://github.com/chipsalliance/caliptra-sw/blob/68bbf77bc0c4190177d46d20003db5e5e1551b46/fmc/src/flow/crypto.rs#L156

which is called from here:

https://github.com/chipsalliance/caliptra-sw/blob/68bbf77bc0c4190177d46d20003db5e5e1551b46/fmc/src/flow/rt_alias.rs#L324

This failure comes from the verification check we run on every generated signature, which prevents glitched numerical state from being exposed to the outside world. The most likely cause is a numerical error inside the ECC peripheral. It could simply be a problem with the FPGA timing constraints, but I think we need more investigation and stress testing of the ECC driver and peripheral, and we should also look into intermittent problems with the FPGA test bench.
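For readers unfamiliar with the pattern, here is a minimal sketch of a verify-after-sign check. It is not the caliptra-sw driver code: `SigningEngine`, `ToyEngine`, `sign_checked`, and the XOR "signature" are hypothetical stand-ins invented for illustration, and the sketch returns an error where the firmware (per the log above) ends in a fatal error.

```rust
/// Hypothetical signature type, for illustration only.
#[derive(Clone, Debug, PartialEq)]
struct Signature(Vec<u8>);

#[derive(Debug, PartialEq)]
enum SignError {
    /// The freshly generated signature did not verify, so it is not released.
    VerifyAfterSignFailed,
}

/// Hypothetical abstraction over a signing engine (e.g. a hardware peripheral).
trait SigningEngine {
    fn sign(&mut self, digest: &[u8]) -> Signature;
    fn verify(&mut self, digest: &[u8], sig: &Signature) -> bool;
}

/// Sign, then verify the result before handing it to the caller.
/// A signature that fails verification never leaves this function.
fn sign_checked<E: SigningEngine>(
    engine: &mut E,
    digest: &[u8],
) -> Result<Signature, SignError> {
    let sig = engine.sign(digest);
    if engine.verify(digest, &sig) {
        Ok(sig)
    } else {
        Err(SignError::VerifyAfterSignFailed)
    }
}

/// Toy engine: the "signature" is the digest XOR 0x5a (NOT real cryptography).
/// When `glitch_next_sign` is set, one bit of the next signature is flipped
/// to imitate an intermittent hardware fault.
struct ToyEngine {
    glitch_next_sign: bool,
}

impl SigningEngine for ToyEngine {
    fn sign(&mut self, digest: &[u8]) -> Signature {
        let mut bytes: Vec<u8> = digest.iter().map(|b| b ^ 0x5a).collect();
        if self.glitch_next_sign {
            self.glitch_next_sign = false;
            if let Some(first) = bytes.first_mut() {
                *first ^= 0x01;
            }
        }
        Signature(bytes)
    }

    fn verify(&mut self, digest: &[u8], sig: &Signature) -> bool {
        let expected: Vec<u8> = digest.iter().map(|b| b ^ 0x5a).collect();
        sig.0 == expected
    }
}

fn main() {
    let mut engine = ToyEngine { glitch_next_sign: true };
    let digest = [0x11u8, 0x22, 0x33];
    // The first call is glitched and rejected; the second call succeeds,
    // which mirrors a test that fails once and passes on retry.
    println!("{:?}", sign_checked(&mut engine, &digest));
    println!("{:?}", sign_checked(&mut engine, &digest));
}
```

The point of the design is that a transient fault during signing turns into a visible failure rather than a bad signature escaping, which is why an intermittent peripheral or timing problem would show up as this error.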

Lines of investigation (check when addressed):

jlmahowa-amd commented 6 months ago

Pull-ups and pull-downs have been added, and I disabled JTAG in the default build options.

Attached is the timing report (report_timing_summary.txt) for release_v20231221_0, built with JTAG enabled and pull-ups removed to match the settings in use when the error was detected. It reports no timing violations. One area I intend to investigate further: the script specifies 20MHz as the target frequency for the PL clock, but because of the available clock divider values the timing report is run against 19.753MHz. I would like to verify the frequency the PL actually sees when the PLL is configured for 20000000 from the FPGA setup.

Currently I am running the failing test in a loop against the image built with JTAG enabled and pull-ups removed, to try to reproduce the failure.

jlmahowa-amd commented 6 months ago

I built an FPGA image to observe the PL clock. When 20MHz is requested, the measured frequency is dead on, but for other requested frequencies the clock could be off. Since Vivado runs its timing analysis against 19.753MHz, we should consider setting the requested frequency to 19700000. Based on the results from my other test, though, frequency doesn't seem to be a factor: either the KAT cannot reproduce the ECC failure, or the frequency is not a big factor. (Image attached.)

I ran the ECC KAT in a loop at 20MHz using the git version from the failure link above (9200f90c4166be11ff4a160890a1173f365fe061). The test was stopped by a network interruption after more than 2.4 million cycles. I then tried to reproduce the failure by increasing the PL clock until something failed. 115MHz is the lowest frequency at which I have observed a failure (a load access fault exception). At 107MHz (requested 110MHz) it can run for hundreds of thousands of iterations.