5. Implement TPM command parsing and communication between FPGA and MCU

Dasharo / twpm-docs

Trustworthy Platform Module (TwPM) documentation

https://twpm.dasharo.com

5 stars 0 forks source link

5. Implement TPM command parsing and communication between FPGA and MCU #20

Closed BeataZdunczyk closed 6 months ago

BeataZdunczyk commented 11 months ago

Minimal parsing of commands and responses (limited to just their sizes) must be done on FPGA side in order to properly set status bits that host can use to check whether TPM expects more bytes of command or has more bytes of response. Full command parsing and execution takes place on MCU, so FPGA has to implement and expose buffer with command sent by host, along with any required metadata like type of message in the buffer or currently active locality.

Milestones:

[x] implementation of FIFO on FPGA
[x] implementation of command machine state
[x] design, implementation and documentation of protocol of communication between FPGA and MCU
[x] application code for reading data from FIFO, passing parsed commands to the TPM stack and writing responses back to FIFO
[x] review and update existing documentation, add entry to changelog

arturkow2 commented 10 months ago

We have at least partially working communication between MCU and FPGA, however we are unable to fully test it due to non-working LPC controller (more on this below). MCU-side firmware is partially done, before finishing this, we must fix FPGA.

The problem with FPGA is a non-working LPC controller, this is caused by timing issues and too high delays. We tried to solve this problem by manually writing timing constraints which LPC controller must meet to work properly. However, we ran into issues with the toolchain we use. Initially, we were using Qorc SDK which contains official tools for FPGA synthesis. Later on, we tried using F4PGA.

The first problem is that we are unable to use any of pads designed to be used as clock input as FPGA does not receive any signal on those pads. This looks like a bug in SDK. Another problem is that we are not able to use timing constraints. Qorc SDK contains an old version of VPR (part of Verilog to Routing) which has a number of problems.

The first problem is that timings don't propagate properly through clock buffers if we use negative-edge trigger on that clock. Also, we are unable to constraint the clock buffer directly nor anything that isn't an input or output pin and if creating a constrained clock, the clock must use one of special clock pins (which don't work). Same is true for clocks coming Cortex-M4, such as Wishbone clock. Inability to constraint Wishbone clock, as well as to manually propagate clocks through buffers results in all connections being unconstrained:

Warning 105: 29 timing startpoints were not constrained during timing analysis
Warning 106: 353 timing endpoints were not constrained during timing analysis

Those problems seem to be partially related to

https://github.com/verilog-to-routing/vtr-verilog-to-routing/issues/1379 - prevents constraining everything what is not in input/output pin
https://github.com/verilog-to-routing/vtr-verilog-to-routing/issues/1013#issuecomment-545519248

Due to problems with Qorc SDK we tried using F4PGA (this is a continuation of Symbiflow, which was used in Qorc SDK), however we ran into similar problems. Clock input pads are still not working, what's more, newer Yosys detects clocks and moves them into PB-CLOCK forcing us to use dedicated clock pads (which still don't work). While the problem with timing constraints being discarded is gone (it is possible to constraint LPC clock without constraining Wishbone). Timings don't propagate to procedural blocks if we use negative-edge sensitivity, nor when we invert clock signal.

Due to these issues we are going to try Quicklogic proprietary tools. Those tools were used to synthesize usb2ser which works on 48 Mhz, this gives a better chance to get things working until problems with open-source tools are solved.

krystian-hebel commented 10 months ago

Tests performed by manually connecting relevant signals and flipping clock signal show that with lower frequencies (below 1 Hz, wires were physically moved around on breadboard) it i possible to read TPM registers. This reassured us that it is indeed problem with timing issues. However, this exposed another issue.

As part of the test, TPM_ACCESS register was read before any other operation, with the result of 8'b10100000, i.e. tpmRegValidSts and activeLocality bits were set. The value is set by the following code (source):

          `TPM_ACCESS: begin
            data_o <= {/* tpmRegValidSts */ 1'b1, /* Reserved */ 1'b0,
                     addrLocality === activeLocality ? 1'b1 : 1'b0,
                     beenSeized[addrLocality], /* Seize, write only */ 1'b0,
                     /* pendingRequest */ |(requestUse & ~(5'h01 << addrLocality)),
                     requestUse[addrLocality], tpmEstablishment};
          end

At this point expected register value is 8'b10000001, as no locality should be selected and tpmEstablishment is currently hardcoded to 1'b1. This discrepancy most likely comes from a fact that initial register values are not handled correctly by yosys, at least for this platform. Further tests will be needed to see if this applies to all initial values other than 0, or only those that aren't set by asynchronous resets.

arturkow2 commented 10 months ago

We tried to QuickLogic's proprietary tools - the latest version (and the only version that supports EOS-S3) is available on GitHub, that version uses yosys for synthesis (instead of Precision Synthesis) and SpDE for PnR. We saw some improvements, most notably the problem non-working CLOCK cells is gone and it is possible to use both positive and negative triggered procedural blocks and timings do propagate properly. However, other issues showed off, and we were unable to get LPC working at the target frequency. We could achieve working LPC only when most of the code was commented out - basically we left only LPC (without SERIRQ), removed Wishbone, RAM, and any registers (responding with a single constant value to any reads). Even then, reads were successful at approximately 50% rate.

SpDE has some bugs which result in wrong maximum frequency being reported, so we had to resort to manual analysis of propagation delays which turned to be complicated as SpDE does not report critical path. We were also unable to set most of constraints, such as placement constraints, max fanout, or even false path. Attempt to set those constraints resulted in SpDE crashes.

Addition of any modules, such as SERIRQ, or even those not related directly to LPC, such as Wishbone resulted in propagation delay increase between nets used by LPC controller. In effect the controller breaks completely even after minor changes. This suggests that the resources available in FPGA are not enough to build a functional design with LPC controller, and additional interfaces we need.

Due to these problems we resign from using QuickLogic EOS-S3 and we will look for another platform to use.

arturkow2 commented 9 months ago

We decided to use Lattice ECP5. For development purposes we will use either Radiona ULX3S (FPGA variant 25F or higher) or ORANGECRAB-R0D2-25 whichever is more easily available.

ECP5-25F contains near 128 KiB of BlockRAM which will be enough to implement FIFO for LPC controller and SRAM for softcpu. We have 3 choices for a softcpu implementation: NeoRV32, VexRiscV and PicoRV32. NeoRV32 is the most complete implementation and forms a full SoC, while other interfaces need to be provided from external sources, such a source can be LiteX.

NeoRV32 does have peripherals we need, namely:

TRNG - which is required as entropy source for TPM do be secure
SPI master - required for persistent storage
SPI slave

Unfortunately, NeoRV32 SPI master and slave are not supported by Zephyr. In case of LiteX we have LiteSPI which does have a basic support in Zephyr, however LiteSPI itself does not support operation as a slave.

NeoRV32 contains everything we need except for LPC controller which can be integrated through Wishbone. So, this would be probably the best CPU. Other CPUs could be usable if we decide to port TwPM to smaller FPGAs.

As for SPI implementation on EOS-S3 we could try to:

use NeoRV32 implementation of SDI
use SPI slave from LiteX (not LiteSPI, this part of main LiteX repo): https://github.com/enjoy-digital/litex/blob/master/litex/soc/cores/spi/spi_slave.py
create our own SPI implementation

arturkow2 commented 9 months ago

Currently we are working on SPI on implementation, SPI has a chance to work as it requires lower frequency and is simpler than LPC. For testing we are using LiteX. So far, I made a few bug fixes in F4PGA:

https://github.com/chipsalliance/f4pga/pull/666 Notably, the problem with non-working clock pads which was affecting both QORC-SDK and F4PGA is now fixed (see https://lists.chipsalliance.org/g/f4pga-wg/topic/quicklogic_eos_s3_support_in/101172445?p=,,,20,0,0,0::recentpostdate/sticky,,,20,2,0,101172445,previd%3D1694183937454173600,nextid%3D1654038489998356408&previd=1694183937454173600&nextid=1654038489998356408)

Additionally, a few fixes to LiteX were needed:

I got my SPI test code to build, however most of it is optimized out, I'm not sure yet whether I'm missing something or there is something else wrong. Hopefully, tomorrow I will solve remaining issues and get meaningful results. With that, we will know whether SPI can work on EOS-S3.

The test code is available here (litex branch, spi_litex subdirectory).

arturkow2 commented 9 months ago

Synthesis of very simple SPI code that responds with constant value (https://github.com/Dasharo/twpm-f4pga-tests/blob/f12b1790de2920376bac80822efe44fc3baef026/spi_litex/test_spi.py) gave the result:

{
  "cpd": 21.357,
  "fmax": 46.8232,
  "swns": -4.69029,
  "stns": -80.2244
}

This is the result for CPU clock domain (on which SPI depends). Results don't look promising, taking the fact that real circuit will be more complex. I couldn't constraint SPI clock port as VPR keeps claiming there is no such port/net:

stderr:
Error 1: 
Type: SDC file
File: /home/akowalski/projekty/twpm/eos_s3_test_spi/spi_litex/build/eos_spi/gateware/build/eos_spi.sdc
Line: 1
Message: Clock name or pattern 'spi_clk' does not correspond to any nets. To create a virtual clock, use the '-name' option.

So far I found out that this happens if we use clock as data input instead of as clock input for FFs. Litex SPI slave uses probes SPI clock on CPU domain clock edge: https://github.com/enjoy-digital/litex/blob/6ab156e2253b3a832203d726fdb04f069894adf8/litex/soc/cores/spi/spi_slave.py#L56-L62. We actually hit this bug before, when using negative trigger on clock (always @(negedge clk) was causing the same effect.

I tried to add gclkbuff manually to force VPR detecting this as clock input, which indeed worked and VPR properly detected SPI clock, however still it's not possible to constrain it. I'm going to prepare minimum repro of this issue and file a bug to VPR.

arturkow2 commented 9 months ago

I tried to finish LiteX-based SPI controller as well as creating custom SPI controller from scratch in Verilog after the first attempt failed. Unfortunately, despite better timings with Verilog controller, the target frequency was not achieved.

Latest version of LiteX controller is available here. It supports reading and writing of single register, the maximum frequency as reported by VPR was 30 MHz. However, SPI controller runs entirely in CPU clock domain, sampling SPI signals (including SPI clock) at positive edge of CPU clock, so we need to run at a frequency twice of SPI frequency.

I tried with custom SPI controller written in Verilog (available here). The controller follows TPM protocol, implements wait states and register writes. Register writes and some other mandatory features are not implemented as maximum frequency quickly dropped the more logic I added, resulting with frequency of 23 MHz at the minimum required to write single byte to register.

We decided to resign further work on EOS-S3 and continue on ECP5 platforms. The half-baked SPI controller I made can be used as basis for further work on SPI support, however, currently we are implementing LPC.

For further work we are using ORANGECRAB-R0D2-25 with LFE5U-25F, since OrangeCrab does not contain hard CPU we need to use a softcore CPU, our choice is NeoRV32.

In https://github.com/Dasharo/TwPM_toplevel/pull/9 I integrated NeoRV32 and TwPM LPC controller. The design synthesizes and if flashed now it should work up to what is implemented, that is, the softcore should get in BootROM, it should be possible to communicate with it through UART and LPC controller together with TPM register interface should work.

JTAG is enabled NeoRV32 but not connected to anything - ECP5 has internal JTAG which can be used for programming FPGA, the port cannot be "accessed directly" by assigning pads to toplevel port, but can be accessed through JTAGG primitive (see https://github.com/stnolting/neorv32/discussions/28#discussioncomment-6313328). This gives two advantages - allows to debug CPU and program FPGA using the same port and saves us some I/O pads. Currently, all generall purpose pads are used for LPC and UART, other pads such analog pads or SPI pads have some on-board components connected, this could possibly interfere with JTAG.

CPU configuration needs fine-tuning some parameters, such as SRAM size, I/D-cache size, and boot-source. NeoRV32 can support XIP from SPI flash.

Overall results with ECP5 were positive, basic design sythesizes and gives good timings:

Info: Max frequency for clock  '$glbnet$LCLK$TRELLIS_IO_IN': 37.41 MHz (PASS at 33.30 MHz)
Info: Max frequency for clock '$glbnet$clk_i$TRELLIS_IO_IN': 65.23 MHz (PASS at 48.00 MHz)
Info: Max frequency for clock             '$glbnet$RAM_CLK': 212.63 MHz (PASS at 12.00 MHz)

arturkow2 commented 8 months ago

We continued work on Neorv32 and ECP5 and we got a working soft core on Orangecrab platform. @krystian-hebel has done some basic tests of LPC controller (link), further tests will require TwPM firmware running on softcore. Currently, @krystian-hebel is working on on-board DRAM access from within SoC as we have problems with usage of internal BlockRAM, limiting CPU SRAM size to ~64 KiB, see this for details.

At this stage Neorv32 should be able to boot from SPI flash, however firmware has not been ported yet, and at least debug versions of firmware may not fit in 64 KiB.

krystian-hebel commented 8 months ago

DRAM initialization added in https://github.com/Dasharo/TwPM_toplevel/pull/11. Initialization is controlled mostly by software, required code was added to bootloader. Unfortunately, it can only start after DDR clocks are stable, which takes relatively long, and CPU is held in reset until that happens. However, LPC and TPM registers modules should work independently of CPU, so host should be able to start sending TPM commands before software TPM stack is available.

DMEM was completely disabled, now stack is located on DRAM. We can use saved BlockRAM for implementing cache, which should help significantly if we decide to execute code directly from flash.

arturkow2 commented 8 months ago

I started firmware porting to Orangecrab (https://github.com/Dasharo/twpm-firmware/pull/3), however the work has stalled as I had to focus on a DRAM problem I discovered during first attempt to boot firmware. The initial symptoms were that NeoRV32 bootloader was printing ERR_EXC (which means that firmware signature is wrong) and halting. I added some debug prints to see what's happening:

Awaiting neorv32_exe.bin... recv: 0x000000fe
recv: 0x000000ca
recv: 0x00000088
recv: 0x00000047
signature: 0x8847feca

ERR_EXE

The problem occurs in get_exe_word, however the code is correct. It turned out, DRAM does not handle properly transfers smaller than 4 bytes which results in bytes being shifted around. We tried few solutions to solve the problem, including using D-CACHE to workaround the problem (by making all reads larger than 4-bytes), however this caused another problem, manifesting itself as 24 MSB bits being correct and in correct order, but 8 LSB bits being random trash.

We've been testing LiteX and found out that RAM works properly there, @krystian-hebel has found this which should help us. OrangeCrabs swap LDQS/UDQS lines which cause invalid readings from RAM, we tried to workaround this by changing definitions .lpf which resulted in nextpnr error, but LiteX does seem to have working solution.

arturkow2 commented 8 months ago

After applying workaround mentioned in previous comment RAM seems to work properly. BootROM is working as expected and I can transmit firmware to the CPU over UART, however it does not boot currently (no output on UART). I'm currently debugging this problem, tried to increase Zephyr verbosity and adding debug prints in various places but it didn't work. I have exported Neorv32 JTAG to debug this issue, due to some problem with nextpnr I had to comment-out most components, removing LPC, TPM registers, and leaving only DRAM and CPU. Otherwise nextpnr was freezing.

DRAM fix is available here.

Recent work is located here

arturkow2 commented 8 months ago

I've been trying to get JTAG working, however it was unstable and I couldn't do anything useful with it. I started inspecting Zephyr's code and found out that DTS definitions for Neorv32 are wrong - zephyr's has support for Neorv32 V1.6.1, and memory map has changed since then. I fixed it in https://github.com/Dasharo/twpm-firmware/pull/3 however Zephyr is still not working properly, to see what's happening I added some test code to drive leds (requires changes from https://github.com/Dasharo/TwPM_toplevel/pull/14):

void led_timer_handler(struct k_timer *dummy)
{
    static bool r = false;
    volatile uint32_t *gpio = (volatile uint32_t*)0xfffffc08;
    *gpio = r ? 6 : 3;
    r ^= 1;
}

K_TIMER_DEFINE(led_timer, led_timer_handler, NULL);

void main(void)
{
    volatile uint32_t *gpio = (volatile uint32_t*)0xfffffc08;
    *gpio = 2;

    LOG_INF("Starting TwPM on %s", CONFIG_BOARD);

    k_timer_start(&led_timer, K_SECONDS(1), K_SECONDS(1));
    *gpio = 3;

    ....
}

Bits 0, 1, 2 control red, green, blue leds, respectively. After loading up Zephyr, red and green leds light up, so execution gets to main, then past LOG_INF and k_timer_start. Despite that there is no output from UART, neither Zephyr's greeting nor LOG output.

Blue LED never lights up so probably MTIMER interrupt is not arriving. I don't know what is the cause.

arturkow2 commented 7 months ago

Got Zephyr working:

*** Booting Zephyr OS build zephyr-v3.4.0-2-g71194e41ac04 ***
[00:00:00.005,000] <inf> main: Starting TwPM on orangecrab
[00:00:00.006,000] <wrn> nv: TwPM was built with CONFIG_TWPM_NV_EMULATE. Changes are NOT persistent!
[00:00:02.888,000] <inf> nv: NV commit
[00:00:02.889,000] <inf> init: TPM manufacture OK
[00:00:04.574,000] <inf> nv: NV commit
[00:00:04.576,000] <inf> test: TPM command result: {TPM_RC_SUCCESS}
[00:00:04.697,000] <inf> nv: NV commit
[00:00:04.708,000] <inf> test: HASH: 12f411d0eebfb9c4d81df9f1cb10e22e9841a91428ea7f00969fa7f29db0f7fa

Latest work is available in https://github.com/Dasharo/TwPM_toplevel/pull/14 and https://github.com/Dasharo/twpm-firmware/pull/3.

I've opened draft PR for latest neorv32 support in Zephyr, currently it isn't working and latest Zephyr does not boot properly.

Zephyr v3.4.0 with custom patches works properly as long as NeoTRNG is disabled.

arturkow2 commented 7 months ago

We are working on updating TwPM build environment and integrating https://github.com/Dasharo/twpm-firmware into https://github.com/Dasharo/TwPM_toplevel.

In https://github.com/Dasharo/TwPM_toplevel/pull/15 I added some bits of Zephyr SDK to Nix Flake and added ability to turn SDK in container - for reproducible builds and to make sure nothing depends on host. Development using standard Nix shell will still be possible.

There are some problems with yosys's ABC and build fails both in container and in pure mode:

ERROR: ABC: execution of command ""/nix/store/3jxsm2lwlvw7f64x3ha73hgz7h3m2kaf-abc-verifier-unstable-2023-09-13/bin/abc" -s -f /tmp/yosys-abc-vDMeIG/abc.script 2>&1" failed: return code -1.

If command is run manually from shell then it succeeds.

krystian-hebel commented 7 months ago

We got some progress on LPC interface, reading seems to work reliable on Protectli VP4670. Changes done to make it work:

state encoding changed to Gray (writing TBD, but few simple tests show that it works without it)
switched clock edges for LAD-related always blocks
dropped possibility to abort LPC cycle on SYNC (not sure if needed)
changed default for fsm_next_state to LPC_ST_FORCE_RESET to catch illegal transitions (not needed, but makes debugging easier)
set PULLMODE=UP for LCLK (probably not needed)
set PULLMODE=NONE for other LPC signals
set SLEWRATE=FAST DRIVE=8 for LAD and SERIRQ (interrupts not tested)
SERIRQ line physically disconnected, otherwise random errors during read (code not changed)

This is how test stand looked like. Notice very important pen refill for bending wires so they connect to OrangeCrab properly:

20231220_192134

Example of reading DID/VID register: 20231220_190641

Writing to TPM_ACCESS, which also shows that locality gets activated: 20231220_203308

Reading from other localities' address spaces returns all FFs, but this doesn't even generate any LPC traffic so I expect it is cut off by chipset.

Next step would be to finalize command retrieval and execution on TPM stack side, and after that we can start doing full TPM commands tests. Right now platform doesn't boot when TwPM is powered by USB cable, probably because it gets stuck trying to execute TPM commands.

Oh, one more observation: with our current pinout, LRESET gets pulled low when OrangeCrab runs the programmer. I think this is a feature, because you shouldn't be able to update TPM firmware when it is online, and you shouldn't be able to reset TPM without reseting whole platform.

macpijan commented 7 months ago

nd after that we can start doing full TPM commands tests.

Can't wait to see that, good job :+1:

krystian-hebel commented 6 months ago

Setup like the above didn't work initially after connecting TwPM through proper goldpins instead of ad-hoc "lets-hope-it-connects" approach. This was caused by double ground connection - one through LPC connector to the platform I was testing on (Protectli), and the other one through USB cable connected to USB hub and then my computer. I hadn't noticed any proper read with that setup, although they weren't all FFs either.

Most likely those two grounds created a loop, and this relatively high-frequency connection is susceptible to electromagnetic noise. Disconnecting the LPC ground made things better, after that I got <0.5% error rate. Supplying power through USB connected to Protectli made number of errors go to 0, across 500k reads of 4B register.

Unfortunately, connecting either logic analyzer or UART to the machine other than Protectli used for testing recreates that ground loop, and with that up to 1% error rate (seems to be higher with UART than the analyzer, but maybe that is just the different cabling). This makes debugging and development harder, but shouldn't be a problem in the final solution.

krystian-hebel commented 6 months ago

Current code can be found in https://github.com/Dasharo/TwPM_toplevel/pull/22 and submodules pointed to by it. Communication between host and TPM stack mostly works, including command execution and sending the response back to PC, unfortunately "mostly" isn't enough.

I'm using UART connected to OrangeCrab without ground and it is surprisingly reliable. There are very few random bytes sent every now and then, but I've seen setups where "properly" connected UART was worse than this. Today I used this to get some output from execution of proper TPM2 functions and I haven't noticed any random errors, however there is one nibble in one command that seems to be always wrong. Here's part of the log, as produced by current code when booting Ubuntu 22.04.1:

[00:00:59.899,000] <inf> main: IRQ: op_type = 1, cmd_size = 16, locality = 0                                                                
[00:00:59.900,000] <dbg> main: tpm_thread_entry: TPM command:                                                                               
                               80 01 00 00 00 16 00 00  01 7a 00 00 00 05 00 00 |........ .z......                                          
                               00 00 00 00 00 01                                |......                                                     
[00:00:59.902,000] <inf> main: Executing TPM_CC_GetCapability()                                                                             
[00:00:59.906,000] <dbg> main: tpm_thread_entry: TPM response:                                                                              
                               80 01 00 00 00 2b 00 00  00 00 00 00 00 00 05 00 |.....+.. ........                                          
                               00 00 04 00 04 03 ff ff  ff 00 0b 03 ff ff ff 00 |........ ........                                          
                               0c 03 ff ff ff 00 0d 03  ff ff ff                |........ ...                                               
[00:00:59.907,000] <inf> main: TPM command result: {TPM_RC_SUCCESS}                                                                         
[00:00:59.910,000] <inf> main: IRQ: op_type = 1, cmd_size = 14, locality = 0                                                                
[00:00:59.911,000] <dbg> main: tpm_thread_entry: TPM command:                                                                               
                               80 01 00 00 00 14 00 00  01 7e 00 00 00 01 00 fb |........ .~......                                          
                               03 01 00 00                                      |....                                                       
[00:00:59.913,000] <inf> main: Executing TPM_CC_Hash()                                                                                      
[00:00:59.915,000] <dbg> main: tpm_thread_entry: TPM response:                                                                              
                               80 01 00 00 00 0a 00 00  01 c3                   |........ ..                                                
[00:00:59.917,000] <inf> main: TPM command result: {RC_FMT1 | TPM_RC_P | TPM_RC_1 | TPM_RC_HASH}                                            
[00:00:59.918,000] <inf> main: IRQ: op_type = 1, cmd_size = 14, locality = 0                                                                
[00:00:59.919,000] <dbg> main: tpm_thread_entry: TPM command:                                                                               
                               80 01 00 00 00 14 00 00  01 7e 00 00 00 01 00 fd |........ .~......                                          
                               03 01 00 00                                      |....                                                       
[00:00:59.921,000] <inf> main: Executing TPM_CC_Hash()                                                                                      
[00:00:59.923,000] <dbg> main: tpm_thread_entry: TPM response:                                                                              
                               80 01 00 00 00 0a 00 00  01 c3                   |........ ..                                                
[00:00:59.925,000] <inf> main: TPM command result: {RC_FMT1 | TPM_RC_P | TPM_RC_1 | TPM_RC_HASH}

First command gets supported hash algorithms and their PCRs out of TPM. It properly returns info that there are 4 algorithms, along with their registered IDs (00 04 - SHA1, 00 0b - SHA256, 00 0c - SHA384 and 00 0d - SHA512). This is followed by two calls to TPM2_PCR_Read() (note that it is incorrectly logged as TPM2_Hash() due to a typo that I haven't fixed yet), but this time the algorithm ID asked for is 00 fb and 00 fd. Across 5 boot attempts (albeit only 3 of those with full command/response logging), the problem always happens in this exact spot. Over 40 various TPM2 commands were executed without any issue[^1] by firmware, GRUB and earlier Linux code before these.

Unfortunately, this is part of kernel probing for TPM existence, any failure results in /dev/tpm0 not being crated. Without that, tpm2-tools won't work. It is still possible to manually craft and ram those commands through TPM MMIO using e.g. busybox devmem. On the flip side, by doing so raw bytes can be observed instead of results parsed by tpm2-tools without worrying about unreasonably large response sizes. Tomorrow I'll try to send those three exact commands to see if the problem is in sending the response or following command. There is a possibility that it can't be reproduced that way, but at least it would point to something time-related.

[^1]: except that TPM2_SelfTest() is hardcoded to always succeed immediately. This one function takes 11.5 minutes to complete, and it seem to violate TCG PC Client Platform TPM Profile Specification for TPM 2.0 by not running tests in the background, but this is problem for later.

krystian-hebel commented 6 months ago

I got few more notes on that subject:

PC receives broken data, it is valid for 00 04 and 00 0c, but not for 00 0b and 00 0d. This explains why the kernel tries to read just those two hashes - it does it to get length for IDs it doesn't recognize.
I played a bit with TPM2_GetRandom - it allows to easily control the size of result in 0x0c-0x4c range (up to 0x40 bytes of random data plus header), and the response includes also size of random data so 0x00-0x0b can also be checked.
LPC sends data through 4 lines, starting with the lower nibble. During read, before data there is 0x0 sent (SYNC cycle), and after that 0xF (TAR, turn-around).
Results:
- 0x00-0x06, 0x08-0x0a, 0x0c, 0x10-0x1d, 0x20-0x2e, 0x30-0x3e, 0x40-0x4c don't get changed
- 0x07 becomes 0xf7 (XOR between nibbles: 0x7, 3 bits change 1->0)
- 0x0b becomes 0xfb (XOR: 0xb, 3 bits 1->0)
- 0x0d becomes 0xfd (XOR: 0xd, 3 bits 1->0)
- 0x0e becomes 0x6e, 0xee or 0xfe (XOR: 0xe, 3 bits 1->0)
- 0x0f becomes 0xff (XOR: 0xf, 4 bits 1->0)
- 0x1e becomes 0xfe (XOR: 0xf, 3 bits 1->0, 1 bit 0->1)
- 0x1f becomes 0xff (XOR: 0xe, 3 bits 1->0)
- 0x2f becomes 0xff (XOR: 0xd, 3 bits 1->0)
- 0x3f becomes 0xff (XOR: 0xc, 2 bits 1->0)
Occasionally I got all 0xff in the response, both when command reported completion or not (I don't depend on dataAvail bit, just an ordinary 2s sleep, to not get stuck in case of errors elsewhere). This is expected only if the command hadn't finished yet.
Wires between platform and TwPM (~10 cm) are not shielded.
- LCLK is right next to LAD[0]. Ground is supplied through separate USB cable, a long way from LCLK.
- Putting TwPM power supply cable parallel to LCLK seems to make 0x3f work (usually), but it also breaks 0x4b.
- I didn't repeat all tests for this setup, just this one.
Except for 0x0e, results are suspiciously predictable.
Except for 0x3f, value changes when majority of bits change from high to low between nibbles. However, there are many cases when data on LAD lines goes from 0x0 to 0xf or the other way around (cycle type, SYNC, TAR), and this doesn't seem to cause problems. This seems to happen only when TwPM drives those signals, so maybe drive strength is wrong?

krystian-hebel commented 6 months ago

With TPM_Random() crafted to return consecutive byte values instead of random I managed to test all bytes. After setting drive strength to 4 only two bytes (0x0f and 0x8f) were changed. After that I got rid of breadboard, connecting TwPM directly through jumper wires, I also moved ground to in between LCLK and LAD lines to minimize electromagnetic noise produced by clock. This seems to be enough, now every byte is transmitted correctly, both to and from TwPM. With TPM2_SelfTest() still hardcoded to return success, I could finally boot with TPM detected and tpm2-tools working reliably.

Despite that, most of longer commands time out. Based on the serial output from TwPM, TPM2_CreatePrimary() used to take about an hour, but tpm2_createprimary has a timeout of 5 minutes. With instruction cache enabled on NEORV32 this got down to ~15 minutes, with FAST_SHIFT_EN - ~10 minutes. FAST_MUL_EN didn't change it too much, but it made the timings fail (50.33 MHz instead of 50.40 MHz, in tests it seemed to be working regardless, but I wouldn't count on it).

After (re-)enabling DMEM execution time went down to 5 minutes 15 seconds, barely not enough. On this hardware we can get up to 64KB of DMEM, which is not enough to cover all required data and instructions. By moving data below the code (which is already cached anyway) in linker script and stripping every possible buffer (and some impossible ones just for testing, like breaking failure mode by returning pointer to data on stack instead of static buffer) to fit as much as possible in those 64KB, I managed to get tpm2_createprimary finish in about 4-4.5 minutes. Still, following tpm2_create times out, it would need about 7-8 minutes.

Unfortunately, there is an issue with data cache. Enabling it with more than one block (DCACHE_NUM_BLOCKS) results in Zephyr not booting, regardless of size of that block. Adding fences (they were removed in this commit, no explanation why) doesn't seem to change anything. Interestingly enough, bootloader works fine even with >1 blocks. Perhaps there is something different in the way Zephyr accesses data (sizes, alignment, order)? I added commands to dump memory from the bootloader after a warm reboot and I'm pretty confident that Zephyr starts executing - BSS gets loaded as all 0xFF but I can see it is already cleared by Zephyr, with some data on the stack. I haven't yet checked to which point in the code it gets, but this looks like a logical next step. Interestingly, there are no fence nor fence.i instructions produced by Zephyr, even though AFAICT all required Kconfig options are set.

Using one block allows for booting, but it actually makes it slower - every time data is needed from outside cached region, a whole block is trashed and fetched, even if only one byte is acted upon. This is clearly visible by setting big block size (4KB or more) and watching as serial output characters are printed one after another.

BeataZdunczyk commented 6 months ago

I am closing this issue as we have tested changes made so far and published the test results in https://github.com/Dasharo/twpm-docs/issues/21.