EC access is sometimes corrupted.

kiram9 commented 1 year ago

We were testing out the EC driver, and we think there may be a data consistency issue around access to the ec from your driver and other bios functions. We see if we do a lot of flash access from the OS, that sometimes the data is not what we expect to be in the flash. @JohnAZoidberg

I think that this may be due to ACPI access to the EC may be interleaved with this driver access. Looking at your driver it seems to be doing direct port io.

If you look at our EC interface in ACPI it will acquire a lock around any EC transactions. Some of these ACPI methods will be invoked to get temperature readings every second or so, for various drivers, for example the intel DTT driver will be reading temperature sensors. And we think these may corrupt access from this tool.

See for example EC0 Method M005 framework_dsdt.txt

DHowett commented 1 year ago

Thanks for this. Yeah, I have been worried about exactly this happening.

I've got a couple ideas and notes about what we could and sould do to work around it.

Solution 1: call `M001` et al (insufficient)

Proposal: perhaps the driver could call EC0.M001 through M005 to transact with the EC.

Unfortunately, we need locking around the entire packet. The individual read/write primitives lock around the individual operation. Since a packet is composed of up to 256 individual reads or writes, that leaves 256 sequence points between which other ACPI functions could take the lock and corrupt an in-flight exchange.

Solution 2: allow the OS to lock through ACPI (insufficient)

Proposal: Add a new node with a new _DSM that allows us to lock/unlock ECMT from the OS.

I actually tried this one!

I introduced a new sub-node of EC0 named ECPR ("EC protocol")¹; it was given a _HID of FRMW0003 and support for a new _DSM.

DSM UUID: 8829106f-5320-44e4-8f20-fa67326a71eb Function IDs

ID	Description
0	Validity mask
1	Lock ECMT
2	Unlock ECMT

It won't work because according to the ACPI spec 6.4, section 19.6.88, "ownership of a Mutex must be relinquished before completion of any invocation."

ASL for FRMW0003

``` // ... inside EC0 Device (ECPR) { Name (_HID, "FRMW0003") // _HID: Hardware ID Name (_STA, 0x0F) // _STA: Status Method (_DSM, 4, Serialized) // _DSM: Device-Specific Method { If ((Arg0 == ToUUID ("8829106f-5320-44e4-8f20-fa67326a71eb") /* EC Protocol GUID */)) { If ((0 == ToInteger (Arg1))) { Switch (ToInteger (Arg2)) { Case (0) { Return (Buffer(){0x03}) } Case (1) { Local0 = Acquire(ECMT, 1000) Return (Buffer(){Local0}) } Case (2) { Release(ECMT) Return (Buffer(){0x00}) } Default { Return (Buffer(){0x00}) } } } Return (Buffer(){0x00}) } } } ```

Solution 3 - move packet handling to ACPI (untested)

I believe it would be possible to move the transmission/receipt of an entire packet down into a DSM. I have not yet tested this, but plan to do so this weekend (1-14 - 1-16 as of writing).

We could send a buffer of up to 256 bytes in the fourth argument package to that new DSM, and it could return a package containing an integer length and an up-to-256-byte buffer in response. Everything could happen under ECMT.

It might still be valuable to expose a separate node, see footnote 1.

This would make this driver effectively Framework-specific, and would of course require a firmware revision. :smile:

Notes

I just noticed that as of TGL 3.17, the UCSI methods in SSDT 8 also lack sufficient locking around mutex ECMT!

¹ I think it would be beneficial to do this anyway, as it would remove the need for creating a virtual device ROOT\CrosEC and the driver INF file could just bind the ACPI HID directly. It allows for automatic installation and it removes complexity from the driver wherein it needs to look up its constituent ACPI nodes from the root.

DHowett commented 1 year ago

Solution 3 - move packet handling to ACPI (half-tested)

Miraculously, it worked! There's a lot of TODOs left.

ASL, packet handling

```asl OperationRegion (EIO2, SystemIO, 0x0200, 0x05) Field (EIO2, AnyAcc, Lock, Preserve) { Offset(0x04), HCMD, 8 } Function (XCMD, {BuffObj}, {PkgObj}) { Local0 = Acquire(ECMT, 1000) If (0 == Local0) { // Wait for ready While((HCMD & 0x6) != 0) { Sleep(10) } Local1 = DerefOf(Arg0[0]) Local0 = SizeOf(Local1) Local2 = 0 While(Local2 < Local0) { // Write one byte at a time (oops) // TODO(DH): Refactor to use 32-bit writes when possible // Optimization: address always starts at 0; we don't need to // handle a starting offset M002(Local2, DerefOf(Local1[Local2])) Local2 += 1 } HCMD = 0xDA While((HCMD & 0x6) != 0) { Sleep(10) } Local0 = 256 Local1 = Buffer(Local0){} Local2 = 0 While(Local2 < Local0) { // TODO(DH): We could read the beginning of the packet to // figure out how large of a buffer to allocate. Local1[Local2] = M001(Local2) Local2 += 1 } Release(ECMT) Return(Local1) } Release(ECMT) Return(Buffer(){0x00}) } Function (XMEM, {BuffObj}, {IntObj, IntObj}) { Local0 = Acquire(ECMT, 1000) // TODO(DH): read memory; Arg0 is the offset, Arg1 is len Release(ECMT) Return(Buffer(){0x00}) } Function (XSTR, {BuffObj}, {IntObj}) { Local0 = Acquire(ECMT, 1000) // TODO(DH): read string from memory; Arg0 is the offset Release(ECMT) Return(Buffer(){0x00}) } Method (_DSM, 4, Serialized) { // Function (_DSM, {BuffObj, PkgObj}, {BuffObj, IntObj, IntObj, PkgObj}) // _DSM: Device-Specific Method If ((Arg0 == ToUUID ("8829106f-5320-44e4-8f20-fa67326a71eb") /* EC Protocol GUID */)) { If (0 == ToInteger(Arg1)) { Switch (ToInteger(Arg2)) { Case (0) { Return (Buffer(){0x07}) } Case (1) { // Arg3 - Package{Buffer(256)} Return (XCMD (Arg3)) } Case (2) { // Arg3 - Package{Int, Int} Return (XMEM (ToInteger(DerefOf(Arg3[0])), ToInteger(DerefOf(Arg3[1])))) } Case (3) { // Arg3 - Package{Int} Return (XSTR (ToInteger(DerefOf(Arg3[0])))) } Default { Return (Buffer(){0x00}) } } } Return (Buffer(){0x00}) } Return (Buffer(){0x00}) } ```

It seems a lot slower, but the safety might be worth it.

Solution 4 - userspace retry

Admittedly, a much quicker solution would be to retry when you get a checksum error. It's not ideal, but for idempotent operations it might not be terrible.

DHowett commented 1 year ago

Alright, this more than proves out the concept.

DSDT patch, not cleaned up (it's missing some buffer length checks on the misaligned read/write check)

CrosEC patch, not cleaned up

This is all alpha-quality, and I should state for the record that while it is fun/interesting I don't know if it's exactly the right thing to do. I took it on as a challenge, after all. :)

It would be much "easier" to revisit Solution 2 by offering a primitive for locking and letting the driver hold it for a little bit; it would also keep the driver relatively clean of system-specific concerns. The original design (which is still present in the diff above) abstracted locking out to optionally include evaluating a DSM. :smile:

DHowett commented 8 months ago

Solution 5 - Take ECMT at the OS layer to block ACPI (impossible)

This actually works well, for Linux. A user named jubnut on the community forum has been working on a patch to do this.

Unfortunately, the Windows driver model does not offer the ability to just take a named ACPI mutex. ☹️

DHowett / FrameworkWindowsUtils