WebAssembly / design

WebAssembly Design Documents
http://webassembly.org
Apache License 2.0
11.4k stars 696 forks source link

AES-NI #1433

Open MaxGraey opened 3 years ago

MaxGraey commented 3 years ago

I'm not sure if this is the right place for this proposal, but I would like to suggest some very useful commands to speed up cryptography and especially cryptographic and non-cryptographic hashing.

What are the instructions being proposed?

AES-NI (Advanced Encryption Standard New Instructions) is extended instruction set which accelerate AES encryption / decription.

v128.aes.enc(a, b)

Perform one round of an AES decryption flow

general:

fn aesenc(v128 a, v128 b) {
  return MixColumns(ShiftRows(SubBytes(a))) ^ b
}

x86: aesenc ARM: AESMC + AESE + EOR PPC: vcipher

v128.aes.enc_last(a, b)

Perform the last round of an AES decryption flow

general:

fn aesenc_last(v128 a, v128 b) {
  return ShiftRows(SubBytes(a)) ^ b
}

x86: aesenclast ARM: AESE + EOR PPC: vcipherlast

v128.aes.dec(a, b)

Perform one round of an AES decryption flow

general:

fn aesdec(v128 a, v128 b) {
  return MixColumnsInv(ShiftRowsInv(SubBytesInv(a))) ^ b
}

x86: aesdec ARM: AESIMC + AESD + EOR PPC: vncipher

v128.aes.dec_last(a, b)

Perform the last round of an AES decryption flow

general:

fn aesdec_last(v128 a, v128 b) {
  return ShiftRowsInv(SubBytesInv(a)) ^ b
}

x86: aesdeclast ARM: AESD + EOR PPC: vncipherlast

v128.aes.keygen(a, imm8)

Generating the round keys used for encryption

general:

fn aeskeygen(v128 a, u8 imm) {
  X3[31:0] = a[127:96]
  X2[31:0] = a[95:64]
  X1[31:0] = a[63:32]
  X0[31:0] = a[31:0]
  RCON[31:0] = ZeroExtend(imm8)
  vDst[31:0] = SubWord(X1)
  vDst[63:32] = RotWord(SubWord(X1)) ^ RCON
  vDst[95:64] = SubWord(X3)
  vDst[127:96] = RotWord(SubWord(X3)) ^ RCON
  return vDst
}

x86: aeskeygenassist ARM: Efficient emulation on ARM (See emulating-x86-aes-intrinsics-on-armv8-a):

__m128i _mm_aeskeygenassist_si128_arm(__m128i a, const int imm8) {
    a = vaeseq_u8(a, (__m128i){}); // perform ShiftRows and SubBytes on "a"
    uint32_t rcon = (uint32_t)(uint8_t)imm8;
    __m128i dest = {
        // Undo ShiftRows step from AESE and extract X1 and X3
        a[0x4], a[0x1], a[0xE], a[0xB], // SubBytes(X1)
        a[0x1], a[0xE], a[0xB], a[0x4], // ROT(SubBytes(X1))
        a[0xC], a[0x9], a[0x6], a[0x3], // SubBytes(X3)
        a[0x9], a[0x6], a[0x3], a[0xC], // ROT(SubBytes(X3))
    };
    return dest ^ (__m128i)((uint32x4_t){0, rcon, 0, rcon});
}

PPC:

__m128i _mm_aeskeygenassist_si128_ppc(__m128i a, const int imm8) {
    a = __builtin_crypto_vcipherlast(a, (__m128i){}); // perform ShiftRows and SubBytes on "a"
    uint32_t rcon = (uint32_t)(uint8_t)imm8;
    __m128i dest = {
        // Undo ShiftRows step from vcipherlast and extract X1 and X3
        a[0x4], a[0x1], a[0xE], a[0xB], // SubBytes(X1)
        a[0x1], a[0xE], a[0xB], a[0x4], // ROT(SubBytes(X1))
        a[0xC], a[0x9], a[0x6], a[0x3], // SubBytes(X3)
        a[0x9], a[0x6], a[0x3], a[0xC], // ROT(SubBytes(X3))
    };
    return dest ^ (__m128i)((uint32x4_t){0, rcon, 0, rcon});
}

Details about operations like MixColumns, ShiftRowsInv and etc see Intel's white paper

How does behavior differ across processors? What new fingerprinting surfaces will be exposed?

What use cases are there?

Maratyszcza commented 3 years ago

AES instructions differ very substantially between x86 and ARM. The proposal should address a way to unify over these differences.

nemequ commented 3 years ago

POWER also has AES instructions, FWIW:

They were all added in POWER9 (ISA 3.0).

That said, I don't think these fit here since they're not really relaxed. Is there anywhere to gather ideas for a second version of the simd proposal?

MaxGraey commented 3 years ago

@Maratyszcza I added some details for x84 and ARM. ARM has slightly more low level intrinsics but it seems all fit into one API. Later I'll add couple more instructions related to AES

MaxGraey commented 3 years ago

I added another operation - v128.aes.keygen. It's pretty easy to emulate on ARM. On PowerPC you can do something similar

penzn commented 3 years ago

Looks reasonable to me.

jan-wassenberg commented 3 years ago

+1 to these being useful. @nemequ we can consider them 'relaxed' if we extend the definition to "each 128-bit block within a vector", right? SVE2 does that (unfortunately it's an optional extension), and that would also cover x86 VAES.

penzn commented 3 years ago

we can consider them 'relaxed' if we extend the definition to "each 128-bit block within a vector", right?

@jan-wassenberg, AFAIK, this proposal is limited to 128 bit operations, however we can probably consider it SIMD 2.0 of sorts, which should let us include these operations. I would be interested to pull this into flexible vectors to have the semantics you described 😉

jan-wassenberg commented 3 years ago

@penzn OK :) I'd also welcome both v128 and flexible AES.

jlb6740 commented 3 years ago

Explicit instructions for AES support are useful but I share the concern/question that it doesn't really fit here. What SIMD instructions are being relaxed here? Don't we expect the same result no matter the platform? What's the fall back for the VM if a VM doesn't support these relaxed instructions?

MaxGraey commented 3 years ago

What's the fall back for the VM if a VM doesn't support these relaxed instructions?

You can follow the same strategy as with qfma / qfms instruction which exposes extra environment variable (fpenv). AES-NI is very similar in this respect because I don't think it is possible to safely and efficiently emulate these instructions on scalar-only VMs and better follow a different strategy to implement AES or hashes on such platforms.

It is also worth remembering that these instructions are executed on constant time, which is quite important in term of security.

The AES instructions are designed to mitigate all of the known timing and cache side channel leakage of sensitive data (from Ring 3 spy processes). Their latency is data-independent, and since all the computations are performed internally by the hardware, no lookup tables are required. Therefore, if the AES instructions are used properly (e.g., as in the following code examples) the AES encryption/decryption, as well as the Key Expansion, would have data-independent timing and would involve only data-independent memory access. Consequently, the AES instructions allow for writing high performance AES software which is, at the same time, protected against the currently known software side channel attacks

jan-wassenberg commented 3 years ago

I don't think it is possible to safely and efficiently emulate these instructions on scalar-only VMs

I recently implemented a constant-time fallback using basic SIMD only.

MaxGraey commented 3 years ago

As I know most problematic for non-SIMD emulation is Galois field Multiplication

penzn commented 3 years ago

I recently implemented a constant-time fallback using basic SIMD only.

This reminds me of some instructions we have added late in the SIMD proposal. I think it would be doable, but we would need to prototype and measure.

I don't think it is possible to safely and efficiently emulate these instructions on scalar-only VMs

I don't think we should worry about scalar-only VMs, especially after SIMD making it to stage 5.

lars-t-hansen commented 3 years ago

Explicit instructions for AES support are useful but I share the concern/question that it doesn't really fit here. What SIMD instructions are being relaxed here? Don't we expect the same result no matter the platform?

Just to be pedantic, I don't think we should restrict this proposal to "relaxed" variants of existing SIMD instructions, assuming that that's what you were intending to say here. IMO any 128-bit SIMD operation whose best implementation might have different corner case behavior on some platform of interest should be fair game for the proposal (fma being a case in point).

MaxGraey commented 3 years ago

Actually, I don't mind if it's a Fixed SIMD 2.0 or Fixed SIMD Extensions proposal. It's just that such a repository does not exist yet

nemequ commented 3 years ago

Just to be pedantic, I don't think we should restrict this proposal to "relaxed" variants of existing SIMD instructions, assuming that that's what you were intending to say here. IMO any 128-bit SIMD operation whose best implementation might have different corner case behavior on some platform of interest should be fair game for the proposal (fma being a case in point).

Does this have different behavior on different platforms? FMA fits because the results will be different on different platforms since the operations may or may not actually be fused.

The summary for this proposal is:

This proposal adds a set of useful SIMD instructions that introduce local non-determinism (where the results of the instructions may vary based on hardware support).

So unless I'm missing something (definitely possible, I'm not really familiar with these work internally) this seems out of scope to me. FWIW, I'd very much like to start gathering potential instructions for a SIMD 2.0 extension, and think this would be a great addition to that list.

lars-t-hansen commented 3 years ago

Just to be pedantic, I don't think we should restrict this proposal to "relaxed" variants of existing SIMD instructions, assuming that that's what you were intending to say here. IMO any 128-bit SIMD operation whose best implementation might have different corner case behavior on some platform of interest should be fair game for the proposal (fma being a case in point).

Does this have different behavior on different platforms? FMA fits because the results will be different on different platforms since the operations may or may not actually be fused.

Again, being pedantic, FMA is not a "relaxed" version of any previous instruction, it is a new instruction with relaxed behavior. IMO all 128-bit instructions with relaxed behavior are in scope.

I don't know whether the AES instructions fall into that category; I was simply reacting to @jlb6740's wording and wanted to clarify that instructions with new behaviors are not out of scope for relaxed-simd.

MaxGraey commented 3 years ago

As I understand it Relaxed SIMD is a category of SIMD operations which do not guarantee deterministic behavior (e.g. with NaN or signed zero) or accuracy (as in the case of FMA instruction on platforms where it is not available). In the case of AES we have only one non-determinism: that we cannot guarantee same performance (cost model) for all platforms. In fact, many of the already standardized Fixed SIMD instructions cannot guarantee this either. So as far as I'm concerned this can really be related to SIMD 2.0

ngzhian commented 3 years ago

I think we agree that these AES instructions don't fit into relaxed-simd, but we also don't have a SIMD 2.0 proposal. As concrete next step, we can approach this as a new feature and follow the process that we have in place for new features, starting with filing an issue on the design repository (I can also transfer this issue there).

It is up to interested participants to scope out the proposal: it could be a SIMD 2.0 proposal consisting of even more SIMD instructions, or it could be only AES instructions (as detailed here), or it could be a broader crypto instructions proposal. This will be driven by the proposal champion(s).

I can provide pointers and help guide the process if @MaxGraey (or anyone else) is willing to take this on.

Update: in the 2021-08-06 SIMD sync, we discussed a potential streamlined/lightweight process to make small changes to the spec, this is not confirmed yet, but could also work for AES (if/when it happens).

penzn commented 2 years ago

@MaxGraey, from discussion at the SIMD subgroup meeting, we should probably highlight some use cases for these instructions, then it can be taken to the CG.

MaxGraey commented 2 years ago

@penzn "What use cases are there?" point at the end takes some use cases. To summarize, these instructions are used mainly in two cases: to speed up AES encryption/decryption and to speed up сryptographic and non-сryptographic hash functions. Cryptographic hashes can often be used for databases and for very simple HashMap / HashSet containers with high resistance to HashDoS attacks.

penzn commented 2 years ago

Thanks @MaxGraey, to me this looks sufficient. @dtig, what do you think? Should we move the issue to design?

dtig commented 2 years ago

Thanks @penzn and @MaxGraey this sounds like a good start, I'll transfer this issue to the design repository as it is out of scope for relaxed-simd. Please note that a streamlined/lightweight process will still need a champion and satisfy the requirements outlined in the Phases document.