LoupVaillant / Monocypher

An easy to use, easy to deploy crypto library
https://monocypher.org

Fixed size_t storage in lock_auth for 32-bit platforms. #241

Closed mliberty1 closed 1 year ago

mliberty1 commented 1 year ago

lock_auth was always storing ad_size and text_size using store64_le. On 32-bit platforms, this would also store whatever data happened to follow the 32-bit ad_size and text_size values. If the following 32 bits were non-zero, the MAC would be computed with the wrong length, causing decryption to fail.

Risk assessment: low - the behavior fails safe, preventing decryption. However, developers on 32-bit platforms would have experienced unexplained problems with unlock failing to work as expected.

LoupVaillant commented 1 year ago

Wait a minute, how is this even possible? My understanding of the C standard tells me it is not. store64_le() is defined thus:

static void store64_le(u8 out[8], u64 in)
{
    store32_le(out    , (u32)in );
    store32_le(out + 4, in >> 32);
}

Function call arguments are supposed to be converted/promoted to the parameter type before we do anything. In our case, we convert size_t to uint64_t. On 64-bit platforms that's a no-op, and on 32-bit platforms it would have the same effect as an explicit cast.

So in theory my code is correct. But how about in practice? Do you have a way to reproduce the bug? I have ways to test 32-bit platforms; maybe you can show me the failing code?

mliberty1 commented 1 year ago

You are definitely right. store64_le should promote size_t (which is u32) to u64. When I saw this by inspection, I just saw size_t being passed to something that incremented by 8 and thought I found the issue!

I agree that it is most likely something else that was (intermittently?) corrupting memory in my system. I saw it work once, and credited this fix. Seems like I may have been too eager. I have lots of moving parts in the code now, but I'll keep an eye out over the next few days.

I will close and reopen another pull request / issue if it's not my code.

LoupVaillant commented 1 year ago

Works for me.

About your fix itself (assuming you were right and I am wrong), I think we have a better way. To the extent possible, I try to avoid conditional compilation, which makes tests more complicated and brittle. In this case, I believe an explicit cast, or at the very worst copying the argument to a local variable, should be enough. Something like the following:

-    store64_le(sizes + 0, ad_size);
-    store64_le(sizes + 8, text_size);
+    store64_le(sizes + 0, (u64)ad_size);
+    store64_le(sizes + 8, (u64)text_size);
mliberty1 commented 1 year ago

Hi @LoupVaillant - I have been staring at my code today. You are definitely right that this Monocypher "fix" was much too hasty. I believe that the Monocypher code is correct as given. I have already found a number of other buffering issues and linker script issues in my code.

Regarding buffering complexities, "arbitrarily" splitting the firmware into blocks for AEAD is not helping. I have decided to go a different way. The system doesn't need AEAD since it validates both an inner EdDSA signature (authenticity) and an outer EdDSA signature (approved for production release). I now believe that a more appropriate confidentiality solution is to use XChaCha20 in counter mode and decrypt blocks as they arrive. I can use the Monocypher incremental signature verification API for the outer signature. The inner signature can be either incremental or performed after everything else. I believe that this approach will greatly simplify my buggy buffering code with no loss of security.

Thank you for taking the extra time to consider my pull request and provide quick feedback. You helped me quickly remove Monocypher from my blame list ;)

LoupVaillant commented 1 year ago

I can use the Monocypher incremental signature verification API for the outer signature. The inner signature can be either incremental or performed after everything else.

Hmm, be very careful with this one: incremental signature verification sounds like you will pass down data to the rest of the system before the verification is complete. If you end up actually reading that unauthenticated data you may expose yourself to very serious attacks. To do this safely, you need to merely store that data somewhere, and only start using it after the signature is complete. Cryptographic doom principle and all that.

Alternatively you could make sure that the operations that result from processing unauthenticated data can be reverted if the signature happens to fail. Only do this if you absolutely have no choice however, because this would greatly increase the attack surface of your system.

mliberty1 commented 1 year ago

Hi @LoupVaillant and thanks for the tips. You are right that the software will write to flash incrementally. While the microcontroller has room for a bootloader and redundant updater images, it only has room for one application image. I think that I have a reasonable solution. The header (before the signed part) has a spare u32[4] field. The software forces this field to all 1's. Flash writes can only flip bits to 0; bits only return to 1 with an erase. When each phase passes, the software writes some predetermined bits in this field to zero. The bootloader then only considers images that have the appropriate full set of bits set to zero. To make power glitch attacks harder, it's better if each stage updates its bits separately. For this reason, it's probably better to compute the inner signature as a contiguous block over the fully decrypted image.

I have done this or things like it many times across multiple projects for firmware updates, but that doesn't mean it's the best way. What do you think? Do you see anything that I am missing?

For what it's worth, here is my work-in-progress Nth-generation firmware image format:

/**
 * @brief The format for each section.
 *
 * This structure presumes little-endian storage.
 * As of 2022, big-endian is not used by any major platform or OS.
 */
struct fwimg_section_v1_s {
    uint8_t magic_header[32]; // FWIMG_HEADER

    /**
     * @brief The entire section length in bytes.
     *
     * This length starts with the first magic_header byte
     * and extends to the last section data byte.
     */
    uint32_t length;

    /**
     * @brief The bitwise negation of the length field.
     *
     * Provides protection against data corruption.
     */
    uint32_t length_negate;

    /**
     * @brief The section format version.
     *
     * 31:24: version = 1
     * 23:0: options (reserved), set to 0
     */
    uint32_t format_version;

    /**
     * @brief The bitwise negation of the format_version field.
     *
     * Provides protection against data corruption.
     */
    uint32_t format_version_negate;

    /**
     * @brief Field reserved for target state management.
     *
     * Write to 0xffffffff.
     *
     * The target firmware upgrade process often requires that an image
     * is erased and programmed in place sequentially.  This header is
     * often written first followed by the image.  The bootloader may detect
     * this header but the image has not been fully written or validated.
     *
     * While the bootloader can validate the signature on each launch,
     * signature validation may be too expensive.  The bootloader may
     * opt for no validation or just hash validation.  However, the
     * upgrade process must ensure that the image is not launched
     * until validation occurs, even during failures and power loss.
     *
     * This field allows the target to keep track of the present state
     * of this image.  Flash device "writes" often toggle bits from 1 to 0,
     * but not from 0 to 1 without a separate erase.  The target device
     * can initially write 1's to this field and then write 0's
     * to track progress.
     */
    uint32_t state[4];

    /**
     * @brief The public key signature over this section.
     *
     * The signature is computed starting with the "signature_type" field
     * extending to the end of this section.
     *
     * If the section is encrypted, this signature should use the
     * distribution key.  If the section is unencrypted, this signature should
     * use the release key.
     */
    uint8_t signature[64];

    /**
     * @brief The first hash over this section.
     *
     * The hash computation starts with the "signature_type" field and
     * extends to the end of this section.
     * Hashes may not use all 64 bytes.
     */
    uint8_t hash1[64];

    /**
     * @brief The second (alternative) hash over the section data.
     *
     * The hash computation starts with the "signature_type" field and
     * extends to the end of this section.
     * Hashes may not use all 64 bytes.
     */
    uint8_t hash2[64];

    /**
     * @brief The fwimg_signature_type_e inner asymmetric signature over the section.
     */
    uint32_t signature_type;

    /**
     * @brief The fwimg_hash_type_e for the hash1.
     *
     * Multiple hash types allow choosing between a software hash
     * and a hardware-accelerated hash.
     */
    uint32_t hash1_type;

    /**
     * @brief The fwimg_hash_type_e for the hash2.
     */
    uint32_t hash2_type;

    /**
     * @brief The fwimg_encryption_type_e symmetric encryption type for this section.
     *
     * An encrypted section contains a full unencrypted section, including
     * this header.  Including the full section ensures that the
     * receiver can fully validate the unencrypted section.
     */
    uint32_t encryption_type;

    /**
     * @brief Reserved for future use.  Set to 0.
     */
    uint64_t reserved1_u64;

    /**
     * @brief The time that the image was created.
     *
     * This timestamp is given in seconds since the Unix epoch,
     * 1970-01-01T00:00:00Z.
     */
    int64_t timestamp;

    /**
     * @brief The image revision.
     *
     * The section revision should be stored in little-endian format
     * with a 16-bit patch, 8-bit minor revision and 8-bit major revision.
     */
    uint32_t version;

    /**
     * @brief The pseudo-unique vendor identifier for the image.
     */
    uint16_t vendor_id;

    /**
     * @brief The product identifier assigned by the vendor.
     */
    uint16_t product_id;

    /**
     * @brief The image subtype assigned by the product.
     *
     * A single product may include multiple subtypes, such as
     * firmware, FPGA bitstreams and calibration data.  Each product may
     * assign values for this field or not use it.
     */
    uint16_t subtype;

    /**
     * @brief The section index.
     *
     * This field is used to generate the nonce for images that
     * have more than one section.  Otherwise, set to 0.
     */
    uint16_t section;

    /**
     * @brief The bitmask of hardware revision compatibility for this image.
     *
     * Each bit represents a potentially incompatible hardware revision.
     * This field should set the bit for each hardware version supported.
     * If this field is not used, set to 1.
     */
    uint32_t hardware_compatibility;

    /**
     * @brief Reserved for future use.
     *
     * Write 0.
     */
    uint32_t reserved2_u32;

    /**
     * @brief The authentication data for symmetric encryption.
     *
     * For ChaCha20 + Poly1305, only the first 16 bytes are used.
     * The nonce is always the 24 bytes of metadata starting with
     * the "timestamp" field.
     * Both encryption and decryption should update the "section" field
     * when more than one section is used to prevent nonce reuse.
     * The authenticated data is always from
     * "timestamp" [inclusive] to this "mac" field [exclusive].
     */
    uint8_t mac[64];

    /**
     * @brief Pad to 512 bytes.
     *
     * This pad allows room for future features and guarantees data alignment.
     * Set to 0.
     */
    uint8_t reserved_pad[140];

    /**
     * @brief Optional, arbitrarily structured metadata as a JSON formatted UTF-8 string.
     *
     * Write to 0 if unused.
     */
    char metadata_json[512];

    /**
     * @brief The section data, such as the firmware image.
     *
     * This field is guaranteed to start at offset 1024 from the
     * first magic header byte.  The bootloader and linker files
     * must be designed to accommodate this offset.
     *
     * This field must be length padded to a multiple of 512 bytes.
     */
    int8_t data[];
};
fscoto commented 1 year ago

I might not be Loup, but perhaps you may find my words to be of some use regardless. I assume the firmware header sections are stored as-is and loaded unchanged into memory, with alignment and padding figured out. He's free to correct me, but I'm reasonably confident on these points.

I'm not sure why you need glitching protection when writing with a multi-phase design unless the first-stage bootrom is rewritable: You have a signature check for integrity/authenticity. But there's no harm in it as far as I can tell.

The first thing I notice is that this is a section header. Therefore, there must be some kind of section table. Is that one signed, too, and checked before the section header is ever examined? Otherwise, people can shuffle sections around (and manipulate section headers, if a section header check bypass is discovered). Similarly, the section header should not be tracking its own index. The bootloader knows the index implicitly, even if that makes nonce handling more awkward.

I'm not sure why you need 32 bytes of FWIMG_HEADER magic. UUIDs, for example, do just fine at 16 bytes. That's an instant gain of 16 bytes to me, maybe more depending on what this magic is even used for except as a (redundant) version field.

Is there a reason you'd need to guard the length and version fields with negated fields? The section header should be signed (see above). If it verifies, data corruption is off the table.

I am deeply concerned about the comment on the state field about a bootloader skipping signature validation. Recall glitching attacks, which seem to be part of the threat model. Now you just have to glitch into the non-validating branch of code (possibly just jumping over a single instruction) and you're off to the unsigned code execution races. If anything, the signature validation should happen twice in different variations (e.g. once with a cmp and another time using the crypto_sign return value as a bitmask for correct execution) and at different times to avoid people glitching over just one check (isn't there a fairly obvious target to skip over one signature check that also happens right after a flash read?).

A signature check already internally computes a hash. If you really insist on a second hash (why? Any attacker that gets past the signature check(s) can also just tamper with the hash field), hash1 is plenty if you choose it such that it is different from the one for signature verification. You can drop hash2 and hash2_type entirely for all I can tell.

Is the timestamp being int64_t a typo? There's no way an image could possibly be made before the epoch, so there's no need for negative numbers.

Are you sure product IDs won't run out?

This is partly opinion, but I'm not a big fan of the segmented version field. Consider the case of downgrade protection by means of eFuses. Since those necessarily are non-rewritable, the version field has to act as a bit vector instead. Given this, I'd opt to make the version field an uint64_t and leave use and interpretation thereof to the rest of the tooling.

I'm not sure how I feel about the sign+encrypt construction and especially this much plaintext metadata. Normally, you would sign-then-encrypt (Yuliang Zheng/Hideki Imai, How to construct efficient signcryption schemes on elliptic curves, Information Processing Letters, vol. 68 [1998], iss. 5, pp. 227–233, p. 227[Fn. 1]). Thus you'd end up more with something like:

But I'm leaning very far out the window there, Loup may well disagree.

[1] There are plenty of sign-and-encrypt (signcryption) schemes going around with, as far as I can tell, none catching on or having a different threat model (often a certificate authority with two parties sending messages to each other, but firmware updates are per se unidirectional communications), but all of this would involve comprehensive studying of papers and gutting Monocypher's signature internals. I'm thus leaving actual signcryption out of scope.

LoupVaillant commented 1 year ago

Ok, keep in mind that I have little experience with embedded software updates, so I may be out of my depth here. I would just like to reword your constraints and come up with the simplest solution I can.

Your firmware has 3 parts:

I guess that when you update an update image, you first write it to a free slot, and if it checks out you may erase the older one. For the application image however you need to erase the previous one before the full image is even verified. So ideally, only authorised parties should be able to send images to the device. In other respects, both images are functionally the same: you need to receive, store, and verify the image you're receiving. Once all three are done you can tell the bootloader to start trusting that new image.

In addition to verifications, it would seem you would prefer to keep your images secret. So the device needs at least 2 keys: a private X25519 key (possibly derived from an EdDSA identity key), and at least one public key from some trusted server.

To make sure only authorised parties can start an update, I recommend Noise protocol's XK handshake (Monokex is not quite, sorry): the server initiates the connection, the key of the device is known in advance, and the server transmits its own key. And as part of the handshake, the server also transmits a signature (or a full blown certificate) of its public key. Once the handshake is done and the public key of the server certified, the device knows it is talking to an authorised server.

Now we can start sending actual data. That data should come as AEAD packets. Monocypher's crypto_lock() and crypto_unlock() should be enough for that purpose. The real question is how to organise those packets:

That way, each packet the device receives comes from an authorised server. If this succeeds that means you received everything in the right order, and you can continue with the update. If this fails you just need to abort (though I feel for you if you just overwrote half of your application image).

If you want the extra security of a signed image, you can do exactly as you suggested: use the incremental verification API on each plaintext chunk, then verify at the end.

mliberty1 commented 1 year ago

Hi @fscoto and @LoupVaillant - Thank you for taking the time to put together your feedback!

This is a long reply. I appreciate the feedback you have already provided! I also want to be respectful of your time. The following is far beyond Monocypher development & maintenance, so feel free to skip reading this. Putting this response together has already helped me think about your feedback and how to further improve my system!


  1. The target product is a USB-connected device, and all updates happen over USB under the control of software running on the host system.
  2. While we could establish an end-to-end encrypted link to the firmware distribution server, we do not. Instead, the same format is used as a file format, and it also provides confidentiality for data at rest on the host computer. The 32-byte header serves as both the tag in what amounts to a tag-length-value format and as a file differentiator. It also ensures that the host file system does not mess with the file (line endings, endianness, etc) on read:
    6A 6F 75 6C       | ASCII "joulescope_img_section"
    65 73 63 6F       | ASCII "joulescope_img_section"
    70 65 5F 69       | ASCII "joulescope_img_section"
    6D 67 5F 73       | ASCII "joulescope_img_section"
    65 63 74 69       | ASCII "joulescope_img_section"
    6F 6E             | ASCII "joulescope_img_section"
    0D 0A             | DOS line ending to ensure binary correctness.
    20                | ASCII space.
    0A                | UNIX line ending to ensure binary correctness.
    20                | ASCII space.
    1A                | Substitute character (stops listing under Windows).
    20 20             | ASCII spaces.
    B2                | Ensure that system supports 8-bit data
    1C                | File separator.

    While I probably could reduce this to 16 bytes, I am not overly concerned about taking 1 kB for the full header. The target microcontroller has 1 MB flash. Each updater image is 128 kB, and the main application is 512 kB. The rest is reserved for bootloader, personality, logging, configuration, & user data storage.

Using a single file available to every device also means that we cannot do key exchange to establish the symmetric key (Noise protocol's XK handshake or any other method). Instead, we have a pre-provisioned symmetric key (see 5) at the cost of additional risk of confidential firmware exposure.

  1. @LoupVaillant - Your description of the sections of a fixed factory bootloader + 3 field-updated images (2 updaters, 1 application) matches what I envision. However, I am very conservative when it comes to firmware updates to avoid bricking a device. Even if a firmware update file contains all three images, the host software would only apply one at a time, rebooting the device between each update:
     a. Reboot into updater 1, which falls back to updater 2 if updater 1 fails.
     b. If in updater 2 (fall-back case), update updater 1. Goto a.
     c. Update updater 2.
     d. Reboot into updater 2 & verify that updater 2 is running. On error, goto a.
     e. Update updater 1.
     f. Reboot into updater 1 & verify that updater 1 is running. On error, goto a.
     g. Update the application.
     h. Reboot into the application. On error, goto a.

  2. We provision devices in the factory with a personality. In addition to a bunch of product metadata, the personality includes:
     a. EdDSA public signature key for validating the firmware image authenticity.
     b. EdDSA public signature key for validating the outer encrypted image authenticity. We use different keys for production vs. test devices. This manual release step prevents test images from accidentally getting deployed in the field.
     c. Symmetric ChaCha20 key for encryption. A single compromised device will expose the firmware, but because of (a) it still prevents others from distributing firmware. We could use unique symmetric keys per device to allow revocation, but this dramatically increases firmware distribution complexity.
     I think that this does exactly what @fscoto recommends, except that I am dropping AEAD. The encrypted data actually contains an entire full image with the same header (except for the encryption_type field, since the inner section is unencrypted), so I think that the inner signature validation performs the same integrity check as authenticated encryption. Using XChaCha20 counter mode directly will make the decryption implementation much easier.

  3. The section table is hard-coded as the memory map. Since images run in place from flash, including the exception table, position-independent code is challenging, so each image is compiled and linked specifically for its memory map location.

  4. I share @fscoto's concern with the state[]. The goal is to fully validate the image on update, and then mark it as good. I would like to avoid the cost of signature validation on every boot to keep boot time short. However, your comment makes me realize that I am assuming signature validation takes "too long". I will quantify this! It's definitely better to perform signature validation on every boot than to rely solely on the state bits! Regardless, the bootloader will need to check image validity, using whatever form(s) we end up with, at least twice to provide increased robustness against power glitch attacks.

  5. The section header is not signed from the beginning. The signature starts with the "signature_type" field. "magic_header" has only one allowed value, so signing it does not matter. length is effectively part of the signature, since the signature operates based upon this length. format_version determines how the software interprets this data structure, so bad values should result in validation failure, too. Both length_negate and format_version_negate help detect data corruption. I could use something like crc-32, but negate is simpler, good enough, and doesn't take up too much space for this application.

  6. The hash exists to provide faster boot validation than signature validation. I will profile signature validation time to see if this matters - see (6). The dual hash fields allow for two options: Blake2s software validation and whatever hardware-based accelerator exists. In the case of the target chip, this is SHA-256. Unfortunately, I found that the hardware accelerator in this chip is not computing SHA-256 compliant digests, and I have no idea exactly what it's computing at this point. I have tried different endianness and initialization vectors, but no luck yet...

  7. I like signed integers for timestamps to make math easy on deltas. You are right that this value will not be negative. I can easily live with 63-bit precision.

  8. Thank you @fscoto for concerns about 16-bit product_id. USB only has a 16-bit product_id, so it's good enough for me! Since product_id is only unique within a vendor_id, it should be good enough. I certainly hope I build 2**16 successful products with this :)

  9. The version field could be a bitfield. I actually treat hardware versions this way, which helps make the hardware_compatibility field possible. While 64 firmware versions will hopefully be more than enough for released firmware, it is not enough to account for development, too. If I do end up implementing any downgrade protection, it will need to be in software as part of the updater code. While a bitfield would be best for hardware support, major.minor.patch will be fine for the updater software. I really haven't thought about how I would reliably implement downgrade protection. It's easy enough to compare the existing version field with the requested update's version field in the updater. I'll have to think about this more to make sure I don't prevent this feature if I ever need it. Thanks!


Thank you again both for your time and feedback. You have given me a few things to think about and investigate. If you have more thoughts or feedback, I would be very happy to hear it. However, I also want to be respectful of your time. This type of feedback is far beyond Monocypher development & maintenance!

LoupVaillant commented 1 year ago

Ah, I see, I completely botched the threat model. The encryption is there to prevent your users from seeing your proprietary code, and as far as I can tell they trigger the update manually from their host computer. So there's no need to protect the device from unauthorised update beyond checking image integrity/authenticity at the very end.

I have two remarks at this point. The first is about the fact your software is proprietary. You sell hardware, so maybe that's where most of your added value lies? I'm curious about what you'd lose by making your software free (like GPLv3). If anything, providing free software could make your hardware more valuable. (For instance I've heard that one reason some GPU vendors don't provide free drivers is because they're afraid they'd give patent suits ammunition to the competition.)

My second point is your use of two signature keys, one for the outer encrypted image, one for the inner firmware image. It feels overkill. While I would agree that in general, encrypt-then-mac is better than mac-then-encrypt, in your case you're not avoiding the cryptographic doom principle with your outer signature, since the limited memory of your device forces you to decrypt the image on the fly, before you even get a chance to verify the outer signature. (You could first write the encrypted image to flash, then decrypt it in place if the signature checks out. But that's slower and wears out your flash twice as fast.)

So I believe you can remove the outer signature entirely. The whole process could look like this:

  1. Make a new firmware image, header and everything.
  2. Sign it & prepend the signature
  3. Encrypt everything (including the signature) with raw XChaCha20. Prepend the random nonce.
  4. Send the encrypted image to your user.
  5. User loads the encrypted firmware to the device. The device decrypts everything on the fly & stores the image to disk.
  6. The device automatically reboots. At this point, verifications start to happen as part of the normal boot process:
    1. The device verifies the signature of the firmware image.
    2. The device verifies the version of the firmware, maybe makes sure it's newer than the last.
    3. The device tries to complete the boot process.
    4. The device marks the current firmware version as the "last firmware used" (if doing it over and over wears out the hardware, only do that when the firmware has changed).

Of course, step 6 only describes the happy path. You could also verify the signature of the firmware image in step 5 if you want to do it before reboot.

If step 6.i is too slow (I'd be curious about your performance tests by the way), you probably want to add a CRC-32 to your firmware image and verify it there instead of verifying the signature. And if implementing CRC-32 is a hassle, try Poly1305 instead. It's slower, the "checksum" is 48 bytes instead of 4 (32 bytes for a random "key", 16 bytes for the "MAC"), but it will certainly be faster than Blake2b, let alone EdDSA.

mliberty1 commented 1 year ago

Hi @LoupVaillant - My customers are really purchasing the ability to build more energy-efficient products. The Joulescope instrument is the actual tool, which consists of hardware, FPGA gateware, firmware, and host software. The hardware consists of off-the-shelf components, no custom silicon. I am an open-source advocate, but I also need to take reasonable precautions to ensure some factory in China can't just clone Joulescopes that we have spent so much to develop. Allowing customers to modify gateware and firmware would only create support nightmares for very little value. So locking down the FPGA and firmware provides cloning protection while not significantly reducing value to our customers.

However, customers receive significant value in being able to modify and develop their own custom automation using the host software. While I do risk enabling other competitive products, I feel that the value to my customers of open-source host software outweighs the risk. The host software is already open source with an Apache 2 license. See pyjoulescope, pyjoulescope_ui, pyjoulescope_examples, and jls. We will continue to provide open-source host software.

I agree that the outer signature is overkill from a security perspective. However, it lets me sleep better at night knowing that untested or partially tested software cannot accidentally be released. Besides, the signature is only computed once on the device during firmware upgrade, where we can tolerate the extra delay.

I will perform some tests to determine performance, and find what the bootloader can compute each time. Thank you for the recommendation for Poly1305, which I would not have considered! Here are the things for me to test:

  1. Hardware-accelerated SHA-256 like hash, which I can't figure out how to replicate in software.
  2. CRC-32C (software with 1 kB lookup table)
  3. Poly1305 (monocypher)
  4. Blake2b (monocypher)
  5. EdDSA with Blake2b (monocypher)

I'll post the results here when I have them. I had to hop to a different task, but I hope to post the results later this week.

LoupVaillant commented 1 year ago

The hardware consists of off-the-shelf components, no custom silicon.

Ah, that kinda changes everything. With that, I understand that copycats are a real concern. Kudos for the free software parts despite this risk.

mliberty1 commented 1 year ago

I have some performance numbers. The results are surprising to me and show just how fast Monocypher is!

The target microcontroller is an ARM Cortex-M7 (Microchip ATSAMS70) running at 288 MHz. The code runs from zero wait state ITCM. The stack and temporary data are located in zero wait state DTCM. The hash is computed over 512 kB of flash running with 7 wait states, but with a 128-bit wide memory interface and two 128-bit wide read buffers. I compiled the code using gcc-arm-none-eabi 11.2-2022.02 with "-O1".

Benchmark validating 524,288 bytes of flash:

- sum32: 262,176 clocks
- sha256_hw: 1,704,977 clocks
- crc-32c: 11,535,713 clocks
- poly1305: 15,927,233 clocks
- Blake2b: 22,628,103 clocks
- EdDSA: 25,633,767 clocks

I am really surprised that poly1305 using crypto_poly1305 is almost as fast as crc-32c. Blake2b using crypto_blake2b is only half the speed of CRC-32C! EdDSA signature validation only takes an additional 3,005,664 clocks over the Blake2b hash computation.
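To make the comparison concrete, the same numbers can be expressed as cycles per byte. This is just arithmetic over the benchmark figures above (524,288 bytes per pass), written out as a small C program:

```c
#include <stdio.h>

/* Convert the benchmark clock counts above into cycles per byte. */
static const double FLASH_BYTES = 524288.0;

static double cycles_per_byte(double clocks) {
    return clocks / FLASH_BYTES;
}

static void print_results(void) {
    const struct { const char *name; double clocks; } r[] = {
        { "sum32",     262176.0   },
        { "sha256_hw", 1704977.0  },
        { "crc-32c",   11535713.0 },
        { "poly1305",  15927233.0 },
        { "Blake2b",   22628103.0 },
        { "EdDSA",     25633767.0 },
    };
    for (int i = 0; i < 6; i++)
        printf("%-10s %6.2f cycles/byte\n", r[i].name,
               cycles_per_byte(r[i].clocks));
}
```

That works out to roughly 0.5 (sum32), 3.3 (SHA-256 hardware), 22 (CRC-32C), 30 (Poly1305), 43 (Blake2b) and 49 (EdDSA) cycles per byte.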

The only caveat here is that EdDSA validation is currently failing since I am running it over an arbitrary area of flash. I think that the crypto_check is correctly performing the validation, but I will follow up if I find out otherwise on a real, signed image.

So, the full EdDSA signature validation takes 25,633,767 / 288e6 = 89 milliseconds. My goal is to reboot in 1/2 second, which is about as fast as USB can detect unplug / replug reliably. The bootloader will implement a full EdDSA signature validation on each boot, @LoupVaillant's step 6.i.

Thank you for encouraging me to benchmark this performance!

LoupVaillant commented 1 year ago

So EdDSA verification is eating up less than 20% of your budget. That’s very good news! 🎉

I’ve run your numbers, and noticed that EdDSA is 88% Blake2b, and 12% scalarmult and such. If by any chance the rest of the boot sequence takes too long, as a last resort you can speed up verification by first running hardware SHA-256 on your image, then signing (and verifying) the hash instead. I expect a speed-up by a factor of 7 or 8, for a total of 15 ms (3% of your half second). I would still favour raw EdDSA (it’s much simpler), but it’s nice to know you have faster variants if you really need them.

I am really surprised that poly1305 using crypto_poly1305 is almost as fast as crc-32c.

It’s not that surprising, considering how Poly1305 is implemented: each step reads 16 bytes then performs about 20 64-bit multiplications. Even with all the additions that surround them, that’s not a whole lot of operations per byte.

CRC-32C, on the other hand, needs one table lookup per byte. Apparently the Cortex-M7 doesn’t always have a cache, and this table is almost certainly too big to fit in registers. If your chip doesn’t have a data cache, those lookups are going to be quite slow. You may get a faster CRC by using a 4-bit table, which needs only 16 entries and would (I guess) likely fit in registers. This means twice as many lookups, and it will be more complex, but it won’t query memory all the time.
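A sketch of what such a 4-bit-table CRC-32C could look like (hypothetical code, not from the thread or Monocypher; the 16-entry table is generated at startup, and the hot loop does two lookups per byte):

```c
#include <stdint.h>
#include <stddef.h>

/* CRC-32C (Castagnoli), reflected polynomial 0x82F63B78,
   using a 16-entry (4-bit) lookup table. */
static uint32_t crc32c_nibble_table[16];

/* Build the table once at startup: entry i is the CRC state after
   shifting the nibble i through the polynomial bit by bit. */
static void crc32c_init(void) {
    for (uint32_t i = 0; i < 16; i++) {
        uint32_t c = i;
        for (int b = 0; b < 4; b++)
            c = (c & 1) ? (c >> 1) ^ 0x82F63B78u : c >> 1;
        crc32c_nibble_table[i] = c;
    }
}

uint32_t crc32c(const uint8_t *data, size_t len) {
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        crc = (crc >> 4) ^ crc32c_nibble_table[crc & 15];  /* low nibble  */
        crc = (crc >> 4) ^ crc32c_nibble_table[crc & 15];  /* high nibble */
    }
    return ~crc;
}
```

The classic check value is `crc32c("123456789") == 0xE3069283`, which makes a quick sanity test for any CRC-32C reimplementation.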

If you do have a cache however memory lookups aren’t the problem. I would look instead at how you are reading bytes. One thing I’ve noticed with Monocypher is that simply reading bytes one by one is slow. So slow in fact that it used to be the main bottleneck for Chacha20 and Poly1305. This is probably the case for CRC-32 as well. Try to load your bytes 4 by 4 (mind alignment) and unroll your main loop accordingly.

Oh, and CRC-32 likely has some real nasty data dependencies, where each operation depends on the previous. One easy way to get around that is to perform 4 CRC-32 in parallel, with interleaved data streams. Think of it like a poor man’s vectorisation. Hmm, actually, since CRC-32 is hardware friendly, you could consider a bitslice implementation, where you’d have up to 32 parallel lanes. I’m going to have to study that…
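To illustrate the interleaving idea, here is a hypothetical sketch where the “checksum” becomes four independent CRCs, one per lane of bytes `i % 4 == k`. This breaks compatibility with plain CRC-32C, but it removes the serial data dependency between accumulator updates (a bitwise kernel is used here just to keep the sketch short; a real implementation would use per-lane tables):

```c
#include <stdint.h>
#include <stddef.h>

/* Bitwise CRC-32C update over one byte (reflected poly 0x82F63B78). */
static uint32_t crc32c_byte(uint32_t crc, uint8_t byte) {
    crc ^= byte;
    for (int b = 0; b < 8; b++)
        crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1)));
    return crc;
}

/* Four interleaved CRC-32C lanes: lane k covers bytes k, k+4, k+8, ...
   The four accumulators have no data dependency on each other, so the
   CPU can overlap their updates. Assumes len is a multiple of 4. */
void crc32c_x4(const uint8_t *data, size_t len, uint32_t out[4]) {
    uint32_t c0 = ~0u, c1 = ~0u, c2 = ~0u, c3 = ~0u;
    for (size_t i = 0; i + 4 <= len; i += 4) {
        c0 = crc32c_byte(c0, data[i]);
        c1 = crc32c_byte(c1, data[i + 1]);
        c2 = crc32c_byte(c2, data[i + 2]);
        c3 = crc32c_byte(c3, data[i + 3]);
    }
    out[0] = ~c0; out[1] = ~c1; out[2] = ~c2; out[3] = ~c3;
}
```

Each lane equals a plain CRC-32C over its de-interleaved byte stream, so the result is well defined; it just isn’t the CRC-32C of the whole buffer.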

fscoto commented 1 year ago

There's one thing I'd like to ask now: does your SHA-256 hardware actually only do SHA-256, or can it switch between SHA-256/384/512 at runtime? If so, you could actually swap out Monocypher's EdDSA-with-BLAKE2b for by-spec Ed25519, which uses SHA-512, and reap all the benefits of the hardware.

I'm honestly also surprised that all the 64-bit arithmetic performs acceptably considering it's a 32-bit processor. I really hope there aren't any non-constant-time compiler-generated helper methods in there...

LoupVaillant commented 1 year ago

I'm honestly also surprised that all the 64-bit arithmetic performs acceptably considering it's a 32-bit processor.

That's because ARMv7 has exactly the instructions we need. See the manual, section A4.4.3: there are four 32->64 multiplications in there we can use directly (signed and unsigned, and we have multiply-accumulate variants as well). Compiler Explorer shows that GCC makes active use of them.

I believe those instructions are the main reason why Monocypher is such a speed demon on 32-bit ARM.
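To see the promotion in action, a minimal illustration (not Monocypher code): casting one operand to 64 bits yields the full-width product, which GCC lowers to a single UMULL on ARMv7.

```c
#include <stdint.h>

/* A full 32x32 -> 64 multiply. On ARMv7, GCC compiles this to a single
   UMULL instruction; without the cast, the product would be computed in
   32 bits and the high word lost before widening. */
static uint64_t mul_32x32(uint32_t a, uint32_t b) {
    return (uint64_t)a * b;  /* b is promoted to uint64_t as well */
}
```

For example, `mul_32x32(0xFFFFFFFF, 0xFFFFFFFF)` gives `0xFFFFFFFE00000001`, whereas the uncast 32-bit product truncates to `1`.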

mliberty1 commented 1 year ago

Hi @fscoto - The SHA hardware can supposedly do SHA-1, SHA-224, and SHA-256. Unfortunately, I have not been able to get it to produce compliant SHA-256. I think that something is endian-swapped in their hardware. I have tried a bunch of permutations, but I have not found the one that gives the expected results yet. Blake2b + EdDSA is fast enough and fits in flash/RAM. For now, I will go with Monocypher's crypto_check and not use the SHA-256 hardware.

Good point on the ARMv7 instruction set, @LoupVaillant. I took a look at the assembly, and GCC does emit "UMULL" instructions, which compute u32 x u32 -> u64. FE_CARRY does use i64 x i64 -> i64, but GCC realizes that c * (1 << x) is the same as c << x, so it uses shifts rather than multiplications.

I could definitely be missing something, but I don't see any place in the code that emits 64 x 64 -> 64 or 128.