PK: study: support SpecifiedECDomain parsing without `ECP_LIGHT`

mpg commented 1 year ago

We should be able to parse SpecifiedECDomain (option MBEDTLS_PK_PARSE_EC_EXTENDED) work without ECP_LIGTH.

The general strategy would be to have a table with the DER encoding of the parameters for each curve, and match against that table instead of parsing.

There are a few questions though:

Is the encoding always DER?
What about the version field?
Optional fields cofactor and hash (probably not a problem since they're at the end)?
Optional field seed inside Curve (more of a problem, right in the middle, though it's mandatory with v2 and v3)?

Probably a hybrid approach would work, where we only parse the outer sequence and the version field, then opaquely match the rest according to the version field?

(I'm labeling "needs design approval" to reflect the fact that more design discussion is needed before starting the implementation.)

The goal of this task is to define a strategy and create a series of estimated task implementing it.

DemiMarie commented 1 year ago

The same table-based approach can and should be used to parse AlgorithmId in general. Support for SpecifiedECDomain should require opt-in at runtime.

Is the encoding always DER?

In a well-formed certificate, yes.

What about the version field?

No idea as I have never worked with such certificates.

Optional fields cofactor and hash (probably not a problem since they're at the end)?

Will these always be determined by the other fields?

Optional field seed inside Curve (more of a problem, right in the middle, though it's mandatory with v2 and v3)?

Is this used in practice? What is the correct value?

@smlu @psvs @ronf: what values have you seen? (Context: https://github.com/pyca/cryptography/issues/5659)

smlu commented 1 year ago

Here are 2 test vectors of DER encoded SpecifiedECDomain objects for P-256 and brainpoolP384r1:

P-256
-----
3081e0020101302c06072a8648ce3d0101022100ffffffff00000001000000000000000000000000ffffffffffffffffffffffff30440420ffffffff00000001000000000000000000000000fffffffffffffffffffffffc04205ac635d8aa3a93e7b3ebbd55769886bc651d06b0cc53b0f63bce3c3e27d2604b0441046b17d1f2e12c4247f8bce6e563a440f277037d812deb33a0f4a13945d898c2964fe342e2fe1a7f9b8ee7eb4a7c0f9e162bce33576b315ececbb6406837bf51f5022100ffffffff00000000ffffffffffffffffbce6faada7179e84f3b9cac2fc632551020101

brainpoolP384r1
---------------
30820140020101303c06072a8648ce3d01010231008cb91e82a3386d280f5d6f7e50e641df152f7109ed5456b412b1da197fb71123acd3a729901d1a71874700133107ec53306404307bc382c63d8c150c3c72080ace05afa0c2bea28e4fb22787139165efba91f90f8aa5814a503ad4eb04a8c7dd22ce2826043004a8c7dd22ce28268b39b55416f0447c2fb77de107dcd2a62e880ea53eeb62d57cb4390295dbc9943ab78696fa504c110461041d1c64f068cf45ffa2a63a81b7c13f6b8847a3e77ef14fe3db7fcafe0cbd10e8e826e03436d646aaef87b2e247d4af1e8abe1d7520f9c2a45cb1eb8e95cfd55262b70b29feec5864e19c054ff99129280e4646217791811142820341263c53150231008cb91e82a3386d280f5d6f7e50e641df152f7109ed5456b31f166e6cac0425a7cf3ab6af6b7fc3103b883202e9046565020101

ronf commented 1 year ago

The AsyncSSH code for this parses the algorithm parameters looking to see if the value is an OID and if so extracts the curve name and returns that if it is one of the curves supported in SSH. If the parameters are not an OID, it requires they be in the following format:

A sequence of 5 or more items
The first item (version) must be the integer 1
- Versions 2 and 3 are defined, but so far I haven't had a need to add support for those
The second item (fieldID) must be a sequence of exactly two sub-items
- The first sub-item must be the OID for prime fields (1.2.840.10045.1.1)
The third item (curve) must be a sequence of 2 or more items
- The first two sub-items (a and b) of this must be octet strings (FieldElements)
- Any additional sub-items are ignored
The fourth item (base) must be an octet string (in this case, an ECPoint)
The fifth item (order) must be an integer
Any additional top-level items are ignored

If this is the case, AsyncSSH extracts the following:

p from the second sub-item of fieldID
a and b by converting the byte strings in curve to integers
point from base
n from order

It then looks up the tuple (p, a, b, point, n) in the list of known curves supported by SSH and uses that curve name if a match is found.

Note that optional fields in the middle aren't really a problem, as each sequence and sub-sequence has length fields to tell you when they end. In my case, I have a generic ASN.1 decoder that will create a Python representation of the decoded data that automatically take that into account when parsing, so I get a nested set of tuples with everything at the right level.

DemiMarie commented 1 year ago

What is the seed field for?

ronf commented 1 year ago

I’m not a cryptographer, but from what I’ve read the optional “seed” in the EC domain parameters relates to curves that were generated at random in a verifiable way. I think the idea is that you can use the seed to regenerate the set of random values used to build the curve parameters, thus verifying that the parameters were produced from a pseudo-random sequence and not manually crafted in some way that might compromise the strength of the curve.

The X9.63 spec discusses how to generate curve “random” verifiable curve parameters in Annex A.3.3 (“Selecting an Elliptic Curve Verifiably at Random”.

mpg commented 1 year ago

Thank you all for your inputs!

@ronf This is pretty much what we're currently doing as well. (Except we accept any version number up to 3, good to know that apparently only 1 is used.)

But we're going to have to change a bit how we do things (at least in some configurations) and in this context @DemiMarie suggested that instead of actually parsing the SpecifiedECDomain structure, we could just have a table with "the" SpecifiedECDomain encoding of each curve and look up the structure in that table. If it worked, that would be a nice simplification to the code, but that works best if each curve has a canonical SpecifiedECDomain encoding (or a low number of possible encodings or prefixes), which unfortunately doesn't seem to be the case here.

Note that optional fields in the middle aren't really a problem, as each sequence and sub-sequence has length fields to tell you when they end.

Yes, they're not a problem when actually parsing the structure, but they would be a problem if we tried to replace parsing with a simple table lookup. It seems to be that optional fields at the end are less of a problem for a lookup based strategy, as we could just loop up the prefix before them.

So, we have the following sources of variation:

version field (even though it looks like in practice 1 is enough, so far we're accepting 2 and 3 so I'm a bit reluctant to change that);
optional fields cofactor and hash at the end of the top-level sequence (we're already ignoring them);
optional field seed in the middle (again, already ignoring it except for validating it's an octet string if present);
the ECpoint might be encoded in compressed or uncompressed form (which forgot yesterday but remembered when looking at our code).

I think that's a bit too many sources of variation for a purely lookup-base strategy, but perhaps a hybrid strategy could work:

parse the top-level sequence tag
parse the first item (version)
match the 2nd item (field) against a table -> unless I'm mistaken at this point we already know the curve
parse the 3rd item's sequence but not its content, just verify that a prefix of it matches the expected value based on the curve guessed previously then skip to its end
parse the 4th item's octet string and match the ECpoint against the expected values for X and (sign of) Y.
match the 5th item against the expected value

This would reduce the amount of parsing, and also I think RAM usage compared to our current strategy (where we allocate for P, A, B, G.X, G.Y, N and only free them at the end).

Also, circling back to what initially made me re-examine this part of the code - that is, the desire to have this work without ECP_LIGHT - the various tables used (one for P, one for A + B, one for G, one for N) could be stored in PK_PARSE_C when ECP_LIGHT is not defined, and derived from the data in ecp_curves.c when ECP_LIGHT is present.

ronf commented 1 year ago

If you don't mind hard-coding DER octet strings in your source, you could definitely keep tables of DER-encoded versions of most of the fields, rather than the raw integer or point values that I keep in AsyncSSH. You could even possibly keep the raw values but encode DER versions of them at startup, to make the lookups simpler later. In the case where optional parameters were present, you could match on just a portion of the value, but you would have to skip over the length field in the SEQUENCE encoding before starting your comparison, as that length would change depending on the length of the optional fields. That could be a bit messy to do without reimplementing a small amount of the ASN parsing, especially if you wanted to make sure the sequence length was long enough to contain the portion you were matching on.

As for space, that hasn't been a major consideration in AsyncSSH, but the space to store the table values in DER form or raw form is about the same and I think you should be able to discard everything but the resolved name once a match is found. So, space usage should be minimal.

mpg commented 1 year ago

The goal of this task is to define a strategy and create a series of estimated task implementing it.

See #7779 and #7789, so, closing this issue.

Mbed-TLS / mbedtls

PK: study: support SpecifiedECDomain parsing without `ECP_LIGHT` #7628