Organize public-key/X.509 test data for ease of selection and maintenance

The directory tests/data_files contains various kinds of private keys, public keys, certificates, certificate revocation lists, etc. There is no consistent naming scheme, so it's hard to find test data with certain characteristics.

For example: “I want a secp256r1 private key, and an X.509 certificate for its public key signed with an RSA key”. This turns out to be server7.key and server7.crt (the certificate for server5.key seems to be signed only with ECDSA). It would be easier to find these files if they were called something like secp256r1-1.key and secp256r1.rsa-ca.crt.pem.

Many of the files are negative test cases. This is often, but not always, conveyed in the file name. Even when it is, the names aren't consistent: an expired certificate might have exp or expired somewhere in its name, or neither.

Goal of this issue: to discuss naming conventions. I do not propose to resolve this by mass-renaming. Rather this is one central place to discuss ideas. I propose to discuss rules, then implement them gradually.

I'm rating this task as “size-l” because it should be implemented gradually, not in a single go, although I don't think it necessarily adds up to more than 2 weeks of cumulative work.

When implementing a rule:

- Document that rule in Readme-x509.txt.
- Rename the affected files to convey the rule.
- The change does not have to be exhaustive. For example, if the rule is “syntactically invalid files start with unparseable-”, we don't have to find all unparseable files; we can rename unparseable files when we find them.

We should have some way of identifying files that haven't been sorted out yet. Maybe sort all the files into subdirectories?
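One way to identify files that haven't been sorted out yet, without moving anything, would be a small script that lists every file name matching none of the documented rules. This is only a sketch: the two patterns below are made-up stand-ins for whatever rules end up in Readme-x509.txt, and `unsorted_files` is a hypothetical helper name.

```python
import re

# Hypothetical rules: each documented convention becomes one regex over the
# file name. These two patterns are placeholders, not agreed conventions.
RULES = [
    re.compile(r"^unparseable-"),
    re.compile(r"^(rsa\d+|secp\d+r1)(-[a-z0-9-]+)?\.(key|crt)\.(pem|der)$"),
]


def unsorted_files(names):
    """Return the names that match none of the documented rules, sorted."""
    return sorted(n for n in names if not any(r.search(n) for r in RULES))
```

Calling it on `os.listdir("tests/data_files")` would give the list of files still waiting for a name that follows a rule.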

I know that what I'm proposing is a lot of work, but maybe we should have a database of some kind? I'll focus on certificates in my examples below. Many features come together in a certificate, and what decides its name seems to be its purpose or usage, not an exhaustive description of it. There are certificates that use a certain set of features simply because they were copied from a different test. Treating all of them equally in terms of naming, that is, having one rule to rule them all might complicate things, as we'll convey some (probably) unimportant information, and introduce more clutter.

Given the example of "unparseable" - let's say I sort the certificates by name to find all that are related to the SubjectAltName extension: some will start with "unparseable", others won't.

Sorting into directories seems reasonable for families of tests, since it will help us find everything quicker. There are, however, a lot of certs that are not restricted to one type of test.

Having a database, where we would have things in key=value pairs seems attractive to me, with categories like public key algorithm, issuer, signature algorithm, extension presence, expiry date, etc. We could then filter by the field that's particularly interesting. Any certificate with a given extension? All certificates using a certain signature algorithm? Sure.
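To make the filtering idea concrete, here is a minimal sketch assuming the database is just a mapping from file name to key=value fields. The field names (`pubkey_alg`, `sig_alg`, `has_san`) and the file names are invented for illustration and do not describe real files in tests/data_files.

```python
# Hypothetical records: file name -> key=value fields.
RECORDS = {
    "example-a.crt": {"pubkey_alg": "secp256r1", "sig_alg": "rsa-sha256", "has_san": False},
    "example-b.crt": {"pubkey_alg": "secp256r1", "sig_alg": "ecdsa-sha256", "has_san": True},
    "example-c.crt": {"pubkey_alg": "rsa2048", "sig_alg": "rsa-sha256", "has_san": True},
}


def select(records, **criteria):
    """Return the sorted names of records whose fields equal all criteria."""
    return [
        name
        for name, fields in sorted(records.items())
        if all(fields.get(key) == value for key, value in criteria.items())
    ]
```

“Any certificate with a given extension?” then becomes `select(RECORDS, has_san=True)`, and combining criteria narrows the result further.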

The way to achieve it might be hard, but what comes to my mind is: automate the data collection and have such a database updated by a script that scans a directory (using Mbed TLS / OpenSSL), gets all the data it can from each cert, and then applies any additional data or overrides from a handcrafted file (containing things like the purpose, i.e. where the file is used, an additional description, or basic data for unparseable certificates, i.e. what's malformed). Later on we can have a GUI or command-line tool to read the database and handle filtering.
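The overlay step could be as small as the sketch below. It assumes the scan has already produced a mapping from file name to fields (the scanning itself, via Mbed TLS or OpenSSL, is left out), and that the handcrafted file has been loaded into a mapping of the same shape. `merge_metadata` and the field names are hypothetical.

```python
def merge_metadata(scanned, handcrafted):
    """Overlay handcrafted fields on top of automatically scanned fields.

    Handcrafted values win on conflict. Files that the scan could not parse
    at all can still appear, described only by their handcrafted entry.
    """
    # Copy each record so the scan results themselves are left untouched.
    merged = {name: dict(fields) for name, fields in scanned.items()}
    for name, overrides in handcrafted.items():
        merged.setdefault(name, {}).update(overrides)
    return merged
```

For instance, a handcrafted entry such as `{"bad.crt": {"malformed": "truncated"}}` would show up in the merged result even though the scan found nothing for that file.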

With such a database, a naming convention would be nice to have, but not crucial to solving the task.

As for the naming convention itself: I would opt for directories with certificates / keys dedicated to a single purpose where possible. We could chip away at them slowly, with naming inside directories reflecting the purpose with regard to the test first, then all attributes (see below). Maybe we could have a readme in each directory briefly describing why the files were created and where they are used? For multi-purpose certificates / keys, a descriptive name containing all attributes would be nice. Initially I wanted to propose putting only the relevant attributes in the names, but in fact, if someone wants to find all certificates that use a given feature, listing all attributes is the way to go.

> maybe we should have a database

We would need to solve the human-and-technical problem of keeping it up to date. We haven't done a good job of that with the readme file and even with the makefile.

> Automate the data collection and have such a database updated by a script that scans a directory (using Mbed TLS / OpenSSL), gets all the data it can from each cert, and then applies any additional data or overrides from a handcrafted file (containing things like the purpose, i.e. where the file is used, an additional description, or basic data for unparseable certificates, i.e. what's malformed).

A lot of files have something wrong with them, or something unusual which the script couldn't recognize. So we'd be relying on that handcrafted file — how do we keep it up to date?

If we manage to at least ensure that all new files, or newly regenerated files, are in the makefile, then I think the makefile would already have this information in a form that isn't easily filtered but is reasonably easy to search by a human.

> There are certificates that use a certain set of features simply because they were copied from a different test. Treating all of them equally in terms of naming, that is, having one rule to rule them all might complicate things, as we'll convey some (probably) unimportant information, and introduce more clutter.

I agree that there's a lot of potential information. But we don't need a fully systematic naming convention. Even something basic would be a major improvement, like BASE.EXT or BASE-CHARACTERISTIC.EXT, where BASE has a standard form (e.g. indicating key types: rsa2048 for a key, rsa2048-secp256r1 for a certificate), CHARACTERISTIC describes non-nominal data, and EXT gives the format (I often want to know whether a file is PEM or binary, e.g. because I want to use openssl to look at it).
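As a sketch of how such names could be produced consistently, assuming that BASE is simply the key types joined by hyphens and that EXT is passed in as-is. The helper name `data_file_name` and the exact spelling of the parts are illustrative, not an agreed convention.

```python
def data_file_name(key_types, ext, characteristic=None):
    """Build BASE.EXT or BASE-CHARACTERISTIC.EXT from its parts.

    key_types: one key type for a key file, two for a certificate.
    characteristic: optional marker for non-nominal data, e.g. "expired".
    """
    base = "-".join(key_types)
    stem = base if characteristic is None else f"{base}-{characteristic}"
    return f"{stem}.{ext}"
```

With this, `data_file_name(["rsa2048"], "pem")` gives `rsa2048.pem`, and adding `characteristic="expired"` to a two-key certificate name appends `-expired` before the extension.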

> how do we keep it up to date?

The same way we know that generated files are up to date: with a script that checks whether running check_generated_certs_database.py would produce a different database. Anyone submitting an updated database needs to provide a description for the newly added certificate.
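Roughly, the check could compare the committed database against a freshly regenerated one and also insist on a description for every entry. The sketch below assumes both are mappings from file name to fields and that the handcrafted description lives in a `description` field; `check_database` is a hypothetical name, not an existing script.

```python
def check_database(committed, regenerated):
    """Return a list of problems; an empty list means the database is current."""
    problems = []
    for name in sorted(set(committed) | set(regenerated)):
        if name not in committed:
            problems.append(f"{name}: not in database")
            continue
        if name not in regenerated:
            problems.append(f"{name}: in database but not found by the scan")
            continue
        # Every automatically collected field must match the committed value.
        for key, value in sorted(regenerated[name].items()):
            if committed[name].get(key) != value:
                problems.append(f"{name}: stale field {key}")
        # The description is handcrafted, so it can only be checked for presence.
        if not committed[name].get("description"):
            problems.append(f"{name}: missing description")
    return problems
```

A CI job could fail whenever the returned list is non-empty, which is what would force a description to be supplied along with each new certificate.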

> BASE-CHARACTERISTIC.EXT

What I would add to this is the purpose part, which is the most important part right now. Inside the directories of test families, PURPOSE-BASE-CHARACTERISTIC.EXT is both readable and grep-able. For a multipurpose certificate, BASE-CHARACTERISTIC.EXT is fine. The .ext part is something we're already enforcing when working on OPC-UA tasks, and we have planned a task for cleaning up past certs related to the SubjectAltName.

Rather than maintaining a database file, I think we can just have this parsing script available and give it options to search, or make it print a greppable output. That way we get searchability without having to keep yet another file up to date. And we don't need to worry about the output format changing slightly.
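For the greppable output, one line per file with sorted key=value pairs would probably be enough. A minimal sketch, assuming the parsing script already has the fields for each file as a dict (`format_line` and the field names are hypothetical):

```python
def format_line(name, fields):
    """Format one file's fields as a single line that is easy to grep."""
    # Sorting the keys keeps the output stable from one run to the next.
    pairs = " ".join(f"{key}={fields[key]}" for key in sorted(fields))
    return f"{name}: {pairs}"
```

Piping the script's output through something like `grep 'sig_alg=ecdsa-sha256'` would then answer most of the “which files have X” questions without any stored database.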


Organize public-key/X.509 test data for ease of selection and maintenance #7515