Advance warning for test certificate expiry: design

Mbed-TLS / mbedtls

An open source, portable, easy to use, readable and flexible TLS library, and reference implementation of the PSA Cryptography API. Releases are on a varying cadence, typically around 3 - 6 months between releases.

https://www.trustedfirmware.org/projects/mbed-tls/

Other

5.05k stars 2.51k forks source link

Advance warning for test certificate expiry: design #7018

Open gilles-peskine-arm opened 1 year ago

gilles-peskine-arm commented 1 year ago

The X.509 and TLS tests in Mbed TLS use X.509 data (certificates, CRL, CSR, …) with a validity date. Such data is only valid within a certain validity interval. For example:

$ openssl x509 -text -in tests/data_files/test-int-ca3.crt | grep -A2 Validity
        Validity
            Not Before: Sep  1 14:08:43 2015 GMT
            Not After : Aug 29 14:08:43 2025 GMT

So any test involving test-int-ca3.crt will fail if run after August 2025.

If we do nothing, all will be fine until 29 August 2025, and then all CI jobs will fail.

At least twice in the past (2019, 2023), we were only saved from being crippled by sudden failing tests thanks to the SuSE maintainer who gave us advance warning.

The goal of this issue is to set up an automated job that lets us know a few months in advance before test data becomes invalid. This cannot just be a failure of the regular CI, since we'd be back to the problem of having a failure on a certain date. This can be a separate CI job where we're ok if it keeps failing for a while.

Two possible approaches:

Use the certificate audit script and yell when an expiry date is approaching. Upside: quick. Downside: may miss some malformed files (designed to exercise error handling in parsers). Downside: we would need to filter out test cases that are deliberately passing expired certificates.
Run a subset of the CI under faketime -f '+6 months' (a subset that exercises all the test data is enough, we don't need much more than the full config and one without MBEDTLS_USE_PSA_CRYPTO). Upside: exactly implements the goal “the CI must pass in 6 months”, does not rely on the accuracy of the audit script. Downside: uses resources.

Note that this not just an architectural problem, it's also a process problem. This is going to remain dormant for years. We need to ensure that when it becomes relevant, 1. the job will yell and 2. we will notice.

yuhaoth commented 1 year ago

Some ideas about that.

Need a pr-head component to check if the PR breaks rulers of certificate files.
Need a pr-merge component to check if there are invalid certificates.
audit script should provide the status of cert/crl at given time.
Provide lists for future and past certificate/crl to identify if the file is valid.

I am thinking about the rulers of certificate files, it is not clearly now.

yuhaoth commented 1 year ago

[ ] Need a nightly job to check if some tests will fail within next months(next 6 months?)

yuhaoth commented 1 year ago

[ ] Remove files are not used by tests.

yuhaoth commented 1 year ago

Ideas about this topic

Add nightly build faketime CI jobs. The Jenkins should create jobs like below. The jobs should run components with faketime -f '+3m'. The jobs should be run once every week.
```
faketime_jobs=output_of("tests/scripts/all.sh --list-faketime-jobs")
for job in faketime_jobs:
 start_jenkins_job "ENABLE_FAKETIME=1 tests/scripts/all.sh {job}"
```
Add component-check-data-files
1. Both pr-merge and pr-head should run it.
2. data_files/* should not be update after run make -C tests/data_files
3. No untracked-files in data_files available after make clean
4. Others?
Add component-faketime-full-gcc , does it make sense ?

Other issue:

Cleanup Makefile:
1. Add missing commands.
2. Replace Openssl command with MBEDTL_CERT_WRITE if possible, does it make sense?
Organize the files: --remove intermediate files from git repo--. move openssl config file and other config files into sub folder.

gilles-peskine-arm commented 1 year ago

Add nightly build faketime CI jobs.

Failures of the faketime job are not critical. So I think it should be completely separate from the normal nightly job. If the normal nightly job fails, we fix it ASAP.

Currently faketime is the only non-critical thing we want to test. But I think there will be others in the future. For example, test with pre-release versions of compilers.

faketime_jobs=output_of("tests/scripts/all.sh --list-faketime-jobs")

Minor: I think this should be a separate script rather than all.sh. We're going to split all.sh into an engine part and a job list part anyway, for the repository split. Other scripts can share the job list and the engine, such as the faketime-test script.

Add component-check-data-files

data_files/* should not be update after run make -C tests/data_files

That would be good, but I'm not sure we're ready for it. I'm worried that it might be unreliable: make uses file timestamps to decide what to run, and depending on the exact order in which files are checked out and the granularity of the filesystem's timestamps, the checked out files may end up being in a different order on different runs.

Add component-faketime-full-gcc , does it make sense ?

We can't do that in all.sh because the CI has a very useful property: it checks that in the PR job, every component of all.sh runs at least once. We don't want to run faketime-specific jobs.

Replace Openssl command with MBEDTL_CERT_WRITE if possible, does it make sense?

On the contrary, I prefer to use non-mbedtls tools to generate test data. This way we know that mbedtls interoperates with openssl (or other). Whenever we use mbedtls itself to generate test data, we lose this knowledge.

remove intermediate files from git repo

I don't think that's useful. Intermediate files can be useful to generate more derived files. At least all files that aren't generated reproducibly (independent of randomness, time and tool version) need to be checked in.

yuhaoth commented 1 year ago

Failures of the faketime job are not critical. So I think it should be completely separate from the normal nightly job. If the normal nightly job fails, we fix it ASAP.

Agree. It should be completely separate.

Minor: I think this should be a separate script rather than all.sh. We're going to split all.sh into an engine part and a job list part anyway, for the repository split. Other scripts can share the job list and the engine, such as the faketime-test script.

Is there any issue for that? I just want to share job list with --list-faketime-jobs. I prefer faketime-test script also.

That would be good, but I'm not sure we're ready for it. I'm worried that it might be unreliable: make uses file timestamps to decide what to run, and depending on the exact order in which files are checked out and the granularity of the filesystem's timestamps, the checked out files may end up being in a different order on different runs.

I am trying to make sure this point. My ideas is set the timestamp to git commit date before make. That asks developer commit files in order. Until now, it looks work.

On the contrary, I prefer to use non-mbedtls tools to generate test data. This way we know that mbedtls interoperates with openssl (or other). Whenever we use mbedtls itself to generate test data, we lose this knowledge.

Yes, that's important, I miss that. How about replace mbedtls with Openssl ?

yuhaoth commented 1 year ago

Update according to @gilles-peskine-arm 's feedback

~~Add nightly build faketime CI jobs.~~ Add separate faketime build job. The Jenkins should create jobs like below. The jobs should run components with faketime -f '+3m'. The jobs should be run once every week.
```
faketime_jobs=output_of("tests/scripts/faketime-all.sh --list")
for job in faketime_jobs:
 start_jenkins_job "ENABLE_FAKETIME=1 tests/scripts/faketime-all.sh {job}"
```
Add component-check-data-files
1. Both pr-merge and pr-head should run it.
2. data_files/* should not be update after run make -C tests/data_files.
  1. Before make, we should set the file timestamp to commit date.
3. No untracked-files in data_files available after make clean
4. Others?
~~Add component-faketime-full-gcc , does it make sense ?~~

Other issue:

Cleanup Makefile:
1. Add missing commands.
2. ~~Replace Openssl command with MBEDTL_CERT_WRITE if possible, does it make sense?~~
3. Replace MBEDTL_CERT_WRITE with Openssl command if possible.
Organize the files: ~~remove intermediate files from git repo~~. move openssl config file and other config files into sub folder.

gilles-peskine-arm commented 1 year ago

We're going to split all.sh into an engine part and a job list part anyway, for the repository split. Other scripts can share the job list and the engine, such as the faketime-test script.

Is there any issue for that?

Not yet. It's part of the plans around the repository split: scripts that are common to both repositories will be split into a common engine part + a “configuration” in each repository.

How about replace mbedtls with Openssl ?

I think that's a good long-term objective, but not realistic in the short term. Also, there are files for which we don't care because we have similar content in the unit tests. I think generally for keys we don't care, but for some more complex content like non-self-signed certificates it's good to have a wide range of data produced by OpenSSL, because OpenSSL produces most certificates in the real world.

About the updated plan: that sounds good. There are different priorities though. My top priority is setting up the faketime job, that's part of our nice-to-have objective for this quarter. Then I think adding missing commands to the makefile is the most important thing. I'd like to get to a point where we can delete everything in tests/data_files except Makefile and we run make and there are no missing files — the files won't be the same as the repository (different time, different randomness), but should be equivalent. But that's not an objective for this quarter, and we have to balance it with the many other bits of technical debt.

yuhaoth commented 1 year ago

I just create https://github.com/Mbed-TLS/mbedtls-test/issues/106, #7729, #7730, #7731, #7732 and label them. I am not sure if the label and priority are right, please correct me.

About point 1 in #7730, I am trying to confirm if we can implement it. About https://github.com/Mbed-TLS/mbedtls-test/issues/106, I need help, I do not know Jenkins scripts

I'd like to get to a point where we can delete everything in tests/data_files except Makefile and we run make and there are no missing files

I put this point in #7732 . I think not all files should be deleted, some files are input files. We should category them before do that.