Mbed-TLS / mbedtls


driver-only ECDSA: fix testing disparities in ecp, random, se_driver_hal #6856

Closed by mpg 1 year ago

mpg commented 1 year ago

Context: https://github.com/Mbed-TLS/mbedtls/issues/6839.

This task is to finish off TL-ecdsa.PSA by achieving testing parity in the "basic" test suites (that is, everything except PK, X.509 and TLS). Ideally, once this is done, the task analyze_driver_vs_reference_ecdsa in analyze_outcomes.py will no longer ignore test suites other than PK, X.509 and TLS.
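For orientation, the driver-vs-reference comparison in analyze_outcomes.py is configured per task with a list of test suites whose disparities are ignored, roughly along these lines (a simplified sketch, not the script's exact contents; component and suite names are illustrative):

```python
# Simplified sketch of a comparison task in analyze_outcomes.py
# (component and suite names are illustrative, not the real entries).
TASKS = {
    'analyze_driver_vs_reference_ecdsa': {
        # Pair of all.sh components whose test outcomes are compared.
        'component_ref': 'test_psa_crypto_config_reference_ecdsa_use_psa',
        'component_driver': 'test_psa_crypto_config_accel_ecdsa_use_psa',
        # Whole test suites whose disparities are currently accepted.
        # The goal of this issue is to shrink this list to PK, X.509 and
        # TLS (plus the legitimately skipped test_suite_ecdsa).
        'ignored_suites': ['ecdsa', 'ecp', 'random', 'pk', 'x509parse', 'ssl'],
    },
}
```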

Random ideas about mbedtls_psa_get_random_ecdsa_sign:

Depends on: https://github.com/Mbed-TLS/mbedtls/pull/6855

gilles-peskine-arm commented 1 year ago

The test serves a purpose: checking that mbedtls_psa_get_random and MBEDTLS_PSA_RANDOM_STATE can be used with legacy APIs that accept f_rng, p_rng parameters. So, we may not want to just remove that test.

Indeed.

We could also just accept the testing disparity as harmless. That would mean leaving the ignore entry for test_suite_random in analyze_outcomes.py, with a comment explaining why it's harmless.

This has my preference. The cost is minimal. Unless we really expect to have no other disparities, which would save us from supporting an exception mechanism? Is it hard to support exceptions?

We could instead move the test to test_suite_ecdsa, which is legitimately ignored already. This is not really satisfying, as the test logically belongs to test_suite_random.

Indeed. It's testing mbedtls_psa_get_random, and the involvement of ECDSA is incidental.

We could also use something other than ECDSA in the role of "legacy API that takes f_rng, p_rng".

That would be fine. It should be something where all-bits-zero (a plausible failure mode) causes a failure. This is the case for both ECC blinding and randomized ECDSA k generation, and also for DHM blinding; it is also true for RSA key generation, but that is very slow. I'm not sure about RSA blinding, and it's not the case for the RSA encryption nonce or RSA padding.

mpg commented 1 year ago

We could also just accept the testing disparity as harmless. That would mean leaving the ignore entry for test_suite_random in analyze_outcomes.py, with a comment explaining why it's harmless.

This has my preference. The cost is minimal. Unless we really expect to have no other disparities, which would save us from supporting an exception mechanism? Is it hard to support exceptions?

We already have support for test-suite-level exceptions, and we need it because we always want to ignore the test suite of the thing that's being accelerated: for example, when ECDSA is accelerated we clearly expect the tests in test_suite_ecdsa to be skipped. (Support for exceptions is also super useful in the development process, because we can start by adding the tests we want, with more exceptions than warranted, then gradually work towards removing them; see #6855 and its follow-ups. You could call it test-driven task breakdown.)

What we don't have is test-case-level exceptions, where we could say it's OK for this case in particular to be skipped. But here I think it's OK to have an exception for the full test_suite_random anyway.

That would be fine. It should be something where all-bits-zero (a plausible failure mode) causes a failure. This is the case for both ECC blinding and randomized ECDSA k generation, and also for DHM blinding; it is also true for RSA key generation, but that is very slow. I'm not sure about RSA blinding, and it's not the case for the RSA encryption nonce or RSA padding.

Good point about RSA being slow (as well as DHM parameter generation).

gilles-peskine-arm commented 1 year ago

But here I think it's OK to have an exception for the full test_suite_random anyway.

I disagree. That would be problematic when we start working on PSA random driver support.

mpg commented 1 year ago

Should have made a complete sentence: I think it's OK to have an exception for the full test_suite_random in driver-only ECC builds anyway. Exceptions are per pair of all.sh components being compared. Obviously in builds focused on PSA random drivers that wouldn't be OK.

Anyway, I've been thinking, and at some point we'll probably want to expand exception support to target specific test cases anyway; I don't think it would be that hard. (For example, at some point in PK parse we'll want to specifically ignore test cases that rely on compressed points, but definitely not ignore the full suite.) So, we might as well do that extension now and ignore only that specific test case here too.
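A test-case-level exception could then take roughly this shape (a hypothetical sketch; the field name and the test description are illustrative):

```python
# Hypothetical per-task extension: besides ignoring whole suites, accept
# a disparity for individual test cases, identified by their description
# string (the description shown here is illustrative).
IGNORED_TESTS = {
    'test_suite_random': [
        # Uses ECDSA only as a convenient consumer of f_rng/p_rng, so the
        # disparity is harmless in driver-only ECDSA builds.
        'PSA classic wrapper: ECDSA signature',
    ],
}
```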

valeriosetti commented 1 year ago

I'm not an expert at all, but since I'm currently trying to address this issue I'll just give my 2 cents.

At first I was more OK with this solution:

We could also use something other than ECDSA in the role of "legacy API that takes f_rng, p_rng". [...] But we could use RSA or DHM instead.

However, I see that there are concerns about it, as pointed out above.

As a consequence, my question is about this sentence:

I've been thinking, and at some point we'll probably want to expand exception support to target specific test cases anyway; I don't think it would be that hard

So, if I understood correctly, this means that I should extend analyze_outcomes.py in order to allow exceptions for specific tests inside a certain suite. Am I right? If this is the case, then can I do this directly in this PR to make things faster?

valeriosetti commented 1 year ago

A tentative change to analyze_outcomes.py was proposed in the connected PR at this commit.

gilles-peskine-arm commented 1 year ago

So, if I understood correctly, this means that I should extend analyze_outcomes.py in order to allow exceptions for specific tests inside a certain suite. Am I right?

Yes.

can I do this directly in this PR to make things faster?

It'll only make things faster if everyone is happy with the design.

I don't think including test case descriptions in the script will work well. Someone might tweak the description (e.g. fix a typo, or make it more distinctive when adding a similar test case), and there's basically no way to know the test case is referenced elsewhere. I'm not even overjoyed at hard-coding test suite names.

I'd prefer a mechanism where the test suite/case itself has an annotation that says THIS_TEST_IS_EXCLUDED_FROM_DRIVER_TEST_COVERAGE_PARITY. How about adding this as a “fake” dependency to depends_on declarations? This requires no change to the test framework and has a familiar interface. The missing piece is for the Python code to understand test suite dependencies and test case dependencies, which can be done by leveraging some of the parsing that's already implemented in generate_test_cases.py.
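As a rough illustration of that missing piece, a first approximation could simply scan the .data files for the proposed marker (a standalone sketch; a real implementation would rather reuse the parsing that the test framework already provides, as suggested above):

```python
# Sketch: collect the descriptions of test cases whose depends_on line
# carries the proposed marker symbol. A .data file is, simplified, a
# sequence of paragraphs: a description line, an optional
# 'depends_on:SYM1:SYM2:...' line, then the test function line.
MARKER = 'THIS_TEST_IS_EXCLUDED_FROM_DRIVER_TEST_COVERAGE_PARITY'

def excluded_cases(data_file_path):
    """Return the descriptions of test cases annotated with MARKER."""
    excluded = []
    paragraph = []
    with open(data_file_path) as data_file:
        for line in data_file:
            line = line.strip()
            if line.startswith('#'):
                continue            # comment line
            if not line:
                paragraph = []      # blank line separates test cases
                continue
            paragraph.append(line)  # paragraph[0] is the description
            if line.startswith('depends_on:') and MARKER in line.split(':'):
                excluded.append(paragraph[0])
    return excluded
```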

Note: to smooth development, we can have the simple mechanism with a hard-coded name in this PR, and then make a follow-up with a cleaner mechanism.

valeriosetti commented 1 year ago

It'll only make things faster if everyone is happy with the design.

Yes, of course ;). I didn't want to sound disrespectful here; I just wanted to say that I can take care of this directly in this PR without opening another one just to fix this part.

Someone might tweak the description (e.g. fix a typo, or make it more distinctive when adding a similar test case), and there's basically no way to know the test case is referenced elsewhere

You're right, the proposed solution is not super safe.

How about adding this as a “fake” dependency to depends_on declarations?

I agree, this looks like a better approach. I'll give it a try!

mpg commented 1 year ago

@gilles-peskine-arm

I don't think including test case descriptions in the script will work well. Someone might tweak the description (e.g. fix a typo, or make it more distinctive when adding a similar test case), and there's basically no way to know the test case is referenced elsewhere. I'm not even overjoyed at hard-coding test suite names.

As a matter of principle I'm not a huge fan of referring to test cases by their description; as you say, someone might tweak it. But then what would happen? analyze_outcomes.py will not see the new description in its ignore list, so it will not ignore it, will notice a disparity in coverage and exit non-zero, and the CI will fail. So we're bound to notice and fix it.

So, while referring to things by name is not great in general, I think the special case of an ignore list in a script run in the CI is actually quite OK. The worst that can happen is we waste a push-wait-for-CI-rework cycle on this. I don't think we tune descriptions often enough that the amount of cycles wasted would be unacceptable.

I'd prefer a mechanism where the test suite/case itself has an annotation that says THIS_TEST_IS_EXCLUDED_FROM_DRIVER_TEST_COVERAGE_PARITY.

I see two problems with this. First, the annotation needs to be more precise: we don't want to exclude this test from parity checking when testing accelerated hashes, for example. This is of course trivial to fix by appending _FOR_ECDSA (or similar) to the name used.

The second problem is related, and involves a plan I had in my head but hadn't written down so far: in the long run I think we can save CI resources by having a single "reference" component that would be common to all "accelerated" components. Currently this is not possible: the reference component for accelerated hashes disables the entropy module and uses an external RNG, while the ECDSA reference component disables ECDSA-based key exchanges. But once hashes are improved and ECDSA has progressed enough, both should be able to use the same reference component, which would be just the full config plus drivers enabled. But this doesn't work as well if we skip test cases in the reference component. Though that's of course not a problem now, so we could use a simpler approach now and worry about that later.

@valeriosetti quoting from https://github.com/Mbed-TLS/mbedtls/pull/6946#issuecomment-1399996999

instead of adding a new possible option for depends_on in the .data file, which didn't look very good to me, I preferred to add a new "line" to the .data file called test_options

I'm not sure why re-using the existing depends_on line doesn't look good; can you expand on that? Your solution adds a bit of complexity, and it's not immediately clear to me what justifies this added complexity.
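To make the two alternatives concrete, here is roughly how they would look in a .data file (the test description is illustrative and the function's arguments are omitted; the marker name in option A is the one proposed above, and the test_options syntax in option B is hypothetical, following the direction of the linked PR):

```
# Option A: "fake" dependency on the existing depends_on line.
PSA wrapper in legacy f_rng/p_rng API (ECDSA sign)
depends_on:MBEDTLS_ECDSA_C:THIS_TEST_IS_EXCLUDED_FROM_DRIVER_TEST_COVERAGE_PARITY
mbedtls_psa_get_random_ecdsa_sign

# Option B: a separate metadata line (hypothetical syntax).
PSA wrapper in legacy f_rng/p_rng API (ECDSA sign)
depends_on:MBEDTLS_ECDSA_C
test_options:SKIP_DRIVER_PARITY_CHECK
mbedtls_psa_get_random_ecdsa_sign
```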

valeriosetti commented 1 year ago

I'm not sure why re-using the existing depends_on line doesn't look good; can you expand on that?

In my opinion a "dependency" is something that is needed for building or running the library/test code. In this case we would like to skip a test because it can be built/tested in one case but not in the other, so the cross-check does not work properly. However, the single build, considered independently, is OK. Besides, this new symbol is not something defined in a header file, but something that makes sense only for testing purposes.

However, this is just my interpretation, of course, so I'm open to any different approach ;)

gilles-peskine-arm commented 1 year ago

First, the annotation needs to be more precise: we don't want to exclude this test from parity checking when testing accelerated hashes, for example.

Does that really matter? The test case will have a more specific dependency on some non-PSA symbol. The extra symbol to exclude from parity checking is just there to say “it's not a mistake to have a non-PSA dependency”.

we can save CI resources by having a single "reference" component that would be common to all "accelerated" components. (…) But this doesn't work as well if we skip test cases in the reference component.

I don't understand this part. What wouldn't work with an everything-through-driver component?

mpg commented 1 year ago

First, the annotation needs to be more precise: we don't want to exclude this test from parity checking when testing accelerated hashes, for example.

Does that really matter? The test case will have a more specific dependency on some non-PSA symbol. The extra symbol to exclude from parity checking is just there to say “it's not a mistake to have a non-PSA dependency”.

I don't think the extra symbol is here just for that. We already have a non-PSA dependency indeed and it's not a mistake indeed. But what analyze_outcomes.py is doing is comparing the sets of tests skipped in a build with ECDSA_C vs in a build without ECDSA_C (but ECDSA provided by a driver). Currently, it sees a test skipped in the second build but not the first, so it complains; that's its job. We want to tell the script "when comparing those two builds, don't worry about this case", because we've manually checked that this disparity is not an indication of anything wrong in ECDSA driver support.

But when the script is comparing the "hashes in software" and "accelerated hashes" builds, if the test mbedtls_psa_get_random_ecdsa_sign is skipped in the accelerated build and not in the software build, I want the script to let me know, because that's definitely not expected and so likely an indication of a problem somewhere.

Basically the set of expected & acceptable disparities depends on the pair of builds being compared.

On another front, what do you think about my point that referring to a case by its description would be acceptable in the specific case of an ignore list, since any change in the description would be noticed as a result of the script failing?

gilles-peskine-arm commented 1 year ago

The implementation of driver parity testing feels more complex than I'd expect. In particular, I wonder why “no non-PSA dependencies in the PSA domain (except explicit exceptions)” is not a good enough approximation. Is there a design analysis document that lists the requirements and derives the architecture?

I don't think the extra symbol is here just for that. We already have a non-PSA dependency indeed and it's not a mistake indeed.

Hmmm. A test is skipped only due to a dependency or to a runtime decision to skip. For a test to not run with drivers, one of the following has to happen:

  1. The test case depends on a non-PSA crypto feature.
  2. The test case depends on a combination of compilation options that is never reached in any driver build.
  3. The test case is skipped at runtime with a condition that is true in at least one non-driver build and false in all driver builds.

I thought the goal of the test parity check was primarily (2), since (1) could be enforced through simple dependency checks and (3) is rare enough to not really matter. So for (2), is it really important which driver-based build gets the test to run? Especially when we can do all-driver builds (as opposed to separating algorithms due to incomplete support): wouldn't this reduce (2) to an analysis of test case dependencies?
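For reference, the approximation mentioned above ("no non-PSA dependencies in the PSA domain, except explicit exceptions") could be checked statically along these lines (a hypothetical sketch; the prefix test and the allow list are illustrative):

```python
# Hypothetical static check: in test suites that exercise the PSA API,
# flag any depends_on symbol that is not a PSA configuration symbol
# (PSA_WANT_xxx), unless it is explicitly allow-listed with a reason.
ALLOWED_NON_PSA_DEPENDENCIES = {
    'MBEDTLS_ECDSA_C',  # e.g. the mbedtls_psa_get_random_ecdsa_sign case
}

def unexpected_dependencies(depends_on_symbols):
    """Return the depends_on entries that the approximation would reject."""
    return [dep for dep in depends_on_symbols
            if not dep.lstrip('!').startswith('PSA_WANT_')
            and dep.lstrip('!') not in ALLOWED_NON_PSA_DEPENDENCIES]
```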

Basically the set of expected & acceptable disparities depends on the pair of builds being compared.

Well, ok, that's how it's currently implemented. But why do we need this level of precision?

what do you think about my point that referring to a case by its description would be acceptable in the specific case of an ignore list, since any change in the description would be noticed as a result of the script failing?

That's not great, but it's acceptable.