coconut-svsm / svsm

COCONUT-SVSM
MIT License
122 stars 43 forks source link

cpu/apic: Clean up APIC emulation code #386

Closed msft-jlange closed 5 months ago

msft-jlange commented 5 months ago

This change addresses review feedback that was posted on the APIC emulation PR after it merged.

msft-jlange commented 5 months ago

On the topic of error handling, I noticed the following in the latest version of the Alternate Injection spec you sent.

What you highlighted is clearly inconsistent and needs to be fixed. I will plan to revise the document.

The correct behavior is to return SVSM_ERR_UNSUPPORTED_PROTOCOL once emulation has been disabled on the calling vCPU. For protocol handling, this check is performed in apic_protocol_request(), so that a consistent error can be returned. Since this is the case, there isn't really any reason to do additional enablement checking in any of the specific protocol handlers because nothing can change between this outermost check and the individual operations. That means that the change to use Option<Result<..>> isn't really necessary - or even useful - because none of those code paths will ever be able to generate None. Consequently, I think it's better for the individual handlers to just assume that APIC emulation is active, and to use unwrap() to access the local APIC from the Option<LocalApic>. This will panic if these functions are called unexpectedly when APIC emulation is inactive, but that's a good thing, since we're trying to establish the pattern that they should never be called unless APIC emulation is known to be active.

I think it is preferable to generate SVSM_ERR_UNSUPPORTED_PROTOCOL from a single point rather than requiring each individual function handler to know what error is expected when emulation is inactive.

00xc commented 5 months ago

I think it is preferable to generate SVSM_ERR_UNSUPPORTED_PROTOCOL from a single point rather than requiring each individual function handler to know what error is expected when emulation is inactive.

The problem is, as it is, we are checking twice whether the Option is None or not, and that has two meanings: either APIC emulation was never enabled (restricted injection was not enabled) or it was disabled after a request. We need the following to match the spec:

  1. If restricted injection is disabled, return UNSUPPORTED_PROTOCOL.
  2. If APIC is None, return INVALID_REQUEST.
  3. If APIC failed to service request, return INVALID_PARAMETER.
  4. Otherwise, return Ok(()).
msft-jlange commented 5 months ago

The problem is, as it is, we are checking twice whether the Option is None or not, and that has two meanings: either APIC emulation was never enabled (restricted injection was not enabled) or it was disabled after a request. We need the following to match the spec:

First, the spec is wrong and has been updated (but not yet published). The correct behavior is to return SVSM_ERR_UNSUPPORTED_PROTOCOL in all cases where APIC emulation is not enabled (either because it was never enabled or because it was disabled at some point). The spec no longer contemplates the possibility of SVSM_ERR_INVALID_REQUEST, only of SVSM_ERR_INVALID_PARAMETER.

Beyond that, it seems we have a clear difference of opinion. You are optimizing to minimize the number of times the Option is checked. I am optimizing to minimize the number of places that the code has to be aware of the circumstances in which SVSM_ERR_UNSUPPORTED_PROTOCOL must be detected. My opinion is that, on balance, the code is more maintainable if the is-enabled check is done once at the top of the protocol handler, so that any future protocol commands do not require awareness of the is-enabled checking, and the work required to check the Option a second time is negligible and should not factor into performance/size calculations. So I'm tilting in favor of a code base that I believe can be more easily maintained.

What we do clearly agree on is that switching to Option was a real improvement in safety because there is no longer any ambiguity about whether the APIC state has meaning, and that's something this PR clearly accomplishes.

00xc commented 5 months ago

The problem is, as it is, we are checking twice whether the Option is None or not, and that has two meanings: either APIC emulation was never enabled (restricted injection was not enabled) or it was disabled after a request. We need the following to match the spec:

First, the spec is wrong and has been updated (but not yet published). The correct behavior is to return SVSM_ERR_UNSUPPORTED_PROTOCOL in all cases where APIC emulation is not enabled (either because it was never enabled or because it was disabled at some point). The spec no longer contemplates the possibility of SVSM_ERR_INVALID_REQUEST, only of SVSM_ERR_INVALID_PARAMETER.

Then my point still stands, but we simply omit step 1.

Beyond that, it seems we have a clear difference of opinion. You are optimizing to minimize the number of times the Option is checked. I am optimizing to minimize the number of places that the code has to be aware of the circumstances in which SVSM_ERR_UNSUPPORTED_PROTOCOL must be detected. My opinion is that, on balance, the code is more maintainable if the is-enabled check is done once at the top of the protocol handler, so that any future protocol commands do not require awareness of the is-enabled checking, and the work required to check the Option a second time is negligible and should not factor into performance/size calculations. So I'm tilting in favor of a code base that I believe can be more easily maintained.

I simply don't see how doing is_some() upfront and then unwrap() in a completely separate function, under the assumption that the check happens somewhere else, is more maintainable. It's not about runtime performance, it's about robustness.

You can actually have both goals: remove the first check, let the PerCpu methods return Option<Result<...>>, and let the top-level APIC handler decide what to do when the Option is None. This removes redundant checking and has the property you want to optimize for.

What we do clearly agree on is that switching to Option was a real improvement in safety because there is no longer any ambiguity about whether the APIC state has meaning, and that's something this PR clearly accomplishes.

Yes, but it does the equivalent of:

if some_option.is_some() {
    some_option.unwrap().some_method();
} else {
    Err(some_error)
}

Which you can clearly see why is worse. It's not even that, because the unwrap() happens on a function on a completely different file, under the assumption that it will never be called because the first check happens somewhere else. This is less maintainable in my humble opinion.

msft-jlange commented 5 months ago

Yes, but it does the equivalent of:

if some_option.is_some() {
    some_option.unwrap().some_method();
} else {
    Err(some_error)
}

In at least one case, it's subtly different than that:

if some_option.is_some() {
    let params = process_protocol_parameters()?; // error possible on parameter checking but parameter checking should only be attempted if there is an object worth operating on
    some_option.unwrap().some_method(params)
} else {
    Err(some_error)
}

(cf. apic_configure_vector(), which I now see has yet another redundant check that should be cleaned up.)

Yes, it's possible to flow the Option<Result<..>> all the way to the top of the call chain, but that would mean that every new call that's added to the APIC protocol handler would require the author to remember to include the correct logic for flowing the optional result back in the right way. Contrast that with the approach of having is_some() at the top of apic_protocol_request() which provides blanket error handling, making it impossible for any future call handler to get it wrong. I really like code structures that make it impossible for anybody to make mistakes when extending existing code.

msft-jlange commented 5 months ago

I really like code structures that make it impossible for anybody to make mistakes when extending existing code.

As a case in point, I just tried putting together a commit that flowed the Option<Result<..>> all the way back up to apic_protocol_request() and it took me three tries to get the semantics correct. I made a mistake in apic_query_features() when I failed to check for APIC emulation active before returning Some(Ok(())), and I made a mistake in apic_configure() when I failed to check for APIC emulation active before calling through SVSM_PLATFORM. If I, as the person who wrote this code, can make such serious mistakes so easily, then why would we have faith that somebody even less familiar with the APIC code would get new functions correct? This is exactly what I mean when I talk about structuring the code to make a whole class of mistakes impossible.

00xc commented 5 months ago

I really like code structures that make it impossible for anybody to make mistakes when extending existing code.

As a case in point, I just tried putting together a commit that flowed the Option<Result<..>> all the way back up to apic_protocol_request() and it took me three tries to get the semantics correct. I made a mistake in apic_query_features() when I failed to check for APIC emulation active before returning Some(Ok(())), and I made a mistake in apic_configure() when I failed to check for APIC emulation active before calling through SVSM_PLATFORM. If I, as the person who wrote this code, can make such serious mistakes so easily, then why would we have faith that somebody even less familiar with the APIC code would get new functions correct? This is exactly what I mean when I talk about structuring the code to make a whole class of mistakes impossible.

I'd counter by saying this does make future mistakes possible by calling the functions that do unwrap() under the assumptions I detailed above. But okay, let's merge this and we can further improve it if we find a way.

00xc commented 5 months ago

Maybe for another PR, but it seems these errors are not contemplated in the latest draft of the spec, or at least not with that name:

https://github.com/coconut-svsm/svsm/blob/f196bdcbe56a894d943bdafb8a98705d20a63074/kernel/src/protocols/apic.rs#L26-L27

msft-jlange commented 5 months ago

Maybe for another PR, but it seems these errors are not contemplated in the latest draft of the spec, or at least not with that name:

Wow, you've caught the fact that the registration design is based on an old version of the spec and has nothing to do with how it is described in the current specification. I thought I had rewritten the code at the same time that I revised the spec, but apparently that got lost, or I only dreamt the fact that I wrote the code. That needs to be overhauled, but I think it would be better to do that as part of a separate PR.

Thanks for catching this glaring mistake!

msft-jlange commented 5 months ago

@00xc can you take another look? I had to update because the previous version was unwrapping Option<LocalApic> causing a copy of the APIC, rather than unwrapping Option<&LocalApic> which would permit operating on the actual APIC state. I'm inhibiting Copy on LocalApic so there are no additional surprises due to inappropriate copies.

00xc commented 5 months ago

Please resolve conflicts and I'll give it a final look.