GaloisInc / BESSPIN-Tool-Suite

The core tool of the BESSPIN Framework.

Segmentation Fault Scoring Philosophy #1121

Closed njshanahan closed 3 years ago

njshanahan commented 3 years ago

As a follow-on to #1015, I wanted to ask generally about the segmentation fault scoring philosophy.

If a buffer error test segmentation faults, BESSPIN reports one less test as "uncaught" (i.e. credit is received). However, if a resource management test produces a segmentation fault (e.g. CWE-476 and CWE-587), the score is reported as HIGH. Is this difference intentional? Thanks!
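For concreteness, hypothetical minimal versions (not the actual BESSPIN test sources) of the kind of code such resource management tests exercise might look like:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical CWE-476-style case: dereference a NULL pointer. Most OSes
 * deliberately leave page 0 unmapped, so this faults with SIGSEGV. */
static void null_deref(void) {
    char *p = NULL;
    *p = 'x';
}

/* Hypothetical CWE-587-style case: assign a fixed, arbitrary address to a
 * pointer. The assignment itself is the weakness; a SIGSEGV, if one occurs
 * at all, only shows up at the later dereference. */
static void fixed_address(void) {
    volatile int *reg = (volatile int *)(uintptr_t)0xDEAD0000u;
    *reg = 1;
}

int main(void) {
    null_deref();
    fixed_address();   /* never reached if the first call faults */
    return 0;
}
```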

Including @austinhroach in the discussion.

rtadros125 commented 3 years ago

This is a great point, and my initial thought is that you're right; the scoring of buffer errors does not seem to comply with the BESSPIN philosophy document or with the scoring of the other classes.

@bboston7, I am interested to hear what you think since you worked with the buffer errors class more closely than I did.

@abakst (Apologies for bringing you into this discussion, but I believe your input would be of high value since you contributed the most to the buffer errors work. If you're busy with something else, I totally understand.) Please read this paragraph first, then let us know what you think about changing the scoring to consider SEGFAULT or any other kernel signal as HIGH instead of uncaught.

@njshanahan I am interested to hear your opinion as well. I see that you pointed out a discrepancy, but you didn't weigh in on what I summarized in the BESSPIN philosophy document (which I believe is a brief distillation of two years of discussions). Do you agree with changing the scoring of SEGFAULTs in buffer errors if this is the course of action we decide to take?

CC @kiniry @dmzimmerman for visibility.

abakst commented 3 years ago

Some thoughts relevant to the scoring philosophy and the question in this issue:

I think the relevant problems are (1) how to interpret the test results, and (2) what the "weakness" is.

Starting with (2), in the case of buffer error tests, the weakness is that the software reads outside the bounds of some object (e.g. an allocated buffer in a C program). I don't think it's wrong to interpret the baseline OS's behavior as correctly detecting the weakness, as the OS is charged with, among other things, guaranteeing process isolation -- in this case, preventing processes from reading/writing outside of their allocated memory (albeit now at a much coarser granularity).

Presumably, this is why we want to test different types of accesses ("near" and "far", "small" and "large"), as the OS will typically only track resource allocation at a very coarse level, whereas attackers might exploit weaknesses that persist even at a much finer granularity.
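To make the granularity point concrete, here is a minimal hypothetical sketch (not an actual BESSPIN test) of "near" versus "far" accesses, assuming a conventional 4 KiB-page MMU:

```c
#include <stdlib.h>

int main(void) {
    char *buf = malloc(64);
    if (!buf)
        return 1;

    /* "Near, small": one byte past the 64-byte object. This almost always
     * lands in allocator metadata/padding on the same mapped page, so the
     * MMU sees nothing and no SIGSEGV is delivered. */
    volatile char near_read = buf[64];
    (void)near_read;

    /* "Far": hundreds of KiB past the object. This is likely (though not
     * guaranteed) to cross into an unmapped page, so the OS delivers
     * SIGSEGV -- but only because of how the page tables happen to be laid
     * out, not because the 64-byte object bounds were violated. */
    volatile char far_read = buf[64 * 4096];
    (void)far_read;

    free(buf);   /* not reached if the far access faults */
    return 0;
}
```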

Moving on to (1): as I recall (which is already somewhat dubious), the purpose of the scoring system is not to produce a score that has an interpretation in isolation, but instead to allow one to compare different (SSITH + baseline) systems or configurations. So you might be interested in trying to assign a score based on whether or not the behavior differs in some interesting way from a baseline system. I'm not sure what detection system would produce results that are somehow more meaningful than indicating an access violation (in the specific case of buffer errors), but that's probably a failure of my imagination. Perhaps an input to the test is the "expected" behavior (maybe this is already the case?): again, in the specific case of buffer overruns I think a SEGFAULT is reasonable behavior.

It seems this runs a bit counter to the philosophy offered in the document. In that case, I'd suggest adding some text that explains why this is the case (or otherwise modifying the document). For example, if I'm not mistaken, one reason the OS may issue a SEGFAULT is that the MMU itself raises a fault. How does this square with the last sentence of the paragraph beginning "A kernel signal is not enough to protect against a specific weakness type"?

In any case, the interpretation should be consistent with how tests for CWE-476 are scored, as CWE-476 describes the action of accessing a specific resource.

As an aside, CWE-587 reads a little bit differently to me: the weakness is in the assignment of the pointer. I think if the program completes the assignment, then that's the weakness: this seems a little different from the sorts of errors we're talking about otherwise.

brooksdavis commented 3 years ago

> A kernel signal is not enough to protect against a specific weakness type. This is a more subtle point and led to many discussions. What should the score be if the test for the weakness causes a segmentation fault? For example, CWE-672 is about operating on a resource after expiration or release. One of the tests that cover this CWE is to allocate some memory to a few pointers, then free one of them, and then access the memory location referenced by that freed pointer. In Debian, for instance, this behavior leads to a segmentation fault. The question is: should this be considered as the weakness type not being present in the OS-processor pair? Or can this be exploited somehow? If a malicious user was able to catch this exception and proceed with the program, or try again with a slightly modified pointer value, could they exploit it or infer any information about the memory or the rest of the program? Because of how these signals can be ignored or caught, we decided that a kernel signal is not enough to protect the system, and that it does not have any impact on security resilience. Unless the processor itself issues a signal or an interrupt, any kernel signal would still score HIGH.

This paragraph needs a major re-think. The way UNIX-like OSes translate hardware traps to processes is signals. If the programmer catches the signal and continues, that is their intent. With the possible exception of some of the PPAS cases, malicious programming is not something we're supposed to be addressing.
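For reference, a minimal sketch (plain POSIX, nothing BESSPIN-specific) of a program that catches SIGSEGV and simply keeps going:

```c
#define _POSIX_C_SOURCE 200809L
#include <setjmp.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>

static sigjmp_buf recover;

static void on_segv(int sig) {
    (void)sig;
    siglongjmp(recover, 1);   /* jump back out instead of dying */
}

int main(void) {
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_segv;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, NULL);

    if (sigsetjmp(recover, 1) == 0)
        *(volatile char *)0 = 1;   /* hardware trap -> SIGSEGV */

    /* The process has "handled" the fault and keeps executing. */
    printf("continued after SIGSEGV\n");
    return 0;
}
```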

As written, this literally places both CHERI and the HARD pipelines out of bounds as solutions for buffer errors when used in an OS that uses a process abstraction. That cannot possibly be considered a sensible outcome.

bboston7 commented 3 years ago

This is an interesting point. I agree that scoring for the buffer errors tests violates the philosophy document as written. I also think that the tests are giving the most reasonable score here. I agree with both Alex and Brooks that we should update the document.

As an aside, it's more than just the buffer errors tests that score this way. For example, CWE-825 and INJ-1 both score NONE for a segmentation fault (there may be more, these are just two that came to mind). I would also argue that some of the other tests that score HIGH on segmentation faults should score NONE (such as CWE-476).

jrtc27 commented 3 years ago

A segfault due to an out of bounds access is catching a buffer error and should be classed the same as any successful SSITH detection. Claiming anything else is misrepresenting things IMO.

rtadros125 commented 3 years ago

This is a great discussion, and thanks again to @njshanahan for bringing a buried issue into the open. I have opened #1122 to implement the outcome (whichever it is/will be). This ticket's AC (acceptance criteria) should be just a decision.

@austinhroach What do you think? Do we need to schedule a meeting to discuss? How big of a meeting :) (CC @Abivin12 @rfoot for visibility)

brooksdavis commented 3 years ago

> As an aside, CWE-587 reads a little bit differently to me: the weakness is in the assignment of the pointer. I think if the program completes the assignment, then that's the weakness: this seems a little different from the sorts of errors we're talking about otherwise.

I think it's important to acknowledge that detection is going to be deferred in practice. With CHERI we defer the fault until later when we try to use the now-invalid pointer. Even in narrow cases like return addresses where the storage could be architectural, deployed systems like CET defer the check to return rather than gating every store on checking a potentially large list of stored return address locations.

brooksdavis commented 3 years ago

It is worth noting that blindly accepting "process got killed by a signal" as success is problematic. If nothing else, when we analyzed our results in Phase 2 we found that we were catching unintended errors in the framework, not the code demonstrating weaknesses. I don't think there's a whole lot the framework can do about that though.

rwatson commented 3 years ago

I don't see that there's a good architecture/software-independent approach for handling this, since there are no standard UNIX APIs for catching signals directly indicating "that was a buffer overflow". However, here's an example of how we perform these kinds of tests in CHERI-aware code (portable across CHERI architectures including MIPS, RISC-V, and ARMv8-A):

https://github.com/CTSRD-CHERI/cheribsd/blob/master/bin/cheribsdtest/cheribsdtest_fault.c

See test_fault_bounds(), for example, where we specify not only that a signal is generated, but also the expected UNIX signal (SIGPROT), the siginfo cause value (PROT_CHERI_BOUNDS), and the architecture-specific trap number (TRAPNO_LOAD_STORE).

But the key thing is: We need the right fault for the right reason. For example, it is important to us that spatial safety generates a bounds exception, rather than pointer integrity generating an untagged capability exception, when there's an attempt to overrun a stack buffer and overwrite the return address.
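For readers without a CHERI system at hand, a rough, portable approximation of that idea using only POSIX siginfo might look like the following (the SIGPROT / PROT_CHERI_BOUNDS constants are CheriBSD-specific and are not used here; the cheribsdtest harness additionally checks the architectural trap number):

```c
#define _XOPEN_SOURCE 700
#include <setjmp.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>

static sigjmp_buf out;
static volatile sig_atomic_t got_signo;
static volatile sig_atomic_t got_code;

static void record(int sig, siginfo_t *info, void *ctx) {
    (void)ctx;
    got_signo = sig;
    got_code  = info->si_code;   /* e.g. SEGV_MAPERR vs. SEGV_ACCERR */
    siglongjmp(out, 1);
}

int main(void) {
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = record;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, NULL);

    if (sigsetjmp(out, 1) == 0)
        *(volatile char *)0 = 1;   /* the fault under test */

    /* Pass only if the *expected* signal and cause were observed,
     * not merely "some signal arrived". */
    if (got_signo == SIGSEGV && got_code == SEGV_MAPERR)
        printf("expected fault: SIGSEGV / SEGV_MAPERR\n");
    else
        printf("unexpected fault: signo=%d code=%d\n",
               (int)got_signo, (int)got_code);
    return 0;
}
```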

rwatson commented 3 years ago

(And maybe it's also worth specifically observing about the example I've shown: That's an OS integration level test checking a hardware-software protection property. We have separate low-level architectural testing, using instruction sequences, for checking architectural bounds properties. What we're looking for in this test is that the vulnerability is mitigated at a software level, and properly reported via the signal mechanism.)

austinhroach commented 3 years ago

These are all very good points. From my perspective, I think that in most of these cases the segfaults are artifacts of the tests not being aware of the memory space of a process and not making any attempt to respect the bounds of that memory space. If these were weaknesses that an attacker was attempting to exploit, the attacker would use some awareness of the memory space to avoid segfaults. So I think the question that we ultimately want to answer is "If this misbehavior occurred within the valid bounds of the memory space of the process, would the architecture detect the misbehavior?"

Something that may help us here is that both of the SSITH Phase 3 architectures use a signal other than SIGSEGV to communicate a security violation. (Someone correct me if I'm wrong here.) So it would seem sensible to me to score killed-by-SIGSEGV as "uncaught", and killed-by-the-relevant-security-fault-signal as "caught" for SSITH. The downside here is the possibility of race conditions in the event that both signals are generated and the SIGSEGV arrives first, but maybe this could be overcome with a SIGSEGV handler if this is a problem in practice.
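As a sketch only, the host-side policy could look something like the following, where SIGNO_SECURITY and ./weakness_test are placeholders rather than real SSITH constants or BESSPIN paths:

```c
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Placeholder only: whatever signal the platform under test uses to report
 * a security violation (e.g. CheriBSD's SIGPROT); not a real SSITH or
 * BESSPIN constant. */
#define SIGNO_SECURITY SIGSYS

int main(void) {
    pid_t pid = fork();
    if (pid == 0) {
        /* "./weakness_test" is a hypothetical test binary. */
        execl("./weakness_test", "weakness_test", (char *)NULL);
        _exit(127);   /* exec failed */
    }

    int status;
    waitpid(pid, &status, 0);

    if (WIFSIGNALED(status)) {
        int sig = WTERMSIG(status);
        if (sig == SIGNO_SECURITY)
            puts("caught: killed by the security-fault signal");
        else if (sig == SIGSEGV)
            puts("uncaught: plain SIGSEGV from the baseline OS");
        else
            printf("killed by unexpected signal %d\n", sig);
    } else {
        printf("exited normally with status %d\n", WEXITSTATUS(status));
    }
    return 0;
}
```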

Thinking more generally than the two SSITH approaches, if there were adoption of testgen for evaluation of some other secure system in the future, it's possible that such a system could communicate security exceptions via SIGSEGV. But if someone wanted to change the scoring to tailor it for that system, they could do so.

There may be exceptions to my reasoning above. In particular, I think a segfault in response to dereferencing a null pointer is a sufficient system response, since operating systems avoid mapping memory at virtual address 0 specifically to catch null-pointer dereferences.

If there is an appetite for a group meeting to talk through these issues, I would happily participate.

rwatson commented 3 years ago

We should probably consider tests on a case-by-case basis, but in general, CheriBSD delivers SIGPROT for protection errors, rather than SIGSEGV. I wanted to clearly distinguish the two, since SIGBUS and SIGSEGV are confusing enough already -- but also because we imagined that different libraries/pieces of software might want to register different handlers for them. But there are probably cases that don't quite conform to that story as well, possibly even for good reasons.

rwatson commented 3 years ago

This gets us back to a case we talked about a couple of years ago, BTW, in which we define a spatial or temporal safety violation as being one in which there is aliasing between allocations, which differs from simply overflowing a buffer or performing a use-after-free. This allows us to not trigger faults when an allocation is overflowed into padding, or when use-after-free is into freed memory that has not yet been reused. We have strong architectural and software safety properties ensuring that's the case, but you won't get a SIGPROT in some situations, as a result.
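A small hypothetical example of the temporal-safety side of that point, where neither the MMU nor the aliasing-based model is expected to raise a signal:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
    char *p = malloc(64);
    if (!p)
        return 1;
    strcpy(p, "stale data");
    free(p);

    /* Use-after-free into memory that has not been reused: the page is
     * typically still mapped, so an MMU-only system raises no SIGSEGV, and
     * under the aliasing-based definition above no live allocation is being
     * aliased yet, so no fault is required there either. Undefined
     * behavior, but usually silent. */
    printf("%c\n", p[10]);
    return 0;
}
```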

rtadros125 commented 3 years ago

Austin's comment about CHERI issuing SIGPROT and the HARD pipeline issuing an SSITH HARD exception, together with Brooks's second comment about faulty tests and framework bugs, explains the motivation behind the decisions reflected in the current state of the document. We wanted to distinguish between a bad test, the OS acting on its own, and the SSITH CPU-specific reaction.

Also, it is worth mentioning that this is the default scoring; configuring the run to consider any signal or kernel message as a weakness protection is an option.

I don't think there is a clear and obvious right answer to reach here, but rather a mutual consensus.

@austinhroach I am open to either considering all SEGFAULTs (and/or other non-SSITH signals) as "not enough protection" and extending this to buffer errors (while also adding more clarification to the document based on some of the points raised here), which IS the current state of the tool, or meeting to discuss other options in detail. Please let me know what you think.

austinhroach commented 3 years ago

@rtadros125 I think that a default behavior of considering segfaults to be "not enough protection" is the right call, and that that behavior should be extended to the buffer error tests.

rtadros125 commented 3 years ago

Roger that.

As detailed in #1122, we will review and make sure all the testing and scoring are consistent. The document also needs some work. In particular, as Brooks, Jessica, and Alex have pointed out, the part about kernel signals seems too generic and confusing, and the clarification that Austin made has to be made explicit rather than left as implied motivation. Additionally, Brett and I will go through all of your comments again and make sure all of the other specific observations are reflected either in a tweak of a test/score or in an augmentation of the corresponding documentation.

rwatson commented 3 years ago

NB: I'm not sure I necessarily consider SIGSEGV to be an inappropriate signal to generate in all cases, even though I suspect we rarely or never will given the tests being used. The MMU is also an architectural protection mechanism, and the OS's use of the MMU is a completely fine baseline to take before extending to new protection mechanisms. For example, with CHERI you might sometimes choose to use a blend of capabilities and virtual memory to get some property -- such as guard pages within a CHERI-constrained address space sandboxing legacy code. If you hit a guard page within, you get SIGSEGV, but if you hit bounds trying to reach out, you get a CHERI fault. In Morello, there are some edge cases where segmentation exceptions take priority over CHERI protections, such as when you express an invalid virtual address -- this is done for microarchitectural reasons. In the context of this test suite, should any occur, we will analyse and determine whether we should count SIGSEGV as passing, which is fine. But, also for the reasons I identified above relating to padding/quarantine for spatial and temporal safety, we should be a little cautious about assuming that (a) failure to generate a signal and (b) generating a different signal mean non-protection.

austinhroach commented 3 years ago

> NB: I'm not sure I necessarily consider SIGSEGV to be an inappropriate signal to generate in all cases, even though I suspect we rarely or never will given the tests being used. The MMU is also an architectural protection mechanism, and the OS's use of the MMU is a completely fine baseline to take before extending to new protection mechanisms. For example, with CHERI you might sometimes choose to use a blend of capabilities and virtual memory to get some property -- such as guard pages within a CHERI-constrained address space sandboxing legacy code. If you hit a guard page within, you get SIGSEGV, but if you hit bounds trying to reach out, you get a CHERI fault. In Morello, there are some edge cases where segmentation exceptions take priority over CHERI protections, such as when you express an invalid virtual address -- this is done for microarchitectural reasons. In the context of this test suite, should any occur, we will analyse and determine whether we should count SIGSEGV as passing, which is fine. But, also for the reasons I identified above relating to padding/quarantine for spatial and temporal safety, we should be a little cautious about assuming that (a) failure to generate a signal and (b) generating a different signal mean non-protection.

I completely agree with you. The default scoring behavior that I recommended above was not a general statement that SIGSEGV is an inappropriate way to report a security violation or that the MMU is an inappropriate resource for protecting memory. It was purely a pragmatic recommendation based on the architectures that currently exist in SSITH. For the reasons that you mentioned, it will be important to document why that choice was made and how to change the scoring behavior for architectures to which the test suite might be applied in the future that might differ in their reporting mechanisms or might rely at least partially on the MMU for protection.