CTSRD-CHERI / llvm-project

Fork of LLVM adding CHERI support

Question on manipulating purecap IR to remove capabilities from some pointers #478

Open olivierpierre opened 4 years ago

olivierpierre commented 4 years ago

Dear all,

This is not an issue per se but rather a question for experts in the CHERI compiler.

I understand that in hybrid mode, only pointers with the __capability qualifier are represented as capabilities, whereas in purecap mode all pointers are. We are thinking about a second version of the hybrid model in which all but selected pointers would be treated as capabilities -- we have a bit of motivation to do so. We could use the standard hybrid mode for that, annotating most pointers with __capability and setting the bounds/rights correctly with functions like cheri_ptr(), but obviously that represents an enormous amount of work -- at least if one wanted to do it manually.
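To make the manual hybrid approach concrete, here is a minimal sketch of what annotating one pointer could look like. The macros CAP and MAKE_CAP and the functions sum_ints and sum_array are hypothetical names made up for this illustration. The sketch assumes the CHERI Clang spellings __capability, __cheri_tocap and __builtin_cheri_bounds_set rather than cheri_ptr(), and it falls back to plain pointers when not built by a hybrid CHERI compiler, so treat it as an outline rather than tested CHERI code.

```c
#include <assert.h>
#include <stddef.h>

#ifndef __has_feature
#define __has_feature(x) 0   /* for compilers that lack __has_feature */
#endif

#if __has_feature(capabilities) && !defined(__CHERI_PURE_CAPABILITY__)
/* Hybrid CHERI (assumed spellings): only annotated pointers are capabilities. */
#define CAP __capability
#define MAKE_CAP(type, p, nbytes)                                 \
    ((type * __capability)__builtin_cheri_bounds_set(             \
        (__cheri_tocap type * __capability)(p), (nbytes)))
#else
/* Purecap or non-CHERI build: leave the pointer as it is. */
#define CAP
#define MAKE_CAP(type, p, nbytes) (p)
#endif

/* The callee receives a capability in hybrid mode, a plain pointer otherwise. */
int sum_ints(const int * CAP values, size_t count)
{
    int total = 0;
    for (size_t i = 0; i < count; i++)
        total += values[i];
    return total;
}

/* The caller holds an ordinary pointer and derives a bounded capability from it. */
int sum_array(int *values, size_t count)
{
    assert(values != NULL);
    int * CAP bounded = MAKE_CAP(int, values, count * sizeof(int));
    return sum_ints(bounded, count);
}
```

Every pointer that should be protected needs this treatment at both its declaration and the point where it is derived, which is where the amount of manual work comes from.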

Another approach may be to go with purecap and remove the capability representation from selected pointers. I have been looking at the LLVM IR generated by the purecap and hybrid modes of the compiler for some simple examples and noticed that all the data is located in, and all the pointers point to, address space 200.

As a naive attempt to remove the capabilities on selected pointers, I manually edited the LLVM IR generated by the purecap frontend (obtained with cheribsd128purecap-clang -S -emit-llvm) for a minimal example, removing the addrspace(200) qualifiers. This is the C source code:

int main(int argc, char **argv) {
    int data = 1;
    int *ptr = &data;
    *ptr = 0;
    return 0;
}

IR generated by the purecap compiler:

/* ... */
  %data = alloca i32, align 4, addrspace(200)
  %ptr = alloca i32 addrspace(200)*, align 16, addrspace(200)
/* ... */
  store i32 1, i32 addrspace(200)* %data, align 4
  store i32 addrspace(200)* %data, i32 addrspace(200)* addrspace(200)* %ptr, align 16
  %0 = load i32 addrspace(200)*, i32 addrspace(200)* addrspace(200)* %ptr, align 16
  store i32 0, i32 addrspace(200)* %0, align 4
/* ... */

My (naive) attempt at removing capabilities:

  %data = alloca i32, align 4
  %ptr = alloca i32 *, align 8
/* ... */
  store i32 1, i32 * %data, align 4
  store i32 * %data, i32 ** %ptr, align 8
  %0 = load i32 *, i32 ** %ptr, align 8
  store i32 0, i32 * %0, align 4

When I try to create an executable from this IR I get the following error: Allocation instruction pointer not in the stack address space!

Is it because everything needs to be in address space 200 in the case of purecap mode?

More generally, do you think this approach is realistic? Or is it a better solution to focus on hybrid mode and try to have most of the pointers, and the data they point to, in address space 200?

Thank you very much

brooksdavis commented 4 years ago

Not being a compiler person I can't comment on the IR approach.

I can say that our pure-capability FreeBSD kernel is an example of a program with multiple types of pointers as a result of supporting legacy 64-bit userspace programs. In practice those userspace pointers are all turned into uint64_t's. What we'd actually prefer would be to annotate them with __ptr64 so we could retain types throughout the program (LLVM supports __ptr64 and __ptr32, but the apparent lack of GCC support has discouraged us from experimentation thus far). I suspect that extending this infrastructure is the best way forward. This code is greatly simplified by the fact that those pointers are not directly dereferenceable and instead must be accessed via copyin/copyout/etc., which means we don't need to handle what those pointers mean at the ABI level. If you're actually using them then you'll need to define those semantics (even if that's just a non-NULL DDC).
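As a rough illustration of the "userspace pointers become uint64_t" pattern, the sketch below keeps a legacy address as an integer and only reaches the data through a copy routine. Everything in it is hypothetical: user_addr_t, fake_user_memory and mock_copyin are stand-ins invented for this example, and mock_copyin only imitates the general shape of copyin, not the real kernel interface or its error codes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical: a legacy userspace address carried as a plain integer. */
typedef uint64_t user_addr_t;

/* Stand-in for the user address space, so the sketch runs anywhere. */
static unsigned char fake_user_memory[64];

/*
 * Hypothetical copy routine: the integer address is never dereferenced
 * directly; it is validated and then used as an offset into the region.
 * Returns 0 on success and -1 if the range falls outside the region.
 */
int mock_copyin(user_addr_t uaddr, void *kdst, size_t len)
{
    if (uaddr > sizeof fake_user_memory ||
        len > sizeof fake_user_memory - uaddr)
        return -1;
    memcpy(kdst, fake_user_memory + uaddr, len);
    return 0;
}
```

Because the integer never becomes a C pointer, the compiler does not have to decide what dereferencing it would mean at the ABI level; the cost is that the pointee type is lost, which is the gap __ptr64 would fill.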

jrtc27 commented 4 years ago

Like a lot of things, the answer is "it depends". Ultimately, if you want capabilities for your normal function calls and returns then you need to use a pure capability ABI, and if you want normal addresses for them then you need a hybrid ABI, at least at the IR level. Pointers you are given, say as function arguments in the IR, really don't concern the backend, and in theory both kinds should just work. Where things get difficult is where the lines get blurred. For example, we know the stack is a capability, and so in a pure-capability ABI taking the address of a stack variable should give you a capability. Moreover, some of LLVM's instructions/intrinsics are polymorphic and others are monomorphic. The alloca instruction is monomorphic, and thus we have to choose based on the ABI which address space it returns pointers to, and so if you want the "wrong" type for your ABI you need to add instructions to convert between the two. As far as I know there's no technical reason why it couldn't be overloaded to support both, other than that it is not something we need and that it comes with the risk of accidentally generating the wrong code silently if an address space qualifier ever gets dropped (which is a continued battle against upstream code, given address spaces are barely used upstream outside of niche targets and languages).

So, yes, I think this approach is one that should in theory work; really there is some amount of orthogonality here between the language-level pointers and the sub-language-level pointers, but we currently conflate the two. But depending on how far you want to take it, it might be a significant undertaking (i.e. if you want things like an overloaded alloca; if you instead do it entirely in the Clang frontend and generate suitable conversion code then it is probably more manageable, but you may then find yourself limited by what the backends support).
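For a feel of what "suitable conversion code" means at the source level, here is a small sketch of the kind of explicit conversion that exists today in hybrid mode, going from a capability back to a plain pointer. CAP, FROM_CAP, legacy_store_zero and clear_via_capability are hypothetical names for this illustration, and the __cheri_fromcap spelling is assumed from CHERI Clang; on any other compiler the macros collapse to ordinary pointers so the code still builds.

```c
#include <assert.h>
#include <stddef.h>

#ifndef __has_feature
#define __has_feature(x) 0   /* for compilers that lack __has_feature */
#endif

#if __has_feature(capabilities) && !defined(__CHERI_PURE_CAPABILITY__)
/* Hybrid CHERI (assumed spellings): explicit capability-to-pointer cast. */
#define CAP __capability
#define FROM_CAP(type, c) ((__cheri_fromcap type *)(c))
#else
/* Purecap or non-CHERI build: there is only one pointer representation. */
#define CAP
#define FROM_CAP(type, c) (c)
#endif

/* Stands in for code that only understands plain addresses. */
static void legacy_store_zero(int *plain)
{
    *plain = 0;
}

/* Receives a capability and converts it at the boundary to legacy code. */
void clear_via_capability(int * CAP cap)
{
    int *plain = FROM_CAP(int, cap);
    assert(plain != NULL);
    legacy_store_zero(plain);
}
```

A frontend-only scheme would have to emit a conversion like this at every point where a "selected" non-capability pointer meets a capability, including the address of each stack variable.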

olivierpierre commented 4 years ago

Thank you very much for your answers, I will have a look at __ptr64/32. Regarding the modification of the IR passed through the purecap compiler, I take it that it's much more complicated than simply editing the IR!