correctcomputation / checkedc-clang

This is the primary development repository for 3C, a tool for automatically converting legacy C code to the Checked C extension of C, which aims to enforce spatial memory safety. This repository is a fork of Checked C's.
14 stars 5 forks source link

Add a mechanism to specify "open world" files #738

Open john-h-kastner opened 2 years ago

john-h-kastner commented 2 years ago

Allow the user of 3C to specify files and directories as "open world" instead of default "closed world". For closed world files, 3C would continue using its current assumptions. It has a complete program with all dependencies and clients available. Open world files drop this assumption. Dependencies might not be available (manifesting as undefined functions), and there might be more clients than are in the available source code (inferred types must accommodate unchecked callers without casting).

Much of the unique behavior enabled for open world files would be the same as can already be controlled with the -itypes-for-extern and infer-types-for-undefs flags. It might be possible to redefine them in terms of applying open or closed world behavior to the entire project in order to avoid duplicating similar logic. Alternatively, they could be deprecated in favor of whatever new mechanism is added to specify open world files.

Closed world

We assume 3C has access to all callers of all functions, and all uses of all structures, global variables, and typedefs. This matches 3C's current assumptions, so behavior should not change in this mode.

Open World

We cannot assume that 3C can see all function caller, function definitions, etc., so the analysis must be adjusted to permit arbitrary types in the missing code. This will act like some combination of the current -itypes-for-extern and -infer-types-for-undefs flags.

Example Use: converting a library header

In the libjpeg tutorial, the jpeglib.h header file was copied into a local include directory. 3C was then re-run with -infer-types-for-undefs to enable solving for and inserting checked types into the local copy of the header file even though the functions in the header were not defined.

After the changes proposed here, the flag passed to 3C would then be -open-world=./include to specify that files in ./include use the open world assumptions. All other files use the (default) closed world assumptions. When 3C is re-run, the open world assumption allows the undefined functions in jpeglib.h to solve to itypes as before. The changes are:

  1. Any undefined functions outside of ./include are still unchecked.
  2. Structure fields and global variables rewrite to itypes instead of fully checked types. This should allow the converted version of the header to more easily be used to compile the unconverted libjpeg source code. Previously these itypes needed to be added manually for this to be possible.
  3. Typedefs are completely unchecked. This is a problem and will substantially limit conversion in to_ppm.c due to local variables using the typedef remaining unchecked.

Example Use: Converting a single file in a project

The current approach is to enable -itypes-for-extern, convert with 3C, and then keep only the converted header files and the single source file you want to convert. Instead, the whole project could be specified as open world with -open-world=.. This would make all functions solve to itypes, all typedefs be unchecked, and all structure fields and global variables use itypes instead of checked types. The changes from current behavior are:

  1. Any undefined functions will also solve to itypes. Since -itypes-for-extern is only expected to be enabled in phase two of porting, it might be reasonable to assume that there are no undefined functions outside of any previously specified open world files, since these could be warnings or errors as discussed earlier.
mattmccutchen-cci commented 2 years ago

Good write-up. I have only a few things to add at this time:

A function can be rewritten to a checked type if it is internally checked, regardless of its callers, because we have access to all callers and can automatically insert casts.

Not always: insertion of a cast at a call site might be blocked by a macro (or, in theory, unwritable code, although that would be a very unusual project setup). I believe 3C handles this correctly now.

Soundness: As I think we discussed briefly at Friday's meeting, in an open world, any guarantee of spatial memory safety depends on an assumption that external code interacts with shared declarations in a way consistent with the declarations' Checked C annotations (as well as an assumption that the external code is internally spatially safe, which there's nothing we can do about). This applies to all kinds of program elements: not just undefined functions, but also structs and global variables that could be generated in project code and passed to external code or vice versa.

The common case is that the project uses a plain-C external library and needs Checked C declarations for the library that are consistent with its documented behavior, as currently described in the 3c -help for -infer-types-for-undefs. There can still be data flow in both directions (function parameters and returns, callbacks, etc.), but the point of view is that the library has already defined the API in some form and the project needs to describe that API in Checked C. However, theoretically, the reverse scenario could also happen: the project could be a library implemented in Checked C that supports plain-C clients (which would require some way of providing plain-C headers, such as erasable syntax). Then the project needs to ensure that the Checked C declarations used to check the safety of its implementation are consistent with the project's own API documentation that it provides to plain-C clients of the library. (I suppose this might be true "by construction" if the API documentation shows the declarations with Checked C annotations as a way of communicating assumptions to the reader, even if the reader is not using Checked C.)

In all of these cases, to give the user a practical way to ensure the project is spatially safe, we need a workflow to ensure that all shared declarations get reviewed. This generalizes what is currently described in #698 for undefined functions. It might be appropriate to broaden #698 to cover review of all kinds of shared declarations and keep this issue for discussion of the remaining topics.

Porting phase 2: If we run 3C on the whole project (as is our current practice), we don't actually have unknown callers and definitions; we just want to generate output that is compatible with all callers and definitions as they currently appear in the input, so that the user can copy part of the output into their main branch without introducing errors in the rest of the code. In principle, if the user knows in advance which parts of the code they are going to update, a more direct way to achieve the same end goal would be to mark the rest of the code as unwritable, assuming 3C has an option to do that (I will file an issue for that soon). The use of -itypes-for-extern in this scenario is really just a trick (or a hack, depending on your point of view) to do a single 3C run and then let the user look at the output and decide which part they want to copy (based on how much work it will be to fix the errors in that part), rather than having to do a separate 3C run for each file with all the other files marked unwritable (which would take quadratic time) before making the decision.

Assuming the user wants to copy file A (and maybe some header files?), conceivably there might be edge cases in which running 3C with all other files marked unwritable gives slightly better output for A than just running 3C with -itypes-for-extern. If that happens, it might be useful to for the user run 3C once with -itypes-for-extern to get a rough assessment of which file to work on next and then run 3C again with the other files marked unwritable to get the real output to copy.

Since running 3C on the whole project as part of phase 2 is a fundamentally different use scenario than a truly open world, we can expect that it would have slightly different needs that would ideally be accommodated by a separate option, although we can decide whether they are important enough to justify the work of actually maintaining a separate option. A few things I can think of so far:

  1. As you mentioned, undefined functions or global variables in writable code should ideally still generate an error or warning.
  2. If 3C does see that the definition and all callers of a program element X are sufficiently checked, it can go ahead and change the type of X to fully checked. Currently, we leave all of those changes to the very end of porting, when we run 3C with -itypes-for-extern turned off, but it could be nice to let them happen sooner.

On the other hand, if we run 3C on only one file at a time (to reduce the running time?), then that file is truly open-world from the point of view of that 3C run, except that we don't have to worry about soundness of declarations shared only with the rest of the project because we ultimately will be checking the entire project against those declarations.