diffblue / cbmc

C Bounded Model Checker
https://diffblue.github.io/cbmc
Other
847 stars 262 forks source link

RFC: Language mode for Rust #6219

Open danielsn opened 3 years ago

danielsn commented 3 years ago

RMC is a new Rust front-end for CBMC. Currently, it uses the C mode in the symbol-table. We propose adding a Rust mode to symbols. For now, this mode would have the same semantics as C, but would allow us to distinguish rust code from C code. This is particularly important as Rust support linking C code using a FFI, which we need to support in RMC.

Benefits

  1. Allows tools which consume CBMC output to know which language the initial source code was in. For example, when giving a failure trace that includes both Rust and C code, it would be valuable to the trace visualizer to know which language the statements originally came from
  2. Allows the backend to implement different semantics for the different languages. For example (maybe not the best one), one could imagine a case where one language wants unsigned addition to wrap around, while another wants it to saturate.
  3. Allows the addition of language specific checks. For example, Rust has areas of UB which are not defined as such in C - it may be useful to have checks which are only automatically inserted into the Rust side of a linked Rust/C binary.

Design considerations

Rust is not the only language that would benefit from this. Any new language front-end will likely see the same issues we are, and could benefit from a principled mechanism to alleviate them.

Links and documentation

https://github.com/diffblue/cbmc/blob/1ab5de1ac0893c7ca94ffe5463e2c7c4ee781266/src/util/symbol.h#L49

Currently, CBMC appears to have the following langauges:

https://github.com/diffblue/cbmc/blob/48893287099cb5780302fe9dc415eb6888354fd6/src/cbmc/cbmc_languages.cpp#L25-L35

martin-cs commented 3 years ago

If we are adding this then it would also be good to have these for Ada for the https://github.com/diffblue/gnat2goto/ front-end which is currently also using the C mode.

kroening commented 3 years ago

The expected pattern for this (Rust or Ada) is to set up a separate executable, with specialised command-line options. The executable then registers the relevant languages using the register_language API.

Look at jbmc as an exemplar.

martin-cs commented 3 years ago

Both of them use the json_symtab language which is already supported.

martin-cs commented 3 years ago

I guess more generally; for languages that are using the json_symtab interface, what should mode be set to? gnat2goto (and I think RMC) both lie and set the mode to "C".

peterschrammel commented 3 years ago

a. I'm with @martin-cs on this one: json_symtab shouldn't be a 'language', but it should have 'mode' property. b. The issue is that register_language registers a front-end implementation for handling certain input files, which is a slightly different concept than 'language'. c. Having a separate executable is preferable if language-specific options are required or the command line is adapted to seamlessly fit into the language-specific environment (like jbmc supporting the command line syntax of java), but this is somewhat orthogonal.

But back to points 1, 2, 3:

  1. I don't see any issue with adding ID_rust. Likely needs to do something about a and b.
  2. JBMC solves different semantics (e.g. int/float conversions that are different in Java vs C) in the frontend. Language-specific cases in the backend are not the right mechanism. IMO the correct solution is to add the required semantics as primitives to exprt (like @kroening has recently done for the remainder operations) and then translate them to the right primitives in the frontend.
  3. JBMC handles language-specifc checks by having a pass similar to goto_check, which requires 1 and c. This could be avoided by inserting them in the frontend - likely depends on the particular checks. Language-specific cases in goto-check are not the right mechanism IMO.
TGWDB commented 3 years ago

Taking into account the comments and discussion above, would a reasonable choice be as follows.

To support the handling of rust in displays and mixed input (i.e. a goto binary comprising of a mixture of C and rust:

  1. Extend the json_symtab "language" to have a mode property.
  2. Support (at least) json_symtab modes for ID_C and ID_Rust.

Then for semantic differences between the languages and different language checks:

  1. Extend cbmc's capabilities to handle these new semantics (as mentioned by @peterschrammel and done by @kroening for other semantics). These can be done for the semantics desired, but not tied to a specific language or language mode (i.e. one could use them in any language).
  2. Expect alternate front ends (e.g. gnat2goto and rmc) to handle the insertion of correct checks or goto programs that use the above new semantics.

Comments/feedback very welcome!

martin-cs commented 3 years ago

@TGWDB I agree with splitting this in two (getting the correct mode and then how and where you do language-specific things) and with the general principle that mode should be correct (i.e. we should have ID_C, ID_cpp, ID_java, ID_rust, ID_ada, etc.). However may I raise a few concerns with your plan:

  1. The mode is compared to the output of the id() method: https://github.com/diffblue/cbmc/blob/10ddca06abb2c5b13b59737650dd2cfd2565fe72/src/langapi/mode.cpp#L45 which for json_symtab is: https://github.com/diffblue/cbmc/blob/10ddca06abb2c5b13b59737650dd2cfd2565fe72/src/json-symtab-language/json_symtab_language.h#L41 it is not clear to me what this should be set to.

  2. mode is used in two ways, it is either checked directly in language specific bits of code (actually pretty rare; searching for ID_java or ID_cpp shows this) or it is used by src/langapi/mode.h to get the relevant languaget mostly for output via src/langapi/language_util.h.

  3. Sounds good.

  4. I get the principle and the whole point of a modular framework is so that we can have lots of different executables, so that is good. I have a slight concern about how we handle mixed language goto-programs, which is a desirable goal for both Ada and Rust ( @danielsn ? ).

As far as I can see there are two ways of resolving this and https://github.com/diffblue/cbmc/issues/6223 .

Simple fix approach : inherit from json_symtab_languaget to create rust_languaget and ada_languaget which have id()s of ID_rust and ID_ada respectively and implement from_type() and from_expr() etc. symtab2gb then loads all of these language front-ends and possibly the C front-end as well for good measure.

More involved but fuller solution : split languaget into two, one part for front-end, parsing and type checking and one part for language-specific support and output. Implement just the second part for ID_rust and ID_ada. Load the language-specific support and output for all languages in all tools. The front-end part stays the same in all tools, including symtab2gb which only needs json_symtab support.

HTH

chrisr-diffblue commented 3 years ago

I think I agree with most of the points raised by @peterschrammel @TGWDB and @martin-cs - in particular, I do quite like the idea of splitting languaget into explicit "front-end/reporting" and "output/reporting/mid-end/back-end" - that I think enables json_symtab to remain a fairly focused, language-agnostic "intermediate representation" rather than having bits of ad-hoc language specific knowledge bolted on.

martin-cs commented 3 years ago

@TGWDB any feel for a time-frame on this one? If it is soon then I won't merge #6233

danielsn commented 3 years ago

@martin-cs I also like the idea of splitting into front-end and back-end parts.

martin-cs commented 3 years ago

@TGWDB sorry to hassle but you are assigned as the owner for this. I need to https://github.com/diffblue/cbmc/issues/6223 resolved ASAP. https://github.com/diffblue/cbmc/pull/6233 will do it but I really don't want to merge it because it is a work-around. If you are going to resolve this in the next 24-48 hours then I won't merge #6233 .

Also @chrisr-diffblue and @peterschrammel for general interest.

TGWDB commented 3 years ago

@martin-cs I cannot promise resolution in 24-48 hours so I propose you go ahead with #6233.

martin-cs commented 3 years ago

@TGWDB thanks for the fast response.