Should we write the C front-end as a "standard" BAP lifter?

codyroux commented 3 years ago

Currently we write the C front end as a pattern matching over the C syntax, which directly writes programmatic semantics for each construct (expressions, sequences, calls, gotos...).

But this process is very similar to the "stardard" uplifters for various binary formats, which dispatch over opcodes, calling specific primus lisp functions by name, drawn from a foo.lisp file.

It sounds like a viable way forward, enabling us to use all the Primus Lisp goodies, would be to rewrite the front-end in this style, and have a c.lisp file with the semantics of C constructs.

This issue is dedicated to discussing the pros and cons, and the implementation details of this approach.

codyroux commented 3 years ago

Here are two examples of this approach:

https://github.com/BinaryAnalysisPlatform/bap/tree/master/plugins/arm

and

https://github.com/BinaryAnalysisPlatform/bap/tree/master/plugins/ghidra

I'm not sure how the dispatch itself works though: who decides which lisp opcode is being called? Is this written in C++?

ivg commented 3 years ago

Better links would be to the lisp files, e.g., aarch64.lisp and pcode.lisp, which demonstrate how semantics could be described. The translation is done rather simply, when we enable the Primus Lifter, e.g.,

  let translate_ops_to_args () =
    KB.Rule.(begin
        declare "translate-ops-to-lisp-args" |>
        require Insn.slot |>
        provide Lisp.Semantics.args |>
        comment "translates instruction operands to lisp arguments"
      end);
    KB.promise Lisp.Semantics.args @@ fun this ->
    KB.collect Insn.slot this >>=? fun insn ->
    Theory.instance () >>= Theory.require >>= fun theory ->
    Theory.Label.target this >>= fun target ->
    args_of_ops theory target insn >>| Option.some

  let translate_opcode_to_name () =
    KB.Rule.(begin
        declare "translate-opcode-to-lisp-name" |>
        require Insn.slot |>
        provide Lisp.Semantics.name |>
        comment "translates opcode to the lisp definition name"
      end);
    KB.promise Lisp.Semantics.name @@ fun this ->
    KB.collect Insn.slot this >>|? fun insn ->
    let name = KB.Name.create ~package:(Insn.encoding insn) (Insn.name insn) in
    Some name

Where arg_of_ops translates instruction opcodes to semantic values,

  let args_of_ops (module CT : Theory.Core) target insn =
    let bits = Theory.Target.bits target in
    let word = Theory.Bitv.define bits in
    let bitv x = Bitvec.(int64 x mod modulus bits) in
    Insn.ops insn |> Array.to_list |>
    List.map ~f:(function
        | Op.Fmm _ -> CT.unk word >>| Theory.Value.forget
        | Op.Imm x ->
          let x = bitv (Imm.to_int64 x) in
          CT.int word x >>| Theory.Value.forget >>| fun v ->
          KB.Value.put Lisp.Semantics.static v (Some x)
        | Op.Reg v ->
          let name = Reg.name v in
          let reg = match Theory.Target.var target name with
            | None -> Theory.Var.forget @@ Theory.Var.define word name
            | Some v -> v in
          CT.var reg >>| fun v ->
          KB.Value.put Lisp.Semantics.symbol v (Some name)) |>
    KB.List.all

This is how we go from the value of type Bap.Std.Disasm_expert.Basic.Insn.t, which is a machine-specific instruction representation, to the machine-independent Bap.Std.Insn.t, which encodes the semantics of that instruction. It works because the Primus Lisp lifter stores a promise that for a given name and a list of arguments provides the semantics, more specifically,

$ bap list rules | grep primus-lisp-semantics -A8
primus-lisp-semantics ::=
  bap:primus-lisp-program
  core:unit-target
  core:unit-source
  core:label-unit
  bap:lisp-args
  bap:lisp-name
  -----------------------
  core:semantics

Obviously, if we employ the same approach for the FrontC lifter we will have a C program as the input, e.g., encoded with Cabs.body. To lift the C program using Primus Lisp lifter we need to traverse the C AST and for each constructor create a program object, set its Lisp name property to the name of the constructor (with the c package). Then for each argument of the constructor, we shall recurse and get their semantics and then pass it as arguments. (Yes, no promises involved, we will recurse over the existing AST).

Lets make the discussion less abstract and suggest some simple C program, e.g.,

WHILE (VARIABLE "halt", COMPUTATION (UNARY (POSINCR, VARIABLE "counter")))

which corresponds to the following C program,

while (halt) counter++;

Assuming that we have semantics for C defined in Lisp, e.g.,

(defun c:while (cnd body) 
   (while cnd body))

(defun c:posincr (var)
  (set$ var (+ var 1))

Then our frontend will look something like this,

let rec stmt = function
  | WHILE (cnd,body) -> 
    let* cnd = exp cnd in
    let* body = stmt body in
    eval "while" [cnd; body]
  | COMPUTATION e -> exp e
  | _ -> failwith "unimplemented"
and exp = function
  | UNARY (POSINCR, x) -> 
    let* x = exp x in
    eval "posincr" [x]
  | VARIABLE v -> var v
  | _ -> failwith "unimplemented"

where eval is something like this,

let eval name args = 
  let* lbl = Theory.Label.fresh in
  let name = KB.Name.create ~package:"c" name in
  let* () = KB.provide Lisp.Semantics.name lbl (Some name) in
  let* () = KB.provide Lisp.Semantics.args lbl (Some args) in 
  KB.collect Theory.Semantics.slot

Et voila!)

Well, it is voila for starters only, I would envision that we will need to do something with scopes and variables, but since they exist on the AST level, we should be able to handle them without extra hassle. I think you already have some code for that which most likely could be reused.

So thoughts?

draperlaboratory / VIBES

Should we write the C front-end as a "standard" BAP lifter? #95