Gtirb to ir - Githubissues

Megatomato commented 9 months ago

Provides a new frontend for the BASIL tool, taking dissasembly data from ddisasm and GTIRB instead of BAP.

To use, just use a .gts file as the argument to the tool, instead of the .adt, and it spits out a .bpl file like normal. It seeks to replicate the old interface as much as it can.

You can switch between the two interfaces by commenting the GTIRB LOGIC code and everything beneath it in the LoadBap function in RunUtils, and uncommenting all the code underneath BAP LOGIC. Also, be sure to change the type of loadBap from Program to BAPProgram, as the new interface directly outputs an IR program in one translation parse.

Broadly, it is about as good as the old interface, with the notable exception that GTIRB lacks function parameters and returns, so those are hardcoded as of now. Alistair has also lifted every example in the tests folder to a .gts.

ailrst commented 9 months ago

Note we have to drop https://github.com/UQ-PAC/bil-to-boogie-translator/pull/161/commits/5c913daefed173680818e27339ee94cd9eae5c49 before merging

katrinafyi commented 9 months ago

Hello, work has been done to prepare tooling for the GTIRB pipeline. Specifically, we need ddisasm to convert binaries to .gtirb, then gtirb-semantics processes this file to add ASLP semantics and produces the .gts file for Basil.

Since setting up these tools is a bit inconvenient, they have been packaged using the distribution-agnostic Nix package manager. The steps to set this up are documented here, but briefly:

Install the Nix package manager https://nixos.org/download (use multi-user if possible).

Run this command to set some settings:

printf '%s\n' "extra-experimental-features = nix-command flakes" "extra-trusted-users = $USER" | sudo tee -a /etc/nix/nix.conf

Install ddisasm. This should give you a ddisasm on your path. If asked, accept options to use the cache for a much faster build.
```
nix profile install github:katrinafyi/pac-nix#ddisasm
```
Install gtirb-semantics. This will give you gtirb-semantics-nix and extra tools. proto-json.py can convert a .gtirb or .gts file to readable JSON, and debug-gts.py takes a .gts file and annotates each line of semantics with the ARM instruction it comes from.
```
nix profile install github:katrinafyi/pac-nix#gtirb-semantics
```
gtirb-semantics-nix takes two arguments: input and output file. At the moment, this is a wrapper to call the underlying gtirb_semantics binary with some ASLP arguments. This is good to know in case the error messages are too confusing.
Everything should just work now.

Let me know if there are any bumps. This has been tested most on x86-64 Linuxes but might work elsewhere as well.

l-kent commented 9 months ago

@ailrst What's going on with that commit?

I don't quite follow what all the new filetypes are? ddisasm produces a .gtirb file from a specified binary, but what creates the .gts file, is it gtirb-semantics-nix? If the .gtirb file is just a predecessor for the .gts file and not needed as direct input to BASIL, nor is it conveniently human-readable, then we shouldn't have them in the repo.

Are the .ast.json files the most human readable output we get from the gtirb/ddisasm pipeline? That seems like it will be frustrating.

I don't understand why there are all these changes to the existing examples?

There seems to be a lot of unnecessary files committed to the repo in the examples folder that it would be good to clean up (although @Megatomato may want them locally). For a start, .ir or .bapir files should be removed, we don't need those in the repo.

I'm going to make it so that the .gts input is a separate option to the .adt input, rather than replacing it entirely.

ailrst commented 9 months ago

@ailrst What's going on with that commit?

Those couple of commits were just for testing and I intended to clean them up before merging.

The gtirb is just the gts without the semantics embedded (which is done by gtirb-semantics-nix) so we don't need to keep the gtirb around. I think Walter lifted a lot of the examples again to be able to compare the adt and gts semantics from binaries output by the same compiler version.

Are the .ast.json files the most human readable output we get from the gtirb/ddisasm pipeline? That seems like it will be frustrating.

Those only contain the instruction semantics, @katrinafyi has a script that prints the gtirb with more info, maybe this one https://github.com/UQ-PAC/gtirb-semantics/blob/main/scripts/proto-json.py ?

l-kent commented 9 months ago

Ah so those .ast.json files aren't the same as what that script produces? I had assumed those were the output of it. It might be good to produce the output of that script for all the tests then?

If you can clean up whatever needs to be done with those commits, that would be helpful.

Megatomato commented 9 months ago

I cleaned up the .gts files in the examples folder, since i don't think theres a need for the files to be both in the examples and the test folder. I also removed the .gtirb files from tests. If we want the better .ast.json, we can run the script over all the files in the tests folder.

katrinafyi commented 9 months ago

The debug-gts.py script in gtirb-semantics has been updated to include more information. Hopefully, this should be comparable in usefulness to BAP's BIR file. This can be downloaded and run as a standalone file, given the gtirb python module is installed. The script has messages to help with that.

Example output, showing an entry block in a function with two blocks.

    "hNC2Qf0TSqWyBaDfnnjspQ==": {
      "name": "_fini [entry] [1/2]",
      "address": 4264028,
      "code": [
        {
          "nop": []
        },
        {
          "stp x29, x30, [sp, #-16]!": [
            "Stmt_ConstDecl(Type_Bits(Expr_LitInt(\"64\")),Cse0__5,Expr_TApply(add_bits.0,[(Expr_LitInt(\"64\"))],[(Expr_Slices(Expr_Var(SP_EL0),[(Slice_LoWd(Expr_LitInt(\"0\"),Expr_LitInt(\"64\")))]));(Expr_LitBits(\"1111111111111111111111111111111111111111111111111111111111110000\"))]))",
            "Stmt_TCall(Mem.set.0,[(Expr_LitInt(\"8\"))],[(Expr_Var(Cse0__5));(Expr_LitInt(\"8\"));(Expr_LitInt(\"0\"));(Expr_Array(Expr_Var(_R),Expr_LitInt(\"29\")))])",
            "Stmt_TCall(Mem.set.0,[(Expr_LitInt(\"8\"))],[(Expr_TApply(add_bits.0,[(Expr_LitInt(\"64\"))],[(Expr_Var(Cse0__5));(Expr_LitBits(\"0000000000000000000000000000000000000000000000000000000000001000\"))]));(Expr_LitInt(\"8\"));(Expr_LitInt(\"0\"));(Expr_Array(Expr_Var(_R),Expr_LitInt(\"30\")))])",
            "Stmt_Assign(LExpr_Var(SP_EL0),Expr_Var(Cse0__5))"
          ]
        },
        {
          "mov x29, sp": [
            "Stmt_Assign(LExpr_Array(LExpr_Var(_R),Expr_LitInt(\"29\")),Expr_Slices(Expr_Var(SP_EL0),[(Slice_LoWd(Expr_LitInt(\"0\"),Expr_LitInt(\"64\")))]))"
          ]
        }
      ],
      "successors": {
        "qdCGKejJRhOhD1irsaaDvA== / _fini [2/2]": "Edge.Label(type=Edge.Type.Fallthrough, conditional=False, direct=True, )"
      }
    },

l-kent commented 9 months ago

Hopefully, this should be comparable in usefulness to BAP's BIR file.

It seems more comparable to the ADT files - possible for a human to read but it's more effort than is ideal unless you need to really dig into the details that the BIR doesn't expose. A pretty-printed version of the semantics would be extremely helpful.

l-kent commented 9 months ago

I've made it so that the BAP and GTIRB inputs are both enabled - which one is used now depends on whether a .gts or .adt file is passed as the -i / --input argument, replacing the old -a / --adt argument.

l-kent commented 9 months ago

https://github.com/UQ-PAC/bil-to-boogie-translator/blob/0ac0e7ea25daf13c0375d72ca96d00308d7d806e/src/main/scala/translating/GTIRBToIR.scala#L504

What does this comment refer to? What does the disassembler do incorrectly? What would it look like if this wasn't accounted for?

jumptable3 is an example that BAP doesn't lift properly (there is a jump table that BAP fails to detect is code and does not include) so I am very curious about any issues you've run into there. GTIRB seems to handle it better than BAP does at the very least.

katrinafyi commented 9 months ago

@l-kent:

It seems more comparable to the ADT files. [...] A pretty-printed version of the semantics would be extremely helpful.

Without parsing and re-printing the semantics (which would need Ocaml's libASL or repurposing Walter's ANTLR grammar), the best we can have is something in the style of format_adt.py:

Stmt_If(
  Expr_TApply(
    eq_bits.0,
    [(Expr_LitInt("1"))],
    [ (Expr_Field(Expr_Var(PSTATE), Z));
      (Expr_LitBits("1"))]),
  [ Stmt_Assign(
      LExpr_Array(LExpr_Var(_R), Expr_LitInt("2")),
      Expr_LitBits("0000000000000000000000000000000000000000000000000000000000000000"))],
  [],
  (else
    Stmt_Assign(
      LExpr_Array(LExpr_Var(_R), Expr_LitInt("2")),
      Expr_LitBits("0000000000000000000000000000000000000000000000000000000000000001"))))

Megatomato commented 9 months ago

What does this comment refer to? What does the disassembler do incorrectly? What would it look like if this wasn't accounted for?

Say we have a switch statement like:

int get_two() {
   return 2 
   }
int main() {

  switch (argc) {
    case 0:
       get_two(); 
       break;
    case 1: 
      get_two(); 
      break;
    default: 
      return 1
  } 
}

Inside the GTIRB CFG, it assigns multiple return edges to the function get_two(). As in, it will say that the function get_two() has a return edge to both this block:

case 0:
       get_two(); 
       break;

and the block containing case 1. Interestingly, it also does the same with the block containing the switch statement. i.e. switch (argc) { will have multiple branch edges going to the block(s) containing case 0, case 1, default etc.

Since a function does only return to one place, I assumed that this is an overapproximation on the dissasemblers part. It didn't really do anything wrong, it was just being safe. So in my code i replace both the "Multi-Return Edge" and the "Multi-Branch Edge" case with an indirect call, but gtirb still maintains references to the blocks where this indirect call goes (or could possibly go).

Hope this makes sense!

ailrst commented 9 months ago

@l-kent the fixed history is in https://github.com/UQ-PAC/bil-to-boogie-translator/tree/walter-gtirb-rebased it needs to be force pushed to this branch to properly remove all the stuff from the history so it doesn't bloat the repo too much, I'll leave that for you to do because I don't want to break your local copy if you have unpushed changes

l-kent commented 9 months ago

@katrinafyi

Without parsing and re-printing the semantics (which would need Ocaml's libASL or repurposing Walter's ANTLR grammar)

Parsing & re-printing is probably a good idea. I'll make a ticket for it and will do it at some point if no one else is up for it.

@Megatomato

Since a function does only return to one place

If a subroutine is called in multiple places, it can return to multiple places? A more explicit breakdown at the level of the GTIRB rather than the level of the C would help me to understand what it's doing.

So in my code i replace both the "Multi-Return Edge" and the "Multi-Branch Edge" case with an indirect call

This sounds like DDisasm is resolving the indirect call (in this case a jump to the jump table) to some extent already? I want to understand how well it does this.

@ailrst Thank you!

l-kent commented 9 months ago

@ailrst I have no unpushed changes right now. @Megatomato is it an issue for you if we force push the branch right now?

Megatomato commented 9 months ago

@l-kent no problem, i don't have any local changes

Megatomato commented 9 months ago

@l-kent

If a subroutine is called in multiple places, it can return to multiple places?

You're right. For some reason, it had never come up before this switch statement, so i assumed it was something special with it. But after testing some examples, this is just something gtrib does.

This sounds like DDisasm is resolving the indirect call (in this case a jump to the jump table) to some extent already? I want to understand how well it does this.

I took a look at their paper, it seems like they do resolve indirect calls by doing a Intra-Procedural analysis. They have some Datalog rules that save the value inside of a register, and if said register is used for an Indirect call, they backpropagate until they can resolve the address. This is why it works for jumptable3, but not something like the function pointer example since they do not follow calls. They explain this in section 5.1/5.2 in their paper. DataLogDisassemply_arXiv_2020.pdf

l-kent commented 9 months ago

Thanks, that resolution of indirect jumps seems useful. Can you make it so that those jumps are added back to the IR?

Megatomato commented 9 months ago

No problem. Do you want me to model them as a GoTo to all possible blocks? Ideally we would have a condition to jump to each block, like we do in an if statement, but I can't see what the condition would be on a gtirb level since in the semantics it's just an indirect call to R0.

l-kent commented 9 months ago

Yeah, a GoTo to all possible blocks is fine to start with. The correct condition for each branch would be that R0 (or whichever register is being called) equals the address of that block, but that isn't as important to start with.

Megatomato commented 9 months ago

We have a small problem. The .gts files need to be created at the same time as the .relf, otherwise any example that uses the global offset table in address calculation will have different address for functions/blocks, and will then fail on any assert checking memory.

Even if we update all of the relf's, this would now mean that the .adt files (which were created at the same time as the old relfs) would now be out of date compared to the new relfs and we would need to update those as well.

However, this would make the diff extremely large (as it was before the force push). Another solution to this problem is to have everybody lift the .GTS files locally, which would then update the relfs. Thoughts?

l-kent commented 9 months ago

I am already working on a solution for that, it's not much of an issue.

l-kent commented 9 months ago

Is there anything else we can use as the block labels apart from these base64 strings? It does not seem ideal to use those given that characters incompatible with Boogie appear in them, meaning those characters need to be removed or replaced, which could potentially cause issues.

l-kent commented 9 months ago

Do those base64 strings always end in "=="?

Megatomato commented 9 months ago

Base64 was the first encoding I had found that allowed me to do a round trip between gtirb's internal 16 byte ByteString and some regular string that I could print without some trouble. I had tried UTF-8 before, but it just gave me garbage, and wasn't able to do a round trip. Unless we can do another encoding, it might be a bit difficult.

Do those base64 strings always end in "=="?

It seems so, i haven't seen any that don't so far.

Megatomato commented 9 months ago

Java does seem to have a UUID class that is 16 bytes long, which looks somewhat promising...

l-kent commented 9 months ago

I've fixed the issue by making the incompatible characters in the Base64 strings be replaced uniquely, so my main concern now is just readability - the base64 labels are not easy to quickly distinguish.

l-kent commented 9 months ago

@katrinafyi When trying to run the debug_gts.py script from gtirb-semantics installed via nix, I get an error message because the python module 'typing_extensions', which the gtirb python module requires, is not installed. The error message that the script itself provides was actively unhelpful.

katrinafyi commented 9 months ago

@l-kent:

[...]

Thanks for your feedback, this has been fixed and the error message has been upgraded. edit: see https://github.com/katrinafyi/pac-nix to upgrade packages.

l-kent commented 9 months ago

@katrinafyi That command didn't work to install the new version - nix complained about conflicting files with the old version, and then told me to uninstall the old version and provided a command to do so that didn't work. I managed to figure out how to uninstall it despite that and now it all works after reinstalling it. Thank you!

katrinafyi commented 9 months ago

Sorry about that, I was too hasty. The Nix command line is indeed awkward.

l-kent commented 9 months ago

The other issue with the base64 labels as they currently exist is that they don't always match up with what is in the .json files produced by debug-gts.py, which makes using those files very difficult. This is partially due to the need to substitute characters, but there appears to be other issues beyond that - what are they doing differently?. We want a way of clearly identifying correspondence between the input and output. There can sometimes be an awful lot of completely empty blocks generated too - why do these exist?

In the test jumptable3/gcc, I don't follow where all these extra edges actually come from (the ones you thought were an overapproximation and previously replaced with an indirect call) as they don't appear to correspond to anything in the .json file? There aren't actually any indirect calls in jumptable3/gcc, I was thinking of jumptable3/clang which does have an indirect jump that DDisasm doesn't seem to resolve any better than BAP, so apologies for the confusion there.

Megatomato commented 9 months ago

The other issue with the base64 labels as they currently exist is that they don't always match up with what is in the .json files produced by debug-gts.py, which makes using those files very difficult. This is partially due to the need to substitute characters, but there appears to be other issues beyond that - what are they doing differently?.

Sometimes blocks don't match up because they have been split. For example, If there is a chained if statement in the middle of a block, or a write to the program counter(that hasn't been recognised as a jump). The block needs to be split up in order to make the write to the PC the new jump. This creates a new UUID, which isn't present in the json. I do agree that working with Base64 is quite annoying though. This maybe a better option would be to create new UUIDs during the creation of the .gts file, and propagate a map from new uuids -> internal gtirb bytestrings.

There can sometimes be an awful lot of completely empty blocks generated too - why do these exist?

This I have no idea. I'll look in their papers for an explanation, but some blocks just seem to be randomly empty.

In the test jumptable3/gcc, I don't follow where all these extra edges actually come from (the ones you thought were an overapproximation and previously replaced with an indirect call) as they don't appear to correspond to anything in the .json file? There aren't actually any indirect calls in jumptable3/gcc, I was thinking of jumptable3/clang which does have an indirect jump that DDisasm doesn't seem to resolve any better than BAP, so apologies for the confusion there.

This is interesting. When I go back to the commit before you made the new .gts files, this is what i get for jumptable3/gcc.

Jumptable3/clang still doesn't have any resolution, even in the old version. This seems to imply that this resolution might be a lot more unreliable than we thought. I'll try to look into this, but it just might be a ddisasm thing.

l-kent commented 9 months ago

This creates a new UUID, which isn't present in the json. Rather than creating a completely new UUID, it would be best to just give it the same label as its originator block, just with $_n (for the nth additional block created from the original block) after it, so we can maintain the correspondence.

When I go back to the commit before you made the new .gts files, this is what i get for jumptable3/gcc.

Ah, that must be because that was compiled with a different version of gcc, which produces different behaviour here. It is interesting that DDisasm can resolve an indirect call in some circumstances though. @ailrst what version of gcc did you use?

The problem I was having seems to just be that something was up with my repo and somehow the previous version of jumptable3/gcc was getting used in the tests? I don't understand how that happened but I'm no longer confused about the output from anything.

ailrst commented 9 months ago

aarch64-linux-gnu-gcc (Debian 13.2.0-2) 13.2.0

l-kent commented 9 months ago

I have 11.3.0, which explains the difference.

Megatomato commented 9 months ago

Rather than creating a completely new UUID, it would be best to just give it the same label as its originator block, just with $_n (for the nth additional block created from the original block) after it, so we can maintain the correspondence.

This was my idea initally as well, and I wish we could, but then it messes up the Decoding/Encoding from Base64. The main reason why these UUIDs are so annoying is because the jumps are filtered using GTIRBs Edge class, which takes two ByteStrings (one source and one target). Thus, when new blocks are created, edges need to be added, and these edges require the decoded base64 string.

l-kent commented 9 months ago

That seems like a problem that is possible to solve though, I'll work on that at some point.

ailrst commented 8 months ago

I have a pr to add more information to the gtirb output https://github.com/UQ-PAC/gtirb-semantics/pull/6

It now stores the label next to the block (if exists), and the disassembly and address next to the instructions.

This makes the semantics located at block["instructions"][i]["semantics"].

Megatomato commented 8 months ago

@l-kent Is there anything else we need to add/ review on this pr? I'm asking since it's my last day, so if you want any changes with the base gtirb functionality, today would probably be better.

Thank you.

l-kent commented 8 months ago

I don't need anything else from you, thanks.

l-kent commented 8 months ago

I'm still working on cleaning it up to get it to parse all of CNTLM, and solving the issue with correspondence between the UUIDs for split blocks, but that's almost done now.

l-kent commented 8 months ago

I'm ready to merge this. I've cleaned up some of the big cntlm files from this branch's history (just the ones committed in this branch).

@ailrst if you want to have a quick look before I merge feel free

UQ-PAC / BASIL

Gtirb to ir #161