GrammaTech / gtirb-rewriting

Python API for rewriting GTIRB files
GNU General Public License v3.0

Adding support for 32-bit architectures #10

Open jkrshnmenon opened 9 months ago

jkrshnmenon commented 9 months ago

Hi,

I was looking into using gtirb-rewriting along with ddisasm on some 32-bit applications (x86 and ARM). I saw that ddisasm supports both of these architectures, but gtirb-rewriting does not.

I see that an ABI class exists that is intended for x86_32 architecture, but it doesn't seem to be used anywhere.

I wanted to ask how much effort you expect it would take to implement support for 32-bit x86 and ARM applications. If it is a reasonable amount, I'd like to give it a shot if I can get some guidance on what needs to be done.

Looking forward to hearing from you.

jranieri-grammatech commented 9 months ago

Thanks for the interest! That class is used for 32-bit PE support, but as you've noticed there is not a corresponding ELF implementation and no 32-bit ARM support at all.

Relatively speaking, adding a new ABI is straightforward:

For 32-bit ARM, there's a little bit more to do because we don't have any support for it yet:

Just a heads up, I'm currently inquiring internally about how to accept an outside contribution to this repository and it will probably require you to sign a CLA.

jkrshnmenon commented 9 months ago

Thank you for your response. Let me spend some time on this and see if I can get the 32-bit x86 ELF support ready first. I'll keep you posted on this thread about progress or issues.

jkrshnmenon commented 8 months ago

I've managed to implement the x86 32-bit ELF support, and all the test cases pass. I've also made some progress on the ARM end.

The only thing that I'm missing from your list is the part about updating gtirb_rewriting/assembler/_mc_utils.py. I could not find any documentation about the LLVM instructions and would appreciate it if you could point me in the right direction.

The code is available here

jranieri-grammatech commented 8 months ago

That code is used to determine if a call instruction has a known target or is indirect. You can use mc-asm's command line interface to print out what LLVM instruction names get used for a given assembly input:

Here's an example for x86-64:

$ echo "call direct; call rax; call qword ptr [rax]" | python3 -m mcasm --syntax=intel --target=x86_64-pc-linux --filter=emit_instruction -
⚡️ emit_instruction
├── state (ParserState)
│   └── loc (SourceLocation)
│       ├── lineno = 1
│       └── offset = 1
├── inst (Instruction)
│   ├── desc (InstructionDesc)
│   │   ├── implicit_uses (list)
│   │   │   ├── [0] (Register)
│   │   │   │   ├── id = 58
│   │   │   │   ├── is_physical_register = True
│   │   │   │   └── name = 'RSP'
│   │   │   └── [1] (Register)
│   │   │       ├── id = 66
│   │   │       ├── is_physical_register = True
│   │   │       └── name = 'SSP'
│   │   └── is_call = True
│   ├── name = 'CALL64pcrel32'
│   ├── opcode = 661
│   └── operands (list)
│       └── [0] (SymbolRefExpr)
│           ├── location (SourceLocation)
│           │   ├── lineno = 1
│           │   └── offset = 6
│           ├── symbol (Symbol)
│           │   └── name = 'direct'
│           └── variant_kind = SymbolRefExpr.VariantKind.None_
├── data = b'\xe8\x00\x00\x00\x00'
└── fixups (list)
    └── [0] (Fixup)
        ├── kind_info (FixupKindInfo)
        │   ├── bit_size = 32
        │   ├── is_pc_rel = 1
        │   └── name = 'reloc_branch_4byte_pcrel'
        ├── offset = 1
        └── value (BinaryExpr)
            ├── lhs (SymbolRefExpr)
            │   ├── location (SourceLocation)
            │   │   ├── lineno = 1
            │   │   └── offset = 6
            │   ├── symbol (Symbol)
            │   │   └── name = 'direct'
            │   └── variant_kind = SymbolRefExpr.VariantKind.None_
            ├── opcode = BinaryExpr.Opcode.Add
            └── rhs (ConstantExpr)
                └── value = -4

⚡️ emit_instruction
├── state (ParserState)
│   └── loc (SourceLocation)
│       ├── lineno = 1
│       └── offset = 14
├── inst (Instruction)
│   ├── desc (InstructionDesc)
│   │   ├── implicit_uses (list)
│   │   │   ├── [0] (Register)
│   │   │   │   ├── id = 58
│   │   │   │   ├── is_physical_register = True
│   │   │   │   └── name = 'RSP'
│   │   │   └── [1] (Register)
│   │   │       ├── id = 66
│   │   │       ├── is_physical_register = True
│   │   │       └── name = 'SSP'
│   │   └── is_call = True
│   ├── name = 'CALL64r'
│   ├── opcode = 662
│   └── operands (list)
│       └── [0] (Register)
│           ├── id = 49
│           ├── is_physical_register = True
│           └── name = 'RAX'
├── data = b'\xff\xd0'
└── fixups = []

⚡️ emit_instruction
├── state (ParserState)
│   └── loc (SourceLocation)
│       ├── lineno = 1
│       └── offset = 24
├── inst (Instruction)
│   ├── desc (InstructionDesc)
│   │   ├── implicit_uses (list)
│   │   │   ├── [0] (Register)
│   │   │   │   ├── id = 58
│   │   │   │   ├── is_physical_register = True
│   │   │   │   └── name = 'RSP'
│   │   │   └── [1] (Register)
│   │   │       ├── id = 66
│   │   │       ├── is_physical_register = True
│   │   │       └── name = 'SSP'
│   │   ├── is_call = True
│   │   └── may_load = True
│   ├── name = 'CALL64m'
│   ├── opcode = 659
│   └── operands (list)
│       ├── [0] (Register)
│       │   ├── id = 49
│       │   ├── is_physical_register = True
│       │   └── name = 'RAX'
│       ├── [1] = 1
│       ├── [2] (Register)
│       ├── [3] = 0
│       └── [4] (Register)
├── data = b'\xff\x10'
└── fixups = []

You can see how there are different LLVM instruction names despite it being the same assembly mnemonic. My hope is that 32-bit ARM also has different LLVM instruction names for direct calls versus indirect calls, but I'm only really familiar with 64-bit ARM.
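
To make the idea concrete, here is a minimal sketch of telling direct calls from indirect ones by LLVM instruction name. It is not the actual contents of gtirb_rewriting/assembler/_mc_utils.py; the helper names (`instruction_names`, `classify_call`) and the two sets are hypothetical. The only instruction names used are the three x86-64 ones that appear in the output above, so the names for a 32-bit target would still have to be read off mcasm's output for that target.

```python
import re

# LLVM instruction names copied from the x86-64 mcasm output above.
# These sets are illustrative; names for other targets are NOT covered.
DIRECT_CALLS = {"CALL64pcrel32"}
INDIRECT_CALLS = {"CALL64r", "CALL64m"}

# The instruction-level "name" line sits exactly one level below "inst"
# in the printed tree. Register and symbol names are nested deeper, so
# anchoring on the indentation skips them.
_INST_NAME = re.compile(r"^│   ├── name = '([^']+)'$", re.MULTILINE)


def instruction_names(mcasm_output):
    """Return the LLVM instruction names found in mcasm's tree output."""
    return _INST_NAME.findall(mcasm_output)


def classify_call(llvm_name):
    """Classify an LLVM instruction name as a direct or indirect call."""
    if llvm_name in DIRECT_CALLS:
        return "direct"
    if llvm_name in INDIRECT_CALLS:
        return "indirect"
    return "unknown"
```

Saving the mcasm output to a file and passing its text to `instruction_names` gives a quick list of names to sort into the two sets when bringing up a new target.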

jranieri-grammatech commented 8 months ago

Another thing I've noticed is that there'll probably need to be a change to mc-asm to expose the ISA-specific MCExprs used in fixups. For example, there's important data missing when parsing this assembly:

        MOVS r0, #:upper8_15:#foo
        LSLS r0, r0, #8
        ADDS r0, #:upper0_7:#foo
        LSLS r0, r0, #8
        ADDS r0, #:lower8_15:#foo
        LSLS r0, r0, #8
        ADDS r0, #:lower0_7:#foo

... but I'm not familiar enough with 32-bit ARM to know if these relocations are commonly used or not.
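
For readers unfamiliar with the sequence above, here is a small Python simulation of what it appears to compute. It assumes each specifier selects one byte of the 32-bit value of `foo` (`:upper8_15:` is bits 24-31, `:upper0_7:` is bits 16-23, `:lower8_15:` is bits 8-15, `:lower0_7:` is bits 0-7). The function name is hypothetical, and this only models the arithmetic, not how mc-asm or LLVM represents the fixups.

```python
def build_address_bytewise(foo):
    """Simulate the MOVS/LSLS/ADDS sequence for a 32-bit symbol value."""
    mask = 0xFFFFFFFF  # r0 is a 32-bit register

    # Assumed meaning of the four relocation specifiers: one byte each.
    upper8_15 = (foo >> 24) & 0xFF
    upper0_7 = (foo >> 16) & 0xFF
    lower8_15 = (foo >> 8) & 0xFF
    lower0_7 = foo & 0xFF

    r0 = upper8_15               # MOVS r0, #:upper8_15:#foo
    r0 = (r0 << 8) & mask        # LSLS r0, r0, #8
    r0 = (r0 + upper0_7) & mask  # ADDS r0, #:upper0_7:#foo
    r0 = (r0 << 8) & mask        # LSLS r0, r0, #8
    r0 = (r0 + lower8_15) & mask # ADDS r0, #:lower8_15:#foo
    r0 = (r0 << 8) & mask        # LSLS r0, r0, #8
    r0 = (r0 + lower0_7) & mask  # ADDS r0, #:lower0_7:#foo
    return r0
```

Under that assumption the sequence rebuilds the full address one byte at a time, which is why losing the specifier on each fixup loses the information about which byte of `foo` the immediate refers to.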

adrianherrera commented 2 months ago

Hello! Was looking at using GTIRB-rewriting on some 32-bit binaries and stumbled across this thread.

Reading through this thread, it seems that ARM32 is not 100% implemented. But it seems like x86 is? If so, could we please merge in the x86 support? That would be grand!

jranieri-grammatech commented 1 month ago

@jkrshnmenon, is there any update on this? I can dig up a CLA for you to sign to get at least the 32-bit x86 support merged if you think that's ready.

jkrshnmenon commented 1 month ago

@jranieri-grammatech Apologies for the lack of communication here. I think the 32-bit x86 support is ready to be merged. I can try running more tests some time soon, but unfortunately I'm a bit busy until the end of July. I can sign the CLA any time though.